Re: Creating Consistent Categories

newsas007 · Posted 08-23-2021 11:59 AM

Hi,

I have a dataset with categories in different order. I want to create a consistency in the order as [(i)cyt, (ii) end therapy, (iii) targ]. Also delete the (i),(ii), and (iii).

Different versions are [(ii)cyt, (iii) targ, (ii) end therapy], [(iii)end therapy, (i) cyt, (iii) targ], etc. Please see the data below. Thanks

Data Have:

ID	Categories
1	(i)cyt, (ii) end therapy, (iii) targ
2	(ii)cyt, (iii) targ, (ii) end therapy
3	(iii)end therapy, (i) cyt, (iii) targ
4	iii) targ, (i) cyt, (ii) end therapy
5	(iii)end therapy, (i) cyt, (iii) targ
6	iii) targ, (i) cyt, (ii) end therapy
7	(i)cyt, (ii) end therapy, (iii) targ
8	(ii)cyt, (iii) targ, (ii) end therapy
9	(ii)cyt
10	(iii) targ
11	(ii) end therapy

Data Want:

ID	Categories	Clean_Category
1	(i)cyt, (ii) end therapy, (iii) targ	Cyt, End Therapy, Targ
2	(ii)cyt, (iii) targ, (ii) end therapy	Cyt, End Therapy, Targ
3	(iii)end therapy, (i) cyt, (iii) targ	Cyt, End Therapy, Targ
4	iii) targ, (i) cyt, (ii) end therapy	Cyt, End Therapy, Targ
5	(iii)end therapy, (i) cyt, (iii) targ	Cyt, End Therapy, Targ
6	iii) targ, (i) cyt, (ii) end therapy	Cyt, End Therapy, Targ
7	(i)cyt, (ii) end therapy, (iii) targ	Cyt, End Therapy, Targ
8	(ii)cyt, (iii) targ, (ii) end therapy	Cyt, End Therapy, Targ
9	(ii)cyt	Cyt
10	(iii) targ	Targ
11	(ii) end therapy	End Therapy

data have;
   infile datalines dlm=':';
   input ID Categories :$100.;
 datalines;
1:(i)cyt,(ii)end therapy,(iii)targ
2:(ii)cyt,(iii)targ,(ii)end therapy
3:(iii)end therapy,(i)cyt,(iii)targ
4:iii)targ,(i)cyt,(ii)end therapy
5:(iii)end therapy,(i)cyt,(iii)targ
6:iii)targ,(i)cyt,(ii)end therapy
7:(i)cyt,(ii)end therapy,(iii)targ
8:(ii)cyt,(iii)targ,(ii)end therapy
9:(ii)cyt
10:(iii)targ
11:(ii)end therapy
;
Run;

ballardw · Posted 08-23-2021 12:21 PM

@newsas007 wrote:

Hi,

I have a dataset with categories in different order. I want to create a consistency in the order as [(i)cyt, (ii) end therapy, (iii) targ]. Also delete the (i),(ii), and (iii).

Different versions are [(ii)cyt, (iii) targ, (ii) end therapy], [(iii)end therapy, (i) cyt, (iii) targ], etc. Please see the data below. Thanks

Data Have:

ID Categories

1 (i)cyt, (ii) end therapy, (iii) targ

2 (ii)cyt, (iii) targ, (ii) end therapy

3 (iii)end therapy, (i) cyt, (iii) targ

4 iii) targ, (i) cyt, (ii) end therapy

5 (iii)end therapy, (i) cyt, (iii) targ

6 iii) targ, (i) cyt, (ii) end therapy

7 (i)cyt, (ii) end therapy, (iii) targ

8 (ii)cyt, (iii) targ, (ii) end therapy

9 (ii)cyt

10 (iii) targ

11 (ii) end therapy

Data Want:

ID Categories Clean_Category

1 (i)cyt, (ii) end therapy, (iii) targ Cyt, End Therapy, Targ

2 (ii)cyt, (iii) targ, (ii) end therapy Cyt, End Therapy, Targ

3 (iii)end therapy, (i) cyt, (iii) targ Cyt, End Therapy, Targ

4 iii) targ, (i) cyt, (ii) end therapy Cyt, End Therapy, Targ

5 (iii)end therapy, (i) cyt, (iii) targ Cyt, End Therapy, Targ

6 iii) targ, (i) cyt, (ii) end therapy Cyt, End Therapy, Targ

7 (i)cyt, (ii) end therapy, (iii) targ Cyt, End Therapy, Targ

8 (ii)cyt, (iii) targ, (ii) end therapy Cyt, End Therapy, Targ

9 (ii)cyt Cyt

10 (iii) targ Targ

11 (ii) end therapy End Therapy

I would start with something like this (untested code a I don't have a data set to work with)

data want;
   set have;
   length Clean_category $ 30;
   if index(categories,'cyt')>0 then Clean_category='CYT';
   if index(categories,'end therapy')>0 then Clean_category=catx(', ',Clean_category,'End Therapy');
   if index(categories,'targ')>0 then Clean_category=catx(', ',Clean_category,'Targ');
run;

newsas007 · Posted 08-23-2021 01:12 PM

Hi @ballardw

Thanks, I have added a code for the dataset.

ballardw · Posted 08-23-2021 01:39 PM

@newsas007 wrote:

Hi @ballardw

Thanks, I have added a code for the dataset.

My example code works with the data from that data set you provided.

PhilC · Posted 08-23-2021 03:48 PM

This turns out to be an educational problem for myself. There is a format for converting numbers to roman numerals, there is no such informat, switching back. Without a way to convert roman numerals to numbers, how will the categories be sorted? I haven't given up, but I did not read that this was mentioned, so now its mentioned.

but cty goes to (i) and targ to (iii) always, even if the numeral is wrong, so ok that means it doesn't matter.

Still, I think I should know how to read in roman numerals from a text file.

ballardw · Posted 08-23-2021 05:08 PM

@PhilC wrote:

This turns out to be an educational problem for myself. There is a format for converting numbers to roman numerals, there is no such informat, switching back. Without a way to convert roman numerals to numbers, how will the categories be sorted? I haven't given up, but I did not read that this was mentioned, so now its mentioned.

but cty goes to (i) and targ to (iii) always, even if the numeral is wrong, so ok that means it doesn't matter.

Still, I think I should know how to read in roman numerals from a text file.

What numbers? Your example data set does not contain any numbers.

Are you saying that your example data is not actually what your data looks like? If that is case maybe time to use the Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the </> icon or attached as text to show exactly what you have and that we can test code against.

Does the code I provided not provide the output requested for the given input example data set?

What sorting, you did not state there was any sorting, your example says that the values were in a desired order if present. So the Roman numeral portion is completely dropped.

From your original post: "in the order as [(i)cyt, (ii) end therapy, (iii) targ]. Also delete the (i),(ii), and (iii)."

If something is supposed to be a different order you need to provide input examples and the output.

If you are worried about the Roman numerals then you have to tell us how big of a Roman numeral is involved and exactly how the numeric value is to be used in the process because this is the first place you mention that has any impact.

PhilC · Posted 08-23-2021 08:21 PM

No, I'm just saying I tried to look and solve this problem, before reading
your discourse. I noticed the Roman numerals. At the time I was expecting
to have to sort the items based on the Roman numerals...

ballardw · Posted 08-24-2021 02:31 AM

@PhilC wrote:
No, I'm just saying I tried to look and solve this problem, before reading
your discourse. I noticed the Roman numerals. At the time I was expecting
to have to sort the items based on the Roman numerals...

Sorry, I thought that was the OP changing things on us.

You are describing an application of what my first statistics prof called "Juolas Metatherom" which states "You will seldom misunderstand a problem so as to make it easier." The professor's name was Dr. Juola ...

Kurt_Bremser · Posted 08-24-2021 04:50 AM

You can make use of the fact that your categories sort alphabetically:

data have;
   infile datalines dlm=':';
   input ID Categories :$100.;
 datalines;
1:(i)cyt,(ii)end therapy,(iii)targ
2:(ii)cyt,(iii)targ,(ii)end therapy
3:(iii)end therapy,(i)cyt,(iii)targ
4:iii)targ,(i)cyt,(ii)end therapy
5:(iii)end therapy,(i)cyt,(iii)targ
6:iii)targ,(i)cyt,(ii)end therapy
7:(i)cyt,(ii)end therapy,(iii)targ
8:(ii)cyt,(iii)targ,(ii)end therapy
9:(ii)cyt
10:(iii)targ
11:(ii)end therapy
;

data want1;
set have;
do i = 1 to countw(categories,",");
  category = strip(scan(scan(categories,i,","),2,")"));
  output;
end;
keep id category;
run;

proc sort data=want1;
by id category;
run;

data want2;
set want1;
by id;
retain categories;
length categories $100;
if first.id then categories = "";
categories = catx(",",categories,category);
if last.id;
keep id categories;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

PhilC · Posted 08-24-2021 09:25 AM

before getting distracted by the roman numerals, I intended to make use of the SORT FUNCTION since I don't get to use it very much. Anyway, following my own "Juolas" Metatheory I did not want to assume the categories would be alphabetical. BUT maybe we can, and we can use SORT.:

data have;
   infile datalines dlm=':';
   input ID Categories :$100.;
 datalines;
1:(i)cyt,(ii)end therapy,(iii)targ
2:(ii)cyt,(iii)targ,(ii)end therapy
3:(iii)end therapy,(i)cyt,(iii)targ
4:iii)targ,(i)cyt,(ii)end therapy
5:(iii)end therapy,(i)cyt,(iii)targ
6:iii)targ,(i)cyt,(ii)end therapy
7:(i)cyt,(ii)end therapy,(iii)targ
8:(ii)cyt,(iii)targ,(ii)end therapy
9:(ii)cyt
10:(iii)targ
11:(ii)end therapy
;

data want1;
  set have;
  length cat1-cat3 $20;
    array cat[3] $ cat1-cat3;
  do i = 1 to 3;/*assuming 3 categories*/
    cat[i] = strip(scan(scan(categories,i,","),2,")"));
  end;
  _rc=sort(of cat1-cat3);
  category=catx(", ",of cat1-cat3);
  keep id category;
run;

Ksharp · Posted 08-24-2021 09:21 AM

data have;
   infile datalines dlm=':';
   input ID Categories :$100.;
 datalines;
1:(i)cyt,(ii)end therapy,(iii)targ
2:(ii)cyt,(iii)targ,(ii)end therapy
3:(iii)end therapy,(i)cyt,(iii)targ
4:iii)targ,(i)cyt,(ii)end therapy
5:(iii)end therapy,(i)cyt,(iii)targ
6:iii)targ,(i)cyt,(ii)end therapy
7:(i)cyt,(ii)end therapy,(iii)targ
8:(ii)cyt,(iii)targ,(ii)end therapy
9:(ii)cyt
10:(iii)targ
11:(ii)end therapy
;

data want;
 set have;
 want=prxchange('s/\(.*?\)//',-1,Categories);
 want2=prxchange('s/.*\)//',-1,want);
run;

Registration is open

SAS Training: Just a Click Away