How do I create an identifier variable for all possible combinations?

Accepted Solution Solved
Reply
New Contributor
Posts: 2
Accepted Solution

How do I create an identifier variable for all possible combinations?

I have a set of 20 variables in a dataset, each variable with somewhere between two to six levels of values.

 

Example:

data trial; 
input ageg1 ses1 byear1 race1 preterm1 sga1 stillbirth1 obesity1
interval2 ageg2 preterm2 sga2 stillbirth2 obesity2 interval3 
ageg3 preterm3 sga3 stillbirth3 obesity3;
datalines;
1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1
2	2	2	2	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3	3	3	3	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1
4	2	4	4	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
6	3	1	5	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1
3	1	2	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
run;

 

I need to create a new variable (named 'combo') with a unique code identifying each potential combination of these 20 variables. For example, '1' would indicate that var1=1, var2=2, var3=1, etc. and '2' would indicate var1=1, var2=2, var3=2, etc. 

 

I then need to determine the number of observations falling within each of those combinations.

 

I know that an array and summary statement should be able to accomplish this, but struggling with how exactly to write this. 

 

Any advice would be helpful.


Accepted Solutions
Solution
‎12-19-2016 07:27 AM
Super User
Posts: 5,509

Re: How do I create an identifier variable for all possible combinations?

It's probably easier to create the actual set of codes, rather than to map a number to the set.  For example:

 

data want;

set have;

combo = put(ageg1, 1.) || put(ses1, 1.) || put(byyear1, 1.) || ... || put(stillbirth3, 1.) || put(obesity3, 1.);

run;

 

proc freq data=want;

tables combo;

run;

 

It's even easier if you are going back to the raw data, and using the INPUT statement.  If that INPUT statement truly reflects the contents of your incoming data, you could add to that DATA step:

 

length combo $ 20;

combo = compress(_infile_);

 

View solution in original post


All Replies
Trusted Advisor
Posts: 1,568

Re: How do I create an identifier variable for all possible combinations?

[ Edited ]

Try next code:

 

%let vars = 

ageg1 ses1 byear1 race1 preterm1 sga1 stillbirth1 obesity1
interval2 ageg2 preterm2 sga2 stillbirth2 obesity2 interval3 
ageg3 preterm3 sga3 stillbirth3 obesity3;

data trial; 
input &vars;
datalines;
1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1
2	2	2	2	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3	3	3	3	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1
4	2	4	4	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
6	3	1	5	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1
3	1	2	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
run;

proc sort data=trial;
  by &vars;
run;

data want;
  set trial;
  by &vars;
  retain combo 0;
  if first.obesity3 /* last variable in list &vars */ then combo +1;
run;

proc freq data=want;
  table combo;
run;

 

 You can add options to proc freq  to create sas datadset output beyond or instead the print out.

 You even can define to calculate/show percantage or not.

 

 

 

Solution
‎12-19-2016 07:27 AM
Super User
Posts: 5,509

Re: How do I create an identifier variable for all possible combinations?

It's probably easier to create the actual set of codes, rather than to map a number to the set.  For example:

 

data want;

set have;

combo = put(ageg1, 1.) || put(ses1, 1.) || put(byyear1, 1.) || ... || put(stillbirth3, 1.) || put(obesity3, 1.);

run;

 

proc freq data=want;

tables combo;

run;

 

It's even easier if you are going back to the raw data, and using the INPUT statement.  If that INPUT statement truly reflects the contents of your incoming data, you could add to that DATA step:

 

length combo $ 20;

combo = compress(_infile_);

 

Respected Advisor
Posts: 4,173

Re: How do I create an identifier variable for all possible combinations?

Just a code variation for what has already been proposed.

data trial;
  length combo $20;
  input ageg1 ses1 byear1 race1 preterm1 sga1 stillbirth1 obesity1
    interval2 ageg2 preterm2 sga2 stillbirth2 obesity2 interval3 
    ageg3 preterm3 sga3 stillbirth3 obesity3;
  combo=cats(of _all_);
  datalines;
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
4 2 4 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 3 1 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
3 1 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
run;
New Contributor
Posts: 2

Re: How do I create an identifier variable for all possible combinations?

This worked well. Thanks. One follow up question. If I had a similar dataset with the combo variable created above and a variable providing the frequency of this combination. For example:

Freq combo
5 '54216101024521041021'
11 '32103241010324410101'


and I wanted to transform this into a dataset which had five observations for the first combo and 11 for the second, etc. is there an efficient way to do this?

Thanks again.

Super User
Posts: 5,509

Re: How do I create an identifier variable for all possible combinations?

You could do this fairly easily:

 

data want;

set have;

do _n_=1 to freq;

   output;

end;

run;

 

But note that it may not be necessary.  Many SAS procedures support the use of either a FREQ or a WEIGHT statement which might make the original data sufficient for analysis.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 236 views
  • 0 likes
  • 4 in conversation