BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
acerickson
Obsidian | Level 7

Hello,
I'm a SAS noob (using SAS 9.4) and couldn't find an existing question to satisfy my question, so here I am.

I need to create a series of dummy variable matrices of several categorical variables for all possible combinations of a set of stratifying variables. In the simplified example below, I input the 4 actual stratifying variables and possible values (a-d), and 2 categorical variables (ms3c and edu2c). If you run the code below, the resulting datasets (test2, test3) represent what I will want as output but as separate csv files. SHould end up with 840 different matrices.

I have been try to include a 'allcomb' function in a DO Loop, but can't figure out the syntax.
Thanks for any help.

 

data test1; 
input uid a b c d ms3c edu2c;
cards;
1 1 1 1 1 2 2
2 1 2 1 1 3 2
3 1 3 3 3 3 1
4 2 4 4 4 2 1
5 1 5 5 5 1 2
6 2 6 1 6 2 1
7 1 7 2 7 1 2
8 2 8 3 6 3 2
9 1 9 4 5 3 1
10 2 10 5 4 2 2
11 2 11 1 3 2 2
12 1 12 2 2 1 2
13 1 1 1 1 3 1
14 1 2 1 1 2 2
15 2 3 5 3 2 1
run;

* this gives an idea of what I need to do, create 1 matrix for every combination of strata variables; 
data test2 (drop=uid a b c d ms3c edu2c);
 set test1;
 if a=1 and b=1 and c=1 and d=1 and ms3c=2 then ms2=1; else ms2=0;
 if a=1 and b=1 and c=1 and d=1 and ms3c=3 then ms3=1; else ms3=0;
 if a=1 and b=1 and c=1 and d=1 and edu2c=2 then edu2=1; else edu2=0;
 * export to csv;
run;
*;
data test3 (drop=uid a b c d ms3c edu2c);
 set test1;
 if a=1 and b=2 and c=1 and d=1 and ms3c=2 then ms2=1; else ms2=0;
 if a=1 and b=2 and c=1 and d=1 and ms3c=3 then ms3=1; else ms3=0;
 if a=1 and b=2 and c=1 and d=1 and edu2c=2 then edu2=1; else edu2=0;
* export to csv;
run;
* ... and so on ...;
1 ACCEPTED SOLUTION

Accepted Solutions
acerickson
Obsidian | Level 7

Brilliant! Thank you*840  to Astounding.

So putting it all together we have the following. Perhaps there is a more efficient program, but I am happy with this.

* create a new variable identifying the unique strata;
data all_combinations;
    do a=1 to 2;
    do b=1 to 12;
    do c=1 to 5;
    do d=1 to 7;
        single_variable = d + 10*c + 100*b + 10000 * a;
   output;
end; end; end; end;
run;

* now merge that back to the main dataset (test1);
proc sort data=test1;
    by a b c d;
run;
proc sort data=all_combinations;
    by a b c d;
run;
data merge1;
    merge test1 all_combinations;
    by a b c d;
run;

* create the dummy codes by strata for the variables (ms3c and edu2c);
data test2 (keep= single_variable ms2 ms3 edu2);
  set merge1;
    if ms3c=2 then ms2=1; else ms2=0;
    if ms3c=3 then ms3=1; else ms3=0;
    if edu2c=2 then edu2=1; else edu2=0;
run;

* now create a dummy code matrix table for each unique strata (840) and name the tables with the strata id code (single_variable);
data _null_;
call execute('proc sql') ;
do until (done);
   set test2 end=done;
   dsname = put(single_variable, z5.);
   call execute('create table xmat_' || dsname || ' as select * from test4 where single_variable= ' || dsname || ';' ) ;
end;
call execute('quit;' ) ;
stop;
run; 

 

View solution in original post

19 REPLIES 19
Reeza
Super User

First question, why are you doing this?

 

Second:

I've listed a few of the options here:

https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-dummy-variables-Categorical-Var...

 

Do any of those work for you?

 


@acerickson wrote:

Hello,
I'm a SAS noob (using SAS 9.4) and couldn't find an existing question to satisfy my question, so here I am.

I need to create a series of dummy variable matrices of several categorical variables for all possible combinations of a set of stratifying variables. In the simplified example below, I input the 4 actual stratifying variables and possible values (a-d), and 2 categorical variables (ms3c and edu2c). If you run the code below, the resulting datasets (test2, test3) represent what I will want as output but as separate csv files. SHould end up with 840 different matrices.

I have been try to include a 'allcomb' function in a DO Loop, but can't figure out the syntax.
Thanks for any help.

 

data test1; 
input uid a b c d ms3c edu2c;
cards;
1 1 1 1 1 2 2
2 1 2 1 1 3 2
3 1 3 3 3 3 1
4 2 4 4 4 2 1
5 1 5 5 5 1 2
6 2 6 1 6 2 1
7 1 7 2 7 1 2
8 2 8 3 6 3 2
9 1 9 4 5 3 1
10 2 10 5 4 2 2
11 2 11 1 3 2 2
12 1 12 2 2 1 2
13 1 1 1 1 3 1
14 1 2 1 1 2 2
15 2 3 5 3 2 1
run;

* this gives an idea of what I need to do, create 1 matrix for every combination of strata variables; 
data test2 (drop=uid a b c d ms3c edu2c);
 set test1;
 if a=1 and b=1 and c=1 and d=1 and ms3c=2 then ms2=1; else ms2=0;
 if a=1 and b=1 and c=1 and d=1 and ms3c=3 then ms3=1; else ms3=0;
 if a=1 and b=1 and c=1 and d=1 and edu2c=2 then edu2=1; else edu2=0;
 * export to csv;
run;
*;
data test3 (drop=uid a b c d ms3c edu2c);
 set test1;
 if a=1 and b=2 and c=1 and d=1 and ms3c=2 then ms2=1; else ms2=0;
 if a=1 and b=2 and c=1 and d=1 and ms3c=3 then ms3=1; else ms3=0;
 if a=1 and b=2 and c=1 and d=1 and edu2c=2 then edu2=1; else edu2=0;
* export to csv;
run;
* ... and so on ...;

 

acerickson
Obsidian | Level 7

thanks for the quick reply.

 

it's in order to perform an indirect adjustment in a Cox model using two distinct but representative datasets. The model has 4 stratifying variables (sex, 5-year age groups, urban size and airshed). 

 

I don't think those suggestions will work easily as I need the program to loop through all possible combinations of the stratifying variables.

 

 

Reeza
Super User

Can we assume you're familiar with how SAS handles CLASS and STRATA variables then?

And that SAS automatically creates the dummy variables for all levels present in the data already?

 

 

acerickson
Obsidian | Level 7

yes, that is within the PROC statement for the Cox model (e.g. proc phreg). 

 

I'm calculating a series of Cox models, then taking a summed ratio of them (see attached for a summary slide). 

ballardw
Super User

1) I don't think that your "and so on" after data test3 is as obvious as you think it may be.

 

I am a tad concerned about resue of the dummy variable names ms2, ms3 and edu2. Same variable name with different meanings will almost always lead to some confusion.

 

I find the easiest, though not the most elegant, way to get "all combinations" of multiple variables is nested arrays. I can't actually tell whay your limits are on each variable but this should demonstrate:

data example;
   /* variable a to have ranges of 1,2,3
      variable b to have range of 88,99
      variable c to have range of 'AB', 'AC', 'EX'
   */
   do a= 1 to 3;
      do b = 88,99;
         do c = 'AB','AC','EX';
         output;
         end;
      end;
   end;
run;

The example provides a mix of sequential integers, non-sequential integers and a character variable. The order of the nesting isn't really import though that output statement is critical.

 

 

How you use those to make your dummies is up to you. I would likely be tempted to put the code for that in the innermost do loop (BEFORE the output), but that would require different names from the way you did it.

acerickson
Obsidian | Level 7

thanks for the response, that a helpful start.  I can use different variable names, that's no problem.

 

apologies for the unclear description.

 

There are 4 stratifying variables with the following ranges:

a (1,2)

b (1,12)

c (1,5)

d (1,7)

 

so what I mean by the "..and so on..." comment, is that I would need to repeat those 3 IF statements 840 times in order to cycle through all possible combinations of the 4 stratifying variables. 

thanks again!

Astounding
PROC Star

Really just a shot in the dark here, since it's too difficult for me to figure out what you will do with this later.  Can't you just use a single variable that takes on 840 different values?  The matrix would be:

 

data all_combinations;

do a=1 to 2;

do b=1 to 12;

do c=1 to 5;

do d=1 to 7;

   single_variable = d + 10*c + 100*b + 10000 * a;

   output;

end; end; end; end;

run;

 

You need 10,000 (not 1,000) as the multiplier for A, since B can be two digits.

acerickson
Obsidian | Level 7

thanks for the input, I'm trying to think how to possibly make this work...  if I was able to assign the new 'single_variable' to the corresponding individual records based on their values for the variables a to d, that would be something. 

Astounding
PROC Star

Couldn't you just use the same formula for SINGLE_VARIABLE, using A, B, C, and D that are found in your data?

acerickson
Obsidian | Level 7

yes, exactly. This is great as it solves a couple other questions, but still need to create the matrices.

 

So now I have this 'all_combinations' variable that identifies which of the 840 strata an individual belongs to, so now I can run a BY statement to create the matrices.

acerickson
Obsidian | Level 7

so close to solving this, just need help with the syntax using a FILE statement with the filevar= option.

Was trying to follow the steps from the FILE statement page here.

 

* create a new variable identifying the unique strata;
data all_combinations;
do a=1 to 2;
do b=1 to 12;
do c=1 to 5;
do d=1 to 7;
   single_variable = d + 10*c + 100*b + 10000 * a;
   output;
end; end; end; end;
run;

* now merge that back to the main dataset;
proc sort data=test1;
  by a b c d;
run;
proc sort data=all_combinations;
  by a b c d;
run;
data merge1;
  merge test1 all_combinations;
  by a b c d;
run;

/* now create the separate matrices based on the distinct 840 strata 
using a FILE statement to export to a CSV and dynamically update the 
filename to reflect its strata */
* may need to convert 'single_variable' to a character (from numeric);

data test4 (keep= ms2 ms3 edu2);
  set merge1;
  if ms3c=2 then ms2=1; else ms2=0;
  if ms3c=3 then ms3=1; else ms3=0;
  if edu2c=2 then edu2=1; else edu2=0;
  by single_variable;
  file file-specification <device-type> filevar=sinlge_variable; * need help here;
run;
Astounding
PROC Star

A few notes then ...

 

When using FILEVAR=, the file-specification is just a place holder.  You can use any name not already defined by your program, such as _out_.  

 

The device-type is optional and rarely used.  Just remove it.

 

The general idea is to create a character variable holding the complete path to the output file you would like to use.  For example:

 

destination = 'C:\myfolder\combination' || put(single_variable, z5.) || '.txt';

 

Then complete the FILE statement with:

 

filevar = destination;

acerickson
Obsidian | Level 7

still having issues with creating the separate files (couldn't get the FILE statement to work properly)

 

 

/* want to create a series of tables that loop through
the values of a variable (single_variable) and name it
with the corresponding value with the prefix "xmat_". */ **** for example *** ; proc sql ; create table xmat_10111 as select * FROM test4 where single_variable=10111; * repeat where single_variable =10111 to 21257; *******************************************; * what I have so far is; %macro trial; %do i=10111 %to 21257; proc sql ; create table xmat_%i as select * FROM test4 where single_variable=%i; %end; %mend ;

 

Astounding
PROC Star

Be glad you didn't get this to work.  Remember SINGLE_VARIABLE only takes on 840 values. If this had worked, you would have created about 10,000 empty tables.

 

One thing to fix (before we abandon this approach) is that you refer to a macro variable with an ampersand:  &i rather than %i

 

A better way to generate 840 CREATE statements, assuming you have that data set ALL_COMBINATIONS:

 

data _null_;

call execute('proc sql'; ) ;

do until (done);

   set all_combinations end=done;

   dsname = put(single_variable, z5.);

   call execute('create table xmat_' || dsname || ' as select * from test4 where single_variable= ' || dsname || ';' ) ;

end;

call execute('quit;' ) ;

stop;

run; 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 19 replies
  • 2274 views
  • 7 likes
  • 4 in conversation