Hello,
Let's say that have raw data that including information of customer ID and also 10 binary variables that can get values 1/0.
Let's say that some variables can get simultaneously value 1 and some cannot.
I want to create a matrix that help me to learn which variables can get value 1 simultaneously.
Data Raw_tbl;
input ID x1-x5;
cards;
1 1 1 0 0 0
2 1 0 0 0 0
3 0 0 1 0 0
4 0 0 0 0 1
5 0 0 0 1 1
6 0 0 0 1 0
7 0 0 0 1 1
8 0 1 1 0 0
9 1 0 0 0 0
10 1 1 0 0 0
11 1 1 1 1 1
;
Run;
The required matrix:
when X1 get value 1 then only X2 can get value 1
when X2 get value 1 then only X2,X3,X4,X5 can get value 1
when X3 get value 1 then only X2 can get value 1
when X4 get value 1 then only X2,x4,x5 can get value 1
when X5 get value 1 then only X4 can get value 1
what is the way to find the matrix based on the raw data?
So you want to make matrix that indicates if the two variables are ever true on the same observation?
And you want to lower triangle to be all zeros?
Something like this will build the matrix in a temporary array and then write it out at the end of the data step.
data want;
array flags [10,10] _temporary_ (100*0) ;
length name $32 ;
set raw_tbl end=eof;
array x x1-x10;
do row=1 to dim(x);
if x[row] then do col=row to dim(x);
if x[col] then flags[row,col]=1;
end;
end;
if eof then do;
do row=1 to dim(x);
name = vname(x[row]);
do col=1 to dim(x);
x[col]=flags[row,col];
end;
output;
end;
end;
keep name x1-x10;
run;
Result:
Obs name x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 1 x1 1 1 0 0 0 1 1 1 0 0 2 x2 0 1 0 0 0 1 1 0 0 0 3 x3 0 0 1 0 0 1 0 0 0 0 4 x4 0 0 0 1 1 1 0 1 1 1 5 x5 0 0 0 0 1 1 0 1 1 1 6 x6 0 0 0 0 0 1 1 1 1 1 7 x7 0 0 0 0 0 0 1 1 0 0 8 x8 0 0 0 0 0 0 0 1 1 0 9 x9 0 0 0 0 0 0 0 0 1 0 10 x10 0 0 0 0 0 0 0 0 0 1
You can change it to count instead by changing this line to use a sum statement instead
if x[col] then flags[row,col]+1;
Now you can see how often the two are both true. And the diagonal is now how often that variable was true. SO X6 was true 5 times.
Obs name x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 1 x1 2 1 0 0 0 2 2 1 0 0 2 x2 0 1 0 0 0 1 1 0 0 0 3 x3 0 0 1 0 0 1 0 0 0 0 4 x4 0 0 0 2 2 2 0 1 1 1 5 x5 0 0 0 0 2 2 0 1 1 1 6 x6 0 0 0 0 0 5 2 2 1 1 7 x7 0 0 0 0 0 0 2 1 0 0 8 x8 0 0 0 0 0 0 0 2 1 0 9 x9 0 0 0 0 0 0 0 0 1 0 10 x10 0 0 0 0 0 0 0 0 0 1
Thanks,
May you please explain:
1-Why did you write 10,10? IS it because the required matrix is on dimensions 10x10?
In this case the required matrix is 5x5 because there are 5 binary variables
2-What dies it mean _temporary_ (100*0)?
Is it an array called _temporary_ ? Why did you write 100*0?
3-You defined array called X (array x x1-x10).
Why did you write X1-X10? (In this example there are 5 binary variables and not 10)
4-You wrote "Now you can see how often the two are both true. And the diagonal is now how often that variable was true", Do you mean that you do here two way frequency table of binary variables?
Will this soultion work well by your opinion?
proc freq data=raw_tbl;
table X1* (_all_);
table X2* (_all_);
table X3* (_all_);
table X4* (_all_);
table X5* (_all_);
run;
I used 10 because I didn't pay enough attention to how many variables you had. The size of the temporary matrix can be whatever you want as long as both the number of rows and number of columns is at least as large as the set of variables you are testing. You could just fix the second array definition and the extra three variables will not be created. The sample data you posted actually only has X1 to X5.
array x x1-x5;
...
keep name x1-x5;
The ( 100 * 0 ) is to initialize the array to zeros. Read the documentation on the ARRAY statement. 100 is because there are 100 variables in a 10 by 10 array.
I initialize the array to all zeros to prevent the values from defaulting to missing. You could leave them missing or replace the missing values with zeros in the final DO loop that generates the output observations instead if you wanted.
The frequency cross tabs that code like
proc freq data=raw_tbl;
tables (x1-x5)*(x1-x5) / list;
run;
would generated is something totally different than what you appear to have requested.
That extra line at the end makes all of them interconnected.
Without that:
data Raw_tbl;
input ID x1-x5;
cards;
1 1 1 0 0 0
2 1 0 0 0 0
3 0 0 1 0 0
4 0 0 0 0 1
5 0 0 0 1 1
6 0 0 0 1 0
7 0 0 0 1 1
8 0 1 1 0 0
9 1 0 0 0 0
10 1 1 0 0 0
;
You get this result:
Obs name x1 x2 x3 x4 x5 1 x1 1 1 0 0 0 2 x2 0 1 1 0 0 3 x3 0 0 1 0 0 4 x4 0 0 0 1 1 5 x5 0 0 0 0 1
Data Raw_tbl;
input ID x1-x5;
cards;
1 1 1 0 0 0
2 1 0 0 0 0
3 0 0 1 0 0
4 0 0 0 0 1
5 0 0 0 1 1
6 0 0 0 1 0
7 0 0 0 1 1
8 0 1 1 0 0
9 1 0 0 0 0
10 1 1 0 0 0
;
Run;
proc corr data=Raw_tbl out=temp sscp noprint nocorr ;
var x1-x5 ;
run;
data sscp;
set temp;
if _type_='SSCP' and _name_ ne 'Intercept';
drop Intercept;
run;
data want;
set sscp;
array x{*} x:;
do i=1 to dim(x);
if x{i} ne 0 then x{i}=1;
end;
drop i _type_;
run;
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.