Dear all,
I'm working with 2 tables : one contains all possible combinations of 2 factors among 4:
comp_1 comp_2
A B
A C
A D
B C
B D
C D
and the other contains the statistical units :
fact_A fact_B fact_C fact_D
unit1 0 0 1 1
unit2 1 0 1 1
unit3 0 1 0 1
unit4 1 0 1 0
unit5 0 1 1 1
unit6 1 1 0 0
(unit1 has C and D factors ; unit 5 has B, C and D factors)
I would like to know how many units there are for each combination (mod_1,mod_2).
In this example, the result would be :
comp_1 comp_2 nb_units
A B 1
A C 2
A D 1
B C 1
B D 2
C D 3
I can find easily this result with a loop but my problem is that, actually, there are much more than 4 factors. And thus, the treatment is too long with a loop.
So, I would like to know if my counting could be done with matrix calculation, especially with SAS IML ?
Hoping that my request is clear...
In advance thank you to all those who will try to help me.
Céline
First use the ALLCOMB function in SAS/IML to enumerate all pairwise combinations of the columns.
You can then extract each pair and use elementwise multiplication to form a binary vector that has 1 for units that are in common between the factors. Sum up this vector to obtain the results that you want for each pair of columns.
proc iml;
colnames = {fact_A fact_B fact_C fact_D};
x = {0 0 1 1,
1 0 1 1,
0 1 0 1,
1 0 1 0,
0 1 1 1,
1 1 0 0};
comb = allcomb(ncol(x), 2);
*print comb;
count = j(nrow(comb), 1);
do i = 1 to nrow(comb);
v = x[ ,comb[i,1]] # x[ ,comb[i,2]];
count[i] = sum(v);
end;
print comb[c={Fact1 Fact2}] count;
Or if you really want a matrix solution with no loops, just form the crossproduct matrix. The results are the sums in the strictly upper-triangular portion of the matrix:
results = X`*X;
print results[c=colnames r=colnames];
Thank you for your reply.
The crossproduct matrix performs calculation very quickly.
I still have to find a way to exploit easily the strictly upper-triangular portion of the matrix. But thanks to your answers, I have made great progress!
Celine
You can use the ROW and COL functions to extract the values, and the NDX2SUB to convert the indices to subscripts. This requires SAS/IML 12.3, but the links show how to define the functions in earlier releases. Here's an example,
proc iml;
corr = {1.0 0.6 0.5 0.4,
0.6 1.0 0.3 0.2,
0.5 0.3 1.0 0.1,
0.4 0.2 0.1 1.0};
r = row(corr);
c = col(corr);
upperTri = loc(r < c); /* upper tri indices in row major order */
val = corr[upperTri]; /* vector contains n*(n-1)/2 upper triangular corr */
subs = ndx2sub(dimension(corr), upperTri);
print subs[c={"Row" "Col"}] val;
When you have a ton of factors, it's not clear to me which would be the easier approach. So here's the DATA step approach. Process the second data set only to start:
data want;
set have;
array factors {5} fact_A fact_B fact_C fact_D fact_E;
length comp_1 comp_2 $ 1;
do _J_=1 to 4;
do _K_=J+1 to 5;
if factors{_J_}=factors{_K_}=1 then do;
comp_1 = scan(vname(factors{_J_}), -1, '_');
comp_2 = scan(vname(factors{_K_}, -1, '_');
output;
end;
end;
end;
run;
proc freq data=want;
tables comp_1 * comp_2 / noprint out=want2 (drop=percent);
run;
That gives you a summary data set with counts of all COMP_1 * COMP_2 combinations. It can be merged back into your original data set pretty easily.
in practice, you may need longer lengths for COMP_1 and COMP_2, and you may need to alter the SCAN function to pick out the proper name. And you may need to rename COUNT to NB_UNITS. But the form to the program should remain.
Good luck.