Re: Matrix calculation instead of a loop

Celine_G · Posted 10-12-2015 10:00 AM

Dear all,

I'm working with 2 tables : one contains all possible combinations of 2 factors among 4:

comp_1 comp_2
      A        B
      A       C
      A         D
     B         C
       B        D
      C        D

and the other contains the statistical units :

             fact_A    fact_B    fact_C fact_D
unit1         0             0           1           1
unit2          1             0           1            1
unit3          0             1          0            1
unit4         1             0          1            0
unit5          0             1           1            1
unit6          1             1           0            0

(unit1 has C and D factors ; unit 5 has B, C and D factors)

I would like to know how many units there are for each combination (mod_1,mod_2).

In this example, the result would be :

comp_1 comp_2 nb_units
      A           B           1
      A           C           2
      A          D            1
       B           C            1
       B           D            2
       C           D            3

I can find easily this result with a loop but my problem is that, actually, there are much more than 4 factors. And thus, the treatment is too long with a loop.

So, I would like to know if my counting could be done with matrix calculation, especially with SAS IML ?

Hoping that my request is clear...

In advance thank you to all those who will try to help me.

Céline

Rick_SAS · Posted 10-12-2015 10:33 AM

First use the ALLCOMB function in SAS/IML to enumerate all pairwise combinations of the columns.

You can then extract each pair and use elementwise multiplication to form a binary vector that has 1 for units that are in common between the factors. Sum up this vector to obtain the results that you want for each pair of columns.

proc iml;
colnames = {fact_A    fact_B    fact_C   fact_D};
x = {0             0           1            1, 
     1             0           1            1, 
     0             1           0            1,
     1             0           1            0, 
     0             1           1            1, 
     1             1           0            0};

comb = allcomb(ncol(x), 2);
*print comb;
count = j(nrow(comb), 1);
do i = 1 to nrow(comb);
   v = x[ ,comb[i,1]] # x[ ,comb[i,2]]; 
   count[i] = sum(v);
end;

print comb[c={Fact1 Fact2}] count;

Rick_SAS · Posted 10-12-2015 10:39 AM

Or if you really want a matrix solution with no loops, just form the crossproduct matrix. The results are the sums in the strictly upper-triangular portion of the matrix:


results = X`*X;
print results[c=colnames r=colnames];

Celine_G · Posted 10-13-2015 10:46 AM

Thank you for your reply.

The crossproduct matrix performs calculation very quickly.

I still have to find a way to exploit easily the strictly upper-triangular portion of the matrix. But thanks to your answers, I have made great progress!

Celine

Rick_SAS · Posted 10-13-2015 02:08 PM

You can use the ROW and COL functions to extract the values, and the NDX2SUB to convert the indices to subscripts. This requires SAS/IML 12.3, but the links show how to define the functions in earlier releases. Here's an example,

proc iml;
corr = {1.0  0.6  0.5  0.4,
        0.6  1.0  0.3  0.2,
        0.5  0.3  1.0  0.1,
        0.4  0.2  0.1  1.0};
r = row(corr);
c = col(corr);
upperTri = loc(r < c); /* upper tri indices in row major order */
val = corr[upperTri];    /* vector contains n*(n-1)/2 upper triangular corr */
subs = ndx2sub(dimension(corr), upperTri);
print subs[c={"Row" "Col"}] val;

Astounding · Posted 10-12-2015 10:49 AM

When you have a ton of factors, it's not clear to me which would be the easier approach. So here's the DATA step approach. Process the second data set only to start:

data want;

set have;

array factors {5} fact_A fact_B fact_C fact_D fact_E;

length comp_1 comp_2 $ 1;

do _J_=1 to 4;

do _K_=J+1 to 5;

if factors{_J_}=factors{_K_}=1 then do;

comp_1 = scan(vname(factors{_J_}), -1, '_');

comp_2 = scan(vname(factors{_K_}, -1, '_');

output;

end;

run;

proc freq data=want;

tables comp_1 * comp_2 / noprint out=want2 (drop=percent);

run;

That gives you a summary data set with counts of all COMP_1 * COMP_2 combinations. It can be merged back into your original data set pretty easily.

in practice, you may need longer lengths for COMP_1 and COMP_2, and you may need to alter the SCAN function to pick out the proper name. And you may need to rename COUNT to NB_UNITS. But the form to the program should remain.

Good luck.

Celine_G · Posted 10-13-2015 10:48 AM

I prefer not to use loops but thank you very much for your help!

Matrix calculation instead of a loop