BookmarkSubscribeRSS Feed
Celine_G
Calcite | Level 5

Dear all,

 

I'm working with 2 tables : one contains all possible combinations of 2 factors among 4:

 

comp_1 comp_2
       A         B
       A         C
       A         D
       B         C
       B         D
       C         D

 

and the other contains the statistical units :
 
             fact_A    fact_B    fact_C   fact_D
unit1          0             0           1            1 
unit2          1             0           1            1 
unit3          0             1           0            1 
unit4          1             0           1            0 
unit5          0             1           1            1 
unit6          1             1           0            0   

 

(unit1 has C and D factors ; unit 5 has B, C and D factors)

 

I would like to know how many units there are for each combination (mod_1,mod_2).

In this example, the result would be :


comp_1 comp_2  nb_units
       A           B            1
       A           C            2
       A           D            1
       B           C            1
       B           D            2
       C           D            3


I can find easily this result with a loop but my problem is that, actually, there are much more than 4 factors. And thus, the treatment is too long with a loop.

So, I would like to know if my counting could be done with matrix calculation, especially with SAS IML ?

 

Hoping that my request is clear...

 

In advance thank you to all those who will try to help me.

 

Céline

 

6 REPLIES 6
Rick_SAS
SAS Super FREQ

First use the ALLCOMB function in SAS/IML to enumerate all pairwise combinations of the columns.

You can then extract each pair and use elementwise multiplication to form a binary vector that has 1 for units that are in common between the factors. Sum up this vector to obtain the results that you want for each pair of columns.

 

 

proc iml;
colnames = {fact_A    fact_B    fact_C   fact_D};
x = {0             0           1            1, 
     1             0           1            1, 
     0             1           0            1,
     1             0           1            0, 
     0             1           1            1, 
     1             1           0            0};

comb = allcomb(ncol(x), 2);
*print comb;
count = j(nrow(comb), 1);
do i = 1 to nrow(comb);
   v = x[ ,comb[i,1]] # x[ ,comb[i,2]]; 
   count[i] = sum(v);
end;

print comb[c={Fact1 Fact2}] count;
Rick_SAS
SAS Super FREQ

Or if you really want a matrix solution with no loops, just form the crossproduct matrix.  The results are the sums in the strictly upper-triangular portion of the matrix:

 


results = X`*X;
print results[c=colnames r=colnames];
Celine_G
Calcite | Level 5

Thank you for your reply.

The crossproduct matrix performs calculation very quickly.

 

I still have to find a way to exploit easily the strictly upper-triangular portion of the matrix. But thanks to your answers, I have made great progress!

 

Celine

 

Rick_SAS
SAS Super FREQ

You can use the ROW and COL functions to extract the values, and the NDX2SUB to convert the indices to subscripts. This requires SAS/IML 12.3, but the links show how to define the functions in earlier releases.  Here's an example,

 

proc iml;
corr = {1.0  0.6  0.5  0.4,
        0.6  1.0  0.3  0.2,
        0.5  0.3  1.0  0.1,
        0.4  0.2  0.1  1.0};
r = row(corr);
c = col(corr);
upperTri = loc(r < c); /* upper tri indices in row major order */
val = corr[upperTri];    /* vector contains n*(n-1)/2 upper triangular corr */
subs = ndx2sub(dimension(corr), upperTri);
print subs[c={"Row" "Col"}] val;
Astounding
PROC Star

When you have a ton of factors, it's not clear to me which would be the easier approach.  So here's the DATA step approach.  Process the second data set only to start:

 

data want;

   set have;

   array factors {5} fact_A fact_B fact_C fact_D fact_E;

   length comp_1 comp_2 $ 1;

   do _J_=1 to 4;

      do _K_=J+1 to 5;

         if factors{_J_}=factors{_K_}=1 then do;

            comp_1 = scan(vname(factors{_J_}), -1, '_');

            comp_2 = scan(vname(factors{_K_}, -1, '_');

            output;

         end;

      end;

   end;

run;

 

proc freq data=want;

   tables comp_1 * comp_2 / noprint out=want2 (drop=percent);

run;

 

That gives you a summary data set with counts of all COMP_1 * COMP_2 combinations.  It can be merged back into your original data set pretty easily.

 

in practice, you may need longer lengths for COMP_1 and COMP_2, and you may need to alter the SCAN function to pick out the proper name.  And you may need to rename COUNT to NB_UNITS.  But the form to the program should remain.

 

Good luck.

Celine_G
Calcite | Level 5
I prefer not to use loops but thank you very much for your help!

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

Multiple Linear Regression in SAS

Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.

Find more tutorials on the SAS Users YouTube channel.

From The DO Loop
Want more? Visit our blog for more articles like these.
Discussion stats
  • 6 replies
  • 1068 views
  • 0 likes
  • 3 in conversation