Solved: Re: Count number of values shared between variables - Page 2

Tom · Posted 09-25-2023 06:37 PM

Use CALL LEXICOMB().

Let's try it for 4 variables and see what we can do. Let's make a dataset with character variables named NAME1 to NAME4 with the variable names.

data names;
  input name $32.;
cards;
var1
var2
var3
var4
;

proc transpose data=names out=wide(drop=_name_) prefix=name;
  var name;
run;;

Now let's make all possible combinations of taking them 1,2,3,... at a time.

data combi ;
  length n k comb 8;
  set wide;
  array names name:;
  n=dim(names);
  do k=1 to n;
    p=1;
    set wide point=p;
    do comb=1 to comb(n,k);
      call lexcomb(comb,k,of names[*]);
      output;
    end;
  end;
run;

Now let's use that to generate code. And at the same time let's blank out the non included variable names.


filename code temp;
data combi;
  set combi end=eof;
  array names name: ;
  file code;
  if _n_=1 then put 'create table counts as ' / ' ' @;
  else put ',' @;
  put 'sum( ' names[1] @;
  do i=2 to k ;
    put 'and ' names[i] @ ;
  end;
  put ') as count_' k +(-1) '_' comb ;
  do i=k+1 to n;
    names[i]=' ';
  end;
  if eof then  put 'from boolean' / ';' ;
  drop i;
run;

SO it generated this code which you could run with %INCLUDE

create table counts as
 sum( var1 ) as count_1_1
,sum( var2 ) as count_1_2
,sum( var3 ) as count_1_3
,sum( var4 ) as count_1_4
,sum( var1 and var2 ) as count_2_1
,sum( var1 and var3 ) as count_2_2
,sum( var1 and var4 ) as count_2_3
,sum( var2 and var3 ) as count_2_4
,sum( var2 and var4 ) as count_2_5
,sum( var3 and var4 ) as count_2_6
,sum( var1 and var2 and var3 ) as count_3_1
,sum( var1 and var2 and var4 ) as count_3_2
,sum( var1 and var3 and var4 ) as count_3_3
,sum( var2 and var3 and var4 ) as count_3_4
,sum( var1 and var2 and var3 and var4 ) as count_4_1
from boolean
;

And this is what the COMBI dataset looks like now:

Obs    n    k    comb    name1    name2    name3    name4

  1    4    1      1     var1
  2    4    1      2     var2
  3    4    1      3     var3
  4    4    1      4     var4
  5    4    2      1     var1     var2
  6    4    2      2     var1     var3
  7    4    2      3     var1     var4
  8    4    2      4     var2     var3
  9    4    2      5     var2     var4
 10    4    2      6     var3     var4
 11    4    3      1     var1     var2     var3
 12    4    3      2     var1     var2     var4
 13    4    3      3     var1     var3     var4
 14    4    3      4     var2     var3     var4
 15    4    4      1     var1     var2     var3     var4

Reeza · Posted 09-25-2023 06:16 PM

Once you're in this form, I think you can use PROC CORR to get the 2 way summaries faster I believe? or PROC DISTANCE?

Jean-Jacques · Posted 09-26-2023 08:57 AM

Reeza, by "this form, do you mean the tall form?
PROC DISTANCE, then CLUSTER might in fact be a good solution to the underlying question (help users decide which vars i.e. tasks can be combined so as to minimize duplication of effort) effort)

mkeintz · Posted 09-25-2023 07:12 PM

At first, I thought you wanted to count instances of different variables simultaneously having the same value. But you apparently want to look for whether they EVER has the same value.

You can create a hash object indexed by VALUE, accompanied by an array of dummy variables indicating whether the corresponding original variable ever has that value.

After completing the hash object, generate a list of variable names matching the pattern of dummies, then do a frequency of that variable list.

dm 'clear log;clear out;';
data have;
input var1 var2 var3 var4;
datalines;
15 26 3 13
25 28 28 1
30 20 27 12
25 5 10 4
7 6 22 28
6 19 17 7
23 25 6 2
12 . 25 30
2 . 23 8
5 . 30 6
21 . 14 .
8 . 13 .
22 . 2 .
29 . 21 .
1 . . .
run;


data dummies;
  retain value dummy1-dummy4 .;
  stop;
run;

data want (keep=value groupsize varlist);
  set have  end=end_of_have;
  array vars {*} _numeric_ ;

  if _n_=1 then do;
    if 0 then set dummies;
    array dum {*}  dummy: ;
    declare hash h (dataset:'dummies',ordered:'a');
      h.definekey('value');
      h.definedata(all:'Y');
      h.definedone();
    declare hiter hi ('h');
  end;

  do v=1 to dim(vars);
    if vars{v}=. then continue;   /*Skip to next iteration */ 
    value=vars{v};
    if h.find()^=0 then call missing(of dum{*});
    dum{v}=1;
    h.replace();
  end;

  if end_of_have;
  *Build an array of variable names *;
  array vnames {4} $32 ;
  length varlist $200 ;
  do v=1 to dim(vars);
    vnames{v}=vname(vars{v});
  end;

  do rc=hi.first() by 0 until (hi.next()^=0);
    groupsize=sum(of dummy:);
    if groupsize=1 then continue;
    varlist=' ';
    do v=1 to dim(vars);
      if dum{v}=1 then varlist=catx(',',varlist,vnames{v});
    end;
    output;
  end;
run;
proc freq data=want;
  tables varlist;
run;

This program finds groups of shared values of all sizes, not just dyads.

I presume that if you have a value common to three variables, it is noted as a triple, but NOT as the three doubles that could be extracted from the triple. I.e. no double counting.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ksharp · Posted 09-26-2023 07:32 AM

/*It is IML thing.*/
data have;
input var1 var2 var3 var4;
datalines;
1 5 2 1
2 6 3 2
5 19 6 4
6 20 10 6
7 25 13 7
8 26 14 8
12 28 17 12
15 . 21 13
17 . 22 28
21 . 23 30
22 . 25 .
23 . 27 .
25 . 28 .
29 . 30 .
30 . . .

;
run;

/*For  combinations of TWO variables*/
proc iml;
use have;
read all var _all_ into x[c=vname];
close;
n=ncol(vname);
do i=1 to n-1;
 do j=i+1 to n;
   set1=set1//vname[i];
   set2=set2//vname[j];
   a=x[,i];a=a[loc(a^=.)];
   b=x[,j];b=b[loc(b^=.)];
   shared_count=shared_count//sum(element(a,b)); 
 end;
end;
create want_two var {set1 set2 shared_count};
append;
close;
quit;



/*For  combinations of THREE variables*/
proc iml;
use have;
read all var _all_ into x[c=vname];
close;
n=ncol(vname);
do i=1 to n-2;
 do j=i+1 to n-1;
  do k=j+1 to n;
   set1=set1//vname[i];
   set2=set2//vname[j];
   set3=set3//vname[k];
   a=x[,i];a=a[loc(a^=.)];
   b=x[,j];b=b[loc(b^=.)];
   c=x[,k];c=c[loc(c^=.)];
   shared_count=shared_count//sum(element(a,b) & element(a,c)); 
  end;
 end;
end;
create want_three var {set1 set2 set3 shared_count};
append;
close;
quit;

Jean-Jacques · Posted 09-26-2023 08:59 AM

KSharp,
this looks extremely economical, and of course it works perfectly as requested. I am among the users who have never used PROC IML.
How would you extend this to more than three-way combinations?

Ksharp · Posted 09-27-2023 07:32 AM

You could learn SAS/IML from Rick Wicklin's blog .

https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/bd-p/sas_iml

data have;
input var1-var6;
datalines;
1 5 2     1    1    1
2 6 3     2    2    2 
5 19 6    4    4    4
6 20 10   6    6    6
7 25 13   7    7    7
8 26 14  8    8    8
12 28 17 12   12   12
15 . 21  13   13   13
17 . 22  28   28   28
21 . 23  30   30   30
22 . 25  .    .    .
23 . 27  .    .    .
25 . 28  .    .    .
29 . 30  .    .    .
30 . .   .    .    .
;
run;
/*For  combinations of FOUR variables*/
proc iml;
use have;
read all var _all_ into x[c=vname];
close;
n=ncol(vname);
do i=1 to n-3;
 do j=i+1 to n-2;
  do k=j+1 to n-1;
   do l=k+1 to n;
    set1=set1//vname[i];
    set2=set2//vname[j];
    set3=set3//vname[k];
	set4=set4//vname[l];

    a=x[,i];a=a[loc(a^=.)];
    b=x[,j];b=b[loc(b^=.)];
    c=x[,k];c=c[loc(c^=.)];
    d=x[,l];d=d[loc(d^=.)];

    shared_count=shared_count//sum(element(a,b) & element(a,c) & element(a,d)); 
   end;
  end;
 end;
end;
create want_four var {set1 set2 set3 set4 shared_count};
append;
close;
quit;

/*For  combinations of FIVE variables*/
proc iml;
use have;
read all var _all_ into x[c=vname];
close;
n=ncol(vname);
do i=1 to n-4;
 do j=i+1 to n-3;
  do k=j+1 to n-2;
   do l=k+1 to n-1;
    do m=l+1 to n;
     set1=set1//vname[i];
     set2=set2//vname[j];
     set3=set3//vname[k];
 	 set4=set4//vname[l];
 	 set5=set5//vname[m];

     a=x[,i];a=a[loc(a^=.)];
     b=x[,j];b=b[loc(b^=.)];
     c=x[,k];c=c[loc(c^=.)];
     d=x[,l];d=d[loc(d^=.)];
     e=x[,m];e=e[loc(e^=.)];

    shared_count=shared_count//sum(element(a,b) & element(a,c) & element(a,d) & element(a,e)); 
	end;
   end;
  end;
 end;
end;
create want_five var {set1 set2 set3 set4 set5 shared_count};
append;
close;
quit;

Ksharp · Posted 09-27-2023 07:43 AM

https://blogs.sas.com/content/iml/

Jean-Jacques · Posted 09-27-2023 02:42 PM

Thank you!
I was struggling with how to define the elements in

sum(element(a,b) & element(a,c));

for the higher order combinations. Your additional examples for combinations of four and five variables demonstrate how to do it.

Jean-Jacques · Posted 09-27-2023 02:51 PM

Additional notes.
Thank you to all. The community's generosity is always amazing.
KSharp answered the original question exactly and his solution is very economical. It requires understanding IML.

Tom also answered the question exactly, and his solution is easier to understand and modify for someone with my current skill level.

Reeza's suggestion to use PROC DISTANCE does provide a great solution to the problem that justified my question.
In fact, one would use PROC DISTANCE to compute the Jaccard index, the PROC CLUSTER to inform how to group tasks optimally.
In other words Reeza's suggestion leads to a better solution than the one that justified my question.

That said, I learned several new methods and tips that I am saving to my set of favorite snippets.

s_lassen · Posted 10-01-2023 07:27 AM

I know you already have a solution, but this one may work as well, and can be done without IML:

First, put an ID on the records (if you do not already have one), and transpose to long format:

data with_id;
  set have;
  id=_N_;
run;

proc transpose data=with_id out=long;
  by id;
  var var1-var4;
run;

Then sort by the id and the values, and count:

proc sort data=long;
  by id col1;
  where col1 is not null;
run;

data all;
  do N=1 by 1 until(last.col1);
    set long;
    by id col1;
    length cols $200;
    if N<6 then /* we only want the first 5 */
      call catx(' ',cols,_name_);
    end;
  if N>1;
  n=min(N,5); /* we only count up to 5 */
run;

(if you drop the two lines with the comments, you will get all combinations, not just the first 5)

And then summarize:

proc summary data=all nway;
  class n cols;
  output out=want(rename=(_freq_=shared_count) drop=_TYPE_);
run;

I think this solution is simpler, and it will get you all the combinations you want in one fell swoop.

Jean-Jacques · Posted 10-01-2023 01:33 PM

It is simple, but the Want dataset has only one observation.

This is what is expected for the two-way combinations only:

data want;
input set1 $ set2 $ shared_count;
datalines;
var1 var2 4
var1 var3 9
var1 var4 8
var2 var3 3
var2 var4 2
var3 var4 5
;
run;

s_lassen · Posted 10-02-2023 08:46 AM

Sorry, I misunderstood the problem.

A question: How do you get the number 9 for "var1,var3"? With the data below I only find 7 different combinations, and then there are two different observations where var1=25, giving a total of 8.

I used this data:

data have;
input var1 var2 var3 var4;
datalines;
15 26 3 13
25 28 28 1
30 20 27 12
25 5 10 4
7 6 22 28
6 19 17 7
23 25 6 2
12 . 25 30
2 . 23 8
5 . 30 6
21 . 14 .
8 . 13 .
22 . 2 .
29 . 21 .
1 . . .
;
run;

I changed the code after the sort to this:


data comb2(keep=col1 set1 set2) 
     comb3(keep=col1 set1 set2 set3)  
     comb4(keep=col1 set1 set2 set3 set4) 
     comb5(keep=col1 set1 set2 set3 set4 set5);
  n=0;
  do until(last.col1);
    set long;
    by col1 _name_;
    n+1;
    length cols $200;
    call catx(' ',cols,_name_);
    end;
  if N>1;
  do I1=1 to N-1;
    set1=scan(cols,I1);
    do I2=I1+1 to N;
      set2=scan(cols,I2);
      if set1=set2 then continue;
      output comb2;
      do I3=I2+1 to N;
        set3=scan(cols,I3);
        if set3=set2 then continue;
        output comb3;
        do I4=I3+1 to N;
          set4=scan(cols,I4);
          if set4=set3 then continue;
          output comb4;
          do I5=I4+1 to N;
            set5=scan(cols,I5);
            if set5=set4 then continue;
            output comb5;
            end;
          end;
        end;
      end;
    end;
run;
    
    
proc summary data=comb2 nway;
  class set1 set2;
  output out=want2(drop=_type_ rename=(_freq_=shared_count));
run;

which seems to get what I think is the right result, but not the one you stipulated.

Sorry, you already have a solution; thank you for taking the time to answer anyway.

Jean-Jacques · Posted 10-02-2023 10:15 AM

Yes, you are right, I was sloppy in creating the datasets Have and Want by hand.
I like your solution a lot and will be keeping it.
So the thread contains three excellent and very different solutions.

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!