Use CALL LEXICOMB().
Let's try it for 4 variables and see what we can do. Let's make a dataset with character variables named NAME1 to NAME4 with the variable names.
data names;
input name $32.;
cards;
var1
var2
var3
var4
;
proc transpose data=names out=wide(drop=_name_) prefix=name;
var name;
run;;
Now let's make all possible combinations of taking them 1,2,3,... at a time.
data combi ;
length n k comb 8;
set wide;
array names name:;
n=dim(names);
do k=1 to n;
p=1;
set wide point=p;
do comb=1 to comb(n,k);
call lexcomb(comb,k,of names[*]);
output;
end;
end;
run;
Now let's use that to generate code. And at the same time let's blank out the non included variable names.
filename code temp;
data combi;
set combi end=eof;
array names name: ;
file code;
if _n_=1 then put 'create table counts as ' / ' ' @;
else put ',' @;
put 'sum( ' names[1] @;
do i=2 to k ;
put 'and ' names[i] @ ;
end;
put ') as count_' k +(-1) '_' comb ;
do i=k+1 to n;
names[i]=' ';
end;
if eof then put 'from boolean' / ';' ;
drop i;
run;
SO it generated this code which you could run with %INCLUDE
create table counts as
sum( var1 ) as count_1_1
,sum( var2 ) as count_1_2
,sum( var3 ) as count_1_3
,sum( var4 ) as count_1_4
,sum( var1 and var2 ) as count_2_1
,sum( var1 and var3 ) as count_2_2
,sum( var1 and var4 ) as count_2_3
,sum( var2 and var3 ) as count_2_4
,sum( var2 and var4 ) as count_2_5
,sum( var3 and var4 ) as count_2_6
,sum( var1 and var2 and var3 ) as count_3_1
,sum( var1 and var2 and var4 ) as count_3_2
,sum( var1 and var3 and var4 ) as count_3_3
,sum( var2 and var3 and var4 ) as count_3_4
,sum( var1 and var2 and var3 and var4 ) as count_4_1
from boolean
;
And this is what the COMBI dataset looks like now:
Obs n k comb name1 name2 name3 name4 1 4 1 1 var1 2 4 1 2 var2 3 4 1 3 var3 4 4 1 4 var4 5 4 2 1 var1 var2 6 4 2 2 var1 var3 7 4 2 3 var1 var4 8 4 2 4 var2 var3 9 4 2 5 var2 var4 10 4 2 6 var3 var4 11 4 3 1 var1 var2 var3 12 4 3 2 var1 var2 var4 13 4 3 3 var1 var3 var4 14 4 3 4 var2 var3 var4 15 4 4 1 var1 var2 var3 var4
At first, I thought you wanted to count instances of different variables simultaneously having the same value. But you apparently want to look for whether they EVER has the same value.
You can create a hash object indexed by VALUE, accompanied by an array of dummy variables indicating whether the corresponding original variable ever has that value.
After completing the hash object, generate a list of variable names matching the pattern of dummies, then do a frequency of that variable list.
dm 'clear log;clear out;';
data have;
input var1 var2 var3 var4;
datalines;
15 26 3 13
25 28 28 1
30 20 27 12
25 5 10 4
7 6 22 28
6 19 17 7
23 25 6 2
12 . 25 30
2 . 23 8
5 . 30 6
21 . 14 .
8 . 13 .
22 . 2 .
29 . 21 .
1 . . .
run;
data dummies;
retain value dummy1-dummy4 .;
stop;
run;
data want (keep=value groupsize varlist);
set have end=end_of_have;
array vars {*} _numeric_ ;
if _n_=1 then do;
if 0 then set dummies;
array dum {*} dummy: ;
declare hash h (dataset:'dummies',ordered:'a');
h.definekey('value');
h.definedata(all:'Y');
h.definedone();
declare hiter hi ('h');
end;
do v=1 to dim(vars);
if vars{v}=. then continue; /*Skip to next iteration */
value=vars{v};
if h.find()^=0 then call missing(of dum{*});
dum{v}=1;
h.replace();
end;
if end_of_have;
*Build an array of variable names *;
array vnames {4} $32 ;
length varlist $200 ;
do v=1 to dim(vars);
vnames{v}=vname(vars{v});
end;
do rc=hi.first() by 0 until (hi.next()^=0);
groupsize=sum(of dummy:);
if groupsize=1 then continue;
varlist=' ';
do v=1 to dim(vars);
if dum{v}=1 then varlist=catx(',',varlist,vnames{v});
end;
output;
end;
run;
proc freq data=want;
tables varlist;
run;
This program finds groups of shared values of all sizes, not just dyads.
I presume that if you have a value common to three variables, it is noted as a triple, but NOT as the three doubles that could be extracted from the triple. I.e. no double counting.
/*It is IML thing.*/
data have;
input var1 var2 var3 var4;
datalines;
1 5 2 1
2 6 3 2
5 19 6 4
6 20 10 6
7 25 13 7
8 26 14 8
12 28 17 12
15 . 21 13
17 . 22 28
21 . 23 30
22 . 25 .
23 . 27 .
25 . 28 .
29 . 30 .
30 . . .
;
run;
/*For combinations of TWO variables*/
proc iml;
use have;
read all var _all_ into x[c=vname];
close;
n=ncol(vname);
do i=1 to n-1;
do j=i+1 to n;
set1=set1//vname[i];
set2=set2//vname[j];
a=x[,i];a=a[loc(a^=.)];
b=x[,j];b=b[loc(b^=.)];
shared_count=shared_count//sum(element(a,b));
end;
end;
create want_two var {set1 set2 shared_count};
append;
close;
quit;
/*For combinations of THREE variables*/
proc iml;
use have;
read all var _all_ into x[c=vname];
close;
n=ncol(vname);
do i=1 to n-2;
do j=i+1 to n-1;
do k=j+1 to n;
set1=set1//vname[i];
set2=set2//vname[j];
set3=set3//vname[k];
a=x[,i];a=a[loc(a^=.)];
b=x[,j];b=b[loc(b^=.)];
c=x[,k];c=c[loc(c^=.)];
shared_count=shared_count//sum(element(a,b) & element(a,c));
end;
end;
end;
create want_three var {set1 set2 set3 shared_count};
append;
close;
quit;
You could learn SAS/IML from Rick Wicklin's blog .
https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/bd-p/sas_iml
data have;
input var1-var6;
datalines;
1 5 2 1 1 1
2 6 3 2 2 2
5 19 6 4 4 4
6 20 10 6 6 6
7 25 13 7 7 7
8 26 14 8 8 8
12 28 17 12 12 12
15 . 21 13 13 13
17 . 22 28 28 28
21 . 23 30 30 30
22 . 25 . . .
23 . 27 . . .
25 . 28 . . .
29 . 30 . . .
30 . . . . .
;
run;
/*For combinations of FOUR variables*/
proc iml;
use have;
read all var _all_ into x[c=vname];
close;
n=ncol(vname);
do i=1 to n-3;
do j=i+1 to n-2;
do k=j+1 to n-1;
do l=k+1 to n;
set1=set1//vname[i];
set2=set2//vname[j];
set3=set3//vname[k];
set4=set4//vname[l];
a=x[,i];a=a[loc(a^=.)];
b=x[,j];b=b[loc(b^=.)];
c=x[,k];c=c[loc(c^=.)];
d=x[,l];d=d[loc(d^=.)];
shared_count=shared_count//sum(element(a,b) & element(a,c) & element(a,d));
end;
end;
end;
end;
create want_four var {set1 set2 set3 set4 shared_count};
append;
close;
quit;
/*For combinations of FIVE variables*/
proc iml;
use have;
read all var _all_ into x[c=vname];
close;
n=ncol(vname);
do i=1 to n-4;
do j=i+1 to n-3;
do k=j+1 to n-2;
do l=k+1 to n-1;
do m=l+1 to n;
set1=set1//vname[i];
set2=set2//vname[j];
set3=set3//vname[k];
set4=set4//vname[l];
set5=set5//vname[m];
a=x[,i];a=a[loc(a^=.)];
b=x[,j];b=b[loc(b^=.)];
c=x[,k];c=c[loc(c^=.)];
d=x[,l];d=d[loc(d^=.)];
e=x[,m];e=e[loc(e^=.)];
shared_count=shared_count//sum(element(a,b) & element(a,c) & element(a,d) & element(a,e));
end;
end;
end;
end;
end;
create want_five var {set1 set2 set3 set4 set5 shared_count};
append;
close;
quit;
Thank you!
I was struggling with how to define the elements in
sum(element(a,b) & element(a,c));
for the higher order combinations. Your additional examples for combinations of four and five variables demonstrate how to do it.
Additional notes.
Thank you to all. The community's generosity is always amazing.
KSharp answered the original question exactly and his solution is very economical. It requires understanding IML.
Tom also answered the question exactly, and his solution is easier to understand and modify for someone with my current skill level.
Reeza's suggestion to use PROC DISTANCE does provide a great solution to the problem that justified my question.
In fact, one would use PROC DISTANCE to compute the Jaccard index, the PROC CLUSTER to inform how to group tasks optimally.
In other words Reeza's suggestion leads to a better solution than the one that justified my question.
That said, I learned several new methods and tips that I am saving to my set of favorite snippets.
I know you already have a solution, but this one may work as well, and can be done without IML:
First, put an ID on the records (if you do not already have one), and transpose to long format:
data with_id;
set have;
id=_N_;
run;
proc transpose data=with_id out=long;
by id;
var var1-var4;
run;
Then sort by the id and the values, and count:
proc sort data=long;
by id col1;
where col1 is not null;
run;
data all;
do N=1 by 1 until(last.col1);
set long;
by id col1;
length cols $200;
if N<6 then /* we only want the first 5 */
call catx(' ',cols,_name_);
end;
if N>1;
n=min(N,5); /* we only count up to 5 */
run;
(if you drop the two lines with the comments, you will get all combinations, not just the first 5)
And then summarize:
proc summary data=all nway;
class n cols;
output out=want(rename=(_freq_=shared_count) drop=_TYPE_);
run;
I think this solution is simpler, and it will get you all the combinations you want in one fell swoop.
Sorry, I misunderstood the problem.
A question: How do you get the number 9 for "var1,var3"? With the data below I only find 7 different combinations, and then there are two different observations where var1=25, giving a total of 8.
I used this data:
data have;
input var1 var2 var3 var4;
datalines;
15 26 3 13
25 28 28 1
30 20 27 12
25 5 10 4
7 6 22 28
6 19 17 7
23 25 6 2
12 . 25 30
2 . 23 8
5 . 30 6
21 . 14 .
8 . 13 .
22 . 2 .
29 . 21 .
1 . . .
;
run;
I changed the code after the sort to this:
data comb2(keep=col1 set1 set2)
comb3(keep=col1 set1 set2 set3)
comb4(keep=col1 set1 set2 set3 set4)
comb5(keep=col1 set1 set2 set3 set4 set5);
n=0;
do until(last.col1);
set long;
by col1 _name_;
n+1;
length cols $200;
call catx(' ',cols,_name_);
end;
if N>1;
do I1=1 to N-1;
set1=scan(cols,I1);
do I2=I1+1 to N;
set2=scan(cols,I2);
if set1=set2 then continue;
output comb2;
do I3=I2+1 to N;
set3=scan(cols,I3);
if set3=set2 then continue;
output comb3;
do I4=I3+1 to N;
set4=scan(cols,I4);
if set4=set3 then continue;
output comb4;
do I5=I4+1 to N;
set5=scan(cols,I5);
if set5=set4 then continue;
output comb5;
end;
end;
end;
end;
end;
run;
proc summary data=comb2 nway;
class set1 set2;
output out=want2(drop=_type_ rename=(_freq_=shared_count));
run;
which seems to get what I think is the right result, but not the one you stipulated.
Sorry, you already have a solution; thank you for taking the time to answer anyway.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.