Data mavens,
I thought this was going to be simple, but I am stuck.
I have datasets with 4 to 40 variables and up to 10,000 observations.
All variables are numerical. There are no duplicate values within any one variable, but there are many duplicates across variables.
Those shared values between variables are expected, and I do not want to eliminate them.
Some variables have many missings.
For any of those datasets, I want a count of the number of values that are identical across combinations of variables.
In other words, looking at each variable as a set, I want the count of points in the intersection of combinations of variables.
First all pairwise combinations, then all triplets, and up to four-way combinations, but not necessarily all possible combinations when the number of variables grows beyond five.
The "sets" to which each variable corresponds are used for planning work tasks. The purpose of looking at intersections is to help users decide whether to merge some of those tasks to avoid duplication of effort. If the intersection is large, then they would merge the two sets of tasks.
So, starting with this "have" data, the "want" data for all pairwise combinations looks like the following. The exact shape of that "want" could be different.
data have;
input var1 var2 var3 var4;
datalines;
15 26 3 13
25 28 28 1
30 20 27 12
25 5 10 4
7 6 22 28
6 19 17 7
23 25 6 2
12 . 25 30
2 . 23 8
5 . 30 6
21 . 14 .
8 . 13 .
22 . 2 .
29 . 21 .
1 . . .
;
run;
data want;
input set1 $ set2 $ shared_count;
datalines;
var1 var2 4
var1 var3 9
var1 var4 8
var2 var3 3
var2 var4 2
var3 var4 5
;
run;
Is it really simple, and I am not seeing it?
Thank you all.
You could learn SAS/IML from Rick Wicklin's blog:
https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/bd-p/sas_iml
data have;
input var1-var6;
datalines;
1 5 2 1 1 1
2 6 3 2 2 2
5 19 6 4 4 4
6 20 10 6 6 6
7 25 13 7 7 7
8 26 14 8 8 8
12 28 17 12 12 12
15 . 21 13 13 13
17 . 22 28 28 28
21 . 23 30 30 30
22 . 25 . . .
23 . 27 . . .
25 . 28 . . .
29 . 30 . . .
30 . . . . .
;
run;
/*For combinations of FOUR variables*/
proc iml;
use have;
read all var _all_ into x[c=vname];
close;
n=ncol(vname);
do i=1 to n-3;
do j=i+1 to n-2;
do k=j+1 to n-1;
do l=k+1 to n;
set1=set1//vname[i];
set2=set2//vname[j];
set3=set3//vname[k];
set4=set4//vname[l];
a=x[,i];a=a[loc(a^=.)];
b=x[,j];b=b[loc(b^=.)];
c=x[,k];c=c[loc(c^=.)];
d=x[,l];d=d[loc(d^=.)];
shared_count=shared_count//sum(element(a,b) & element(a,c) & element(a,d));
end;
end;
end;
end;
create want_four var {set1 set2 set3 set4 shared_count};
append;
close;
quit;
/*For combinations of FIVE variables*/
proc iml;
use have;
read all var _all_ into x[c=vname];
close;
n=ncol(vname);
do i=1 to n-4;
do j=i+1 to n-3;
do k=j+1 to n-2;
do l=k+1 to n-1;
do m=l+1 to n;
set1=set1//vname[i];
set2=set2//vname[j];
set3=set3//vname[k];
set4=set4//vname[l];
set5=set5//vname[m];
a=x[,i];a=a[loc(a^=.)];
b=x[,j];b=b[loc(b^=.)];
c=x[,k];c=c[loc(c^=.)];
d=x[,l];d=d[loc(d^=.)];
e=x[,m];e=e[loc(e^=.)];
shared_count=shared_count//sum(element(a,b) & element(a,c) & element(a,d) & element(a,e));
end;
end;
end;
end;
end;
create want_five var {set1 set2 set3 set4 set5 shared_count};
append;
close;
quit;
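The same pattern reduces to pairwise combinations as well; a minimal sketch that simply follows the structure of the code above (the want_two dataset name is mine):
/*For combinations of TWO variables*/
proc iml;
use have;
read all var _all_ into x[c=vname];
close;
n=ncol(vname);
do i=1 to n-1;
do j=i+1 to n;
set1=set1//vname[i];
set2=set2//vname[j];
a=x[,i];a=a[loc(a^=.)];
b=x[,j];b=b[loc(b^=.)];
/* count of nonmissing values of column i that also appear in column j */
shared_count=shared_count//sum(element(a,b));
end;
end;
create want_two var {set1 set2 shared_count};
append;
close;
quit;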
Possible approach:
data have;
input var1 var2 var3 var4;
datalines;
15 26 3 13
25 28 28 1
30 20 27 12
25 5 10 4
7 6 22 28
6 19 17 7
23 25 6 2
12 . 25 30
2 . 23 8
5 . 30 6
21 . 14 .
8 . 13 .
22 . 2 .
29 . 21 .
1 . . .
;
run;
data w1;
set have;
array v var1--var4;
do over v;
if v NE . then
do;
val = v;
vnm = vname(v);
output;
end;
end;
keep val vnm;
run;
proc sql;
create table w2 as
select
a.vnm as v1,
b.vnm as v2,
a.val
from
w1 as a
inner join
w1 as b
on a.val=b.val
having a.vnm <> b.vnm
order by 3,1,2
;
run;
data w3;
set w2;
call sortc(v1,v2);
run;
proc sort data=w3 nodupkey;
by v1 v2 val;
run;
proc sql;
select v1, v2, count(distinct val) as common_values
from w3
group by v1, v2;
quit;
[EDIT:] It will give you the number of "unique common values": if var1 is 3,3,3 and var2 is 3,3,3, you will get 1.
Bart
I did a small test run with 40-by-10,000 data; it looks quite good (done on my laptop):
1
2 data have;
3 array var[40];
4 do _N_=1 to 10000;
5
6 var1=_N_;
7
8 do i=2 to 40; drop i;
9 if ranuni(123)>0.5 then var[i]=.;
10 else var[i]=_N_;
11 end;
12
13 output;
14 end;
15 run;
NOTE: The data set WORK.HAVE has 10000 observations and 40 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
16
17 options fullstimer;
18 data w1;
19 set have;
20
21 array v var:;
22
23 do over v;
24 if v NE . then
25 do;
26 val = v;
27 vnm = vname(v);
28 output;
29 end;
30 end;
31 keep val vnm;
32 run;
NOTE: There were 10000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.W1 has 205049 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
user cpu time 0.01 seconds
system cpu time 0.03 seconds
memory 677.59k
OS Memory 24056.00k
Timestamp 09/25/2023 07:30:37 PM
Step Count 22 Switch Count 0
33
34 proc sql;
35 create table w2 as
36 select
37 a.vnm as v1,
38 b.vnm as v2,
39 a.val
40 from
41 w1 as a
42 inner join
43 w1 as b
44 on a.val=b.val
45 having a.vnm <> b.vnm
46 order by 3,1,2
47 ;
NOTE: The "<>" operator is interpreted as "not equals".
NOTE: Table WORK.W2 created, with 4098054 rows and 3 columns.
48 run;
NOTE: PROC SQL statements are executed immediately; The RUN statement has no effect.
49
NOTE: PROCEDURE SQL used (Total process time):
real time 1.34 seconds
user cpu time 1.85 seconds
system cpu time 1.39 seconds
memory 3387087.71k
OS Memory 3409428.00k
Timestamp 09/25/2023 07:30:39 PM
Step Count 23 Switch Count 0
50 data w3;
51 set w2;
52 call sortc(v1,v2);
53 run;
NOTE: There were 4098054 observations read from the data set WORK.W2.
NOTE: The data set WORK.W3 has 4098054 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.76 seconds
user cpu time 0.23 seconds
system cpu time 0.53 seconds
memory 630.15k
OS Memory 24056.00k
Timestamp 09/25/2023 07:30:39 PM
Step Count 24 Switch Count 0
54
55 proc sort data=w3 nodupkey;
56 by v1 v2 val;
57 run;
NOTE: There were 4098054 observations read from the data set WORK.W3.
NOTE: 2049027 observations with duplicate key values were deleted.
NOTE: The data set WORK.W3 has 2049027 observations and 3 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 2.11 seconds
user cpu time 2.79 seconds
system cpu time 1.40 seconds
memory 3409925.96k
OS Memory 3433192.00k
Timestamp 09/25/2023 07:30:42 PM
Step Count 25 Switch Count 0
58
59 proc sql;
60 select v1, v2, count(distinct val) as common_values
61 from w3
62 group by v1, v2;
63 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.79 seconds
user cpu time 0.67 seconds
system cpu time 0.11 seconds
memory 5412.62k
OS Memory 29180.00k
Timestamp 09/25/2023 07:30:42 PM
Step Count 26 Switch Count 0
Bart
This seems to work as expected, and fast.
I am glad to see that it is a lot less simple than what I had been trying.
Next I need to figure out how to generate triplets. Your suggestions are appreciated!
The same way as pairs (but now the space requirement grows faster; read the comments in the code).
data w1;
set have;
array v var1--var40;
do over v;
if v NE . then
do;
val = v;
length vnm $ 5; /* if you know that variable names are shorter than 32, make the length shorter to save space,
especially for the next steps:
with length 32, w1 is 7.9GB and w3 is 1.3GB;
with length 5, w1 is only 3.3MB!! and w3 is 307MB
*/
vnm = vname(v);
output;
end;
end;
keep val vnm;
run;
proc sql;
create table temp1 as
select
a.vnm as v1,
b.vnm as v2,
a.val
from
w1 as a
inner join
w1 as b
on a.val=b.val
having a.vnm <> b.vnm
;
create table w2 as
select
x.v1,
x.v2,
y.vnm as v3,
x.val
from
temp1 as x
inner join
w1 as y
on x.val=y.val
having y.vnm <> x.v1 and y.vnm <> x.v2
order by 4,1,2,3
;
run;
proc delete data=w1;
run;
data w3;
set w2;
call sortc(v1,v2,v3);
run;
proc delete data=w2;
run;
proc sort data=w3 nodupkey;
by v1 v2 v3 val;
run;
proc sql;
select v1, v2, v3, count(distinct val) as common_values
from w3
group by v1, v2, v3;
quit;
Bart
NOTE: The "<>" operator is interpreted as "not equals".
Yes, that's the intention. I want "a.vnm <> b.vnm" (not equal) to avoid rows like: "var1", "var1", 5.
B.
P.S.:
Code:
resetline;
data have;
input x y;
cards;
1 0
1 1
0 1
0 0
;
run;
data want;
set have;
where x <> y;
z = x<>y;
run;
Log:
1    data have;
2    input x y;
3    cards;
NOTE: The data set WORK.HAVE has 4 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              408.90k
      OS Memory           36604.00k
8    ;
9    run;
10
11   data want;
12   set have;
13   where x <> y;
NOTE: The "<>" operator is interpreted as "not equals".
14   z = x<>y;
NOTE: The "<>" operator is interpreted as "MAX".
15   run;
NOTE: There were 2 observations read from the data set WORK.HAVE.
      WHERE x not = y;
NOTE: The data set WORK.WANT has 2 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              575.65k
      OS Memory           36604.00k
#SASlife ... 😄 😄 😄
PROC SUMMARY might do it. It works fine for your simple case, but with 40 * 10,000 values it might run out of memory.
proc summary data=have chartype;
class var1-var4 / missing;
ways 2;
output out=step1;
run;
Which will show all combinations of 2 variables. The _TYPE_ variable will indicate which variables are included.
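If you want the pair of variable names rather than the _TYPE_ pattern, the character _TYPE_ string (one 0/1 per CLASS variable, in CLASS-statement order because of CHARTYPE) can be decoded back to names. A minimal sketch; the type_pairs name and the hard-coded name list are mine:
data type_pairs;
set step1;
array cls [4] $32 _temporary_ ('var1' 'var2' 'var3' 'var4');
length set1 set2 $32;
/* pick up the two variables flagged by the _TYPE_ string */
do i=1 to 4;
if substr(_type_,i,1)='1' then do;
if set1=' ' then set1=cls[i];
else set2=cls[i];
end;
end;
drop i;
run;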
I am not sure how you got the counts you show. If I count how many observations have two non-missing values for each pair:
proc freq data=step1;
where 2=n(var1,var2,var3,var4);
tables _type_ / out=want noprint;
run;
I get:
Obs    _TYPE_    COUNT    PERCENT
  1     0011        10    18.1818
  2     0101         7    12.7273
  3     0110         7    12.7273
  4     1001        10    18.1818
  5     1010        14    25.4545
  6     1100         7    12.7273
So 7 combinations of VAR1 and VAR2.
Obs    var1    var2    var3    var4    _TYPE_    _FREQ_
 72       1       .       .       .      1100         1
 73       2       .       .       .      1100         1
 74       5       .       .       .      1100         1
 75       6      19       .       .      1100         1
 76       7       6       .       .      1100         1
 77       8       .       .       .      1100         1
 78      12       .       .       .      1100         1
 79      15      26       .       .      1100         1
 80      21       .       .       .      1100         1
 81      22       .       .       .      1100         1
 82      23      25       .       .      1100         1
 83      25       5       .       .      1100         1
 84      25      28       .       .      1100         1
 85      29       .       .       .      1100         1
 86      30      20       .       .      1100         1
Are you considering VAR1=1 and VAR2=2 the same as VAR1=2 and VAR2=1?
Apologies!
As Tom noticed, some of my counts are incorrect.
I mentioned in the post that there are no duplicates within each variable, but my "have" data had some.
Here is the corrected data.
As Tom pointed out, PROC SUMMARY gets overwhelmed when I try it with the full datasets.
data have;
input var1 var2 var3 var4;
datalines;
1 5 2 1
2 6 3 2
5 19 6 4
6 20 10 6
7 25 13 7
8 26 14 8
12 28 17 12
15 . 21 13
17 . 22 28
21 . 23 30
22 . 25 .
23 . 27 .
25 . 28 .
29 . 30 .
30 . . .
;
run;
data want;
input set1 $ set2 $ shared_count;
datalines;
var1 var2 3
var1 var3 8
var1 var4 7
var2 var3 3
var2 var4 2
var3 var4 5
;
run;
Hope I am not getting them wrong again. I am just doing a manual count.
The shared values between var1 and var2 are 5, 6, and 25, thus count=3.
I do not understand what your data structure means. You seem to be treating this data as 4 independent one-variable datasets.
What is the meaning of the first observation?
var1 var2 var3 var4;
1 5 2 1
@Tom wrote:
You seem to be treating this data as 4 independent one-variable datasets
In a way, yes, that's correct. See post.
The observations don't really "matter". All I want to know is how many points are in the intersection of any two, three, or four of those sets.
Thus in the one observation you show, the intersection of var1 and var2 is empty, but the intersection of var1 and var4 contains 1 point (whose value happens to be 1, so that's confusing).
So you have 40 lists of IDs.
It might be easier if you made it into a single list with two variables.
If you have a variable that indicates each original observation, you could use PROC TRANSPOSE.
If you don't have one, then make one:
data have;
row+1;
set have;
run;
If you transpose you can then sort by the actual ID value.
proc transpose data=have out=tall(rename=(col1=ID) where=(not missing(id)));
by row;
var var: ;
run;
proc sort;
by id _name_;
run;
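From that tall, sorted layout, pairwise shared counts can then be obtained with a self-join on the ID value (the same idea as the earlier SQL approach). A minimal sketch; the pair_counts name is mine:
proc sql;
create table pair_counts as
select a._name_ as set1, b._name_ as set2,
count(distinct a.id) as shared_count
from tall as a inner join tall as b
on a.id = b.id
where a._name_ < b._name_ /* keep each unordered pair once */
group by a._name_, b._name_;
quit;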
A compact way to store them might be a series of BOOLEAN variables instead.
If the IDs are contiguous (like in your example they run from 1 to 30) then an array might be a good way to store them.
data boolean;
length id 8 ;
array flag [30,4] _temporary_ (%eval(30*4)*0);
array var [4] ;
set have end=eof;
do i=1 to 4;
id = var[i];
if id then flag[id,i]=1;
end;
if eof then do id=1 to 30;
do i=1 to 4;
var[i]=flag[id,i];
end;
if max(of var[*]) then output;
keep id var1-var4;
end;
run;
Result
Obs    id    var1    var2    var3    var4
  1     1       1       0       0       1
  2     2       1       0       1       1
  3     3       0       0       1       0
  4     4       0       0       0       1
  5     5       1       1       0       0
  6     6       1       1       1       1
  7     7       1       0       0       1
  8     8       1       0       0       1
  9    10       0       0       1       0
 10    12       1       0       0       1
 11    13       0       0       1       1
 12    14       0       0       1       0
 13    15       1       0       0       0
 14    17       1       0       1       0
 15    19       0       1       0       0
 16    20       0       1       0       0
 17    21       1       0       1       0
 18    22       1       0       1       0
 19    23       1       0       1       0
 20    25       1       1       1       0
 21    26       0       1       0       0
 22    27       0       0       1       0
 23    28       0       1       1       1
 24    29       1       0       0       0
 25    30       1       0       1       1
Now it is simple to get counts for combinations.
proc sql;
create table counts as
select sum(var1 and var2) as var1_var2
, sum(var1 and var3) as var1_var3
from boolean
;
quit;
Obs    var1_var2    var1_var3
  1            3            8
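Triplets and four-way counts follow the same pattern from the boolean table; a minimal sketch extending the query above (the counts3 name and the column aliases are mine):
proc sql;
create table counts3 as
select sum(var1 and var2 and var3) as var1_var2_var3
, sum(var1 and var2 and var4) as var1_var2_var4
, sum(var1 and var3 and var4) as var1_var3_var4
, sum(var2 and var3 and var4) as var2_var3_var4
from boolean
;
quit;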
@Tom wrote:
So you have 40 lists of IDs.
It might be easier if you made it into a single list with two variables.
Yes, that is a much better way to describe the data.
I like the simplicity of this approach a lot.
Three questions:
1. In the final proc sql step, I would have to spell out all the combinations I want. Suggestions on how to generate all pairwise combinations? All triplets, etc.? (See the sketch after this list.)
2. Unfortunately, my variables are not named as conveniently as var1-varn. Suggestions on handling that as economically as possible?
3. Likewise, my IDs are numerical, but not as friendly as 1-30. Should I pull the min and max as macro variables and use them in the array?
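On question 1, one possible direction (a sketch, not from the thread): read the flag variable names from DICTIONARY.COLUMNS and let PROC SQL build the SUM expressions for every pair. The pairwise_counts table name is mine, and it assumes the boolean table is WORK.BOOLEAN, the ID column is named ID, and each generated alias fits in 32 characters.
proc sql noprint;
/* build one "sum(a and b) as a_b" term per unordered pair of flag columns */
select catx(' ', 'sum(', a.name, 'and', b.name, ') as', cats(a.name, '_', b.name))
into :pairlist separated by ' , '
from dictionary.columns as a, dictionary.columns as b
where a.libname='WORK' and a.memname='BOOLEAN' and upcase(a.name) ne 'ID'
and b.libname='WORK' and b.memname='BOOLEAN' and upcase(b.name) ne 'ID'
and a.name < b.name;
quit;
proc sql;
create table pairwise_counts as
select &pairlist
from boolean;
quit;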