BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Jean-Jacques
Obsidian | Level 7

Data mavens,

I thought this was going to be simple but I am stuck.
I have datasets with 4-40 variables. Observations up to 10,000.
All variables are numerical. There are no duplicates within any variable, but many duplicates across.
Those shared values between variables are  expected and I do not want to eliminate them.
Some variables have many missings.

For any of those datasets, I want a count of the number of values that are identical in combinations of variables.
In other words, looking at each variable as a set, I want the count of points in the intersection of combinations of variables.
First, all pairwise combinations, then all triplets, and up to four way combinations, but not necessarily all possible combinations when the number of variables gets larger than five.

The "sets" to which each variable corresponds are used for planning work tasks. The purpose of looking at intersections is to help users decide whether to merge some of those tasks to avoid duplication of effort. If the intersection is large, then they would merge the two sets of tasks.

So, starting with this "have" data, the "want" data for all pairwise combinations looks like the following. The exact shape of that "want" could be different.

data have;
input var1 var2 var3 var4;
datalines;
15 26 3 13
25 28 28 1
30 20 27 12
25 5 10 4
7 6 22 28
6 19 17 7
23 25 6 2
12 . 25 30
2 . 23 8
5 . 30 6
21 . 14 .
8 . 13 .
22 . 2 .
29 . 21 .
1 . . .
;
run;

data want;
input set1 $ set2 $ shared_count;
datalines;
var1 var2 4
var1 var3 9
var1 var4 8
var2 var3 3
var2 var4 2
var3 var4 5
;
run;

Is it really simple, and I am not seeing it?
Thank you all.

 

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

You could learn SAS/IML from Rick Wicklin's blog .

https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/bd-p/sas_iml

 

data have;
input var1-var6;
datalines;
1 5 2     1    1    1
2 6 3     2    2    2 
5 19 6    4    4    4
6 20 10   6    6    6
7 25 13   7    7    7
8 26 14  8    8    8
12 28 17 12   12   12
15 . 21  13   13   13
17 . 22  28   28   28
21 . 23  30   30   30
22 . 25  .    .    .
23 . 27  .    .    .
25 . 28  .    .    .
29 . 30  .    .    .
30 . .   .    .    .
;
run;
/*For  combinations of FOUR variables*/
proc iml;
use have;
read all var _all_ into x[c=vname];
close;
n=ncol(vname);
do i=1 to n-3;
 do j=i+1 to n-2;
  do k=j+1 to n-1;
   do l=k+1 to n;
    set1=set1//vname[i];
    set2=set2//vname[j];
    set3=set3//vname[k];
	set4=set4//vname[l];

    a=x[,i];a=a[loc(a^=.)];
    b=x[,j];b=b[loc(b^=.)];
    c=x[,k];c=c[loc(c^=.)];
    d=x[,l];d=d[loc(d^=.)];

    shared_count=shared_count//sum(element(a,b) & element(a,c) & element(a,d)); 
   end;
  end;
 end;
end;
create want_four var {set1 set2 set3 set4 shared_count};
append;
close;
quit;

/*For  combinations of FIVE variables*/
proc iml;
use have;
read all var _all_ into x[c=vname];
close;
n=ncol(vname);
do i=1 to n-4;
 do j=i+1 to n-3;
  do k=j+1 to n-2;
   do l=k+1 to n-1;
    do m=l+1 to n;
     set1=set1//vname[i];
     set2=set2//vname[j];
     set3=set3//vname[k];
 	 set4=set4//vname[l];
 	 set5=set5//vname[m];

     a=x[,i];a=a[loc(a^=.)];
     b=x[,j];b=b[loc(b^=.)];
     c=x[,k];c=c[loc(c^=.)];
     d=x[,l];d=d[loc(d^=.)];
     e=x[,m];e=e[loc(e^=.)];

    shared_count=shared_count//sum(element(a,b) & element(a,c) & element(a,d) & element(a,e)); 
	end;
   end;
  end;
 end;
end;
create want_five var {set1 set2 set3 set4 set5 shared_count};
append;
close;
quit;

View solution in original post

28 REPLIES 28
yabwon
Onyx | Level 15

Possible approach:

data have;
input var1 var2 var3 var4;
datalines;
15 26 3 13
25 28 28 1
30 20 27 12
25 5 10 4
7 6 22 28
6 19 17 7
23 25 6 2
12 . 25 30
2 . 23 8
5 . 30 6
21 . 14 .
8 . 13 .
22 . 2 .
29 . 21 .
1 . . .
;
run;


data w1;
set have;

array v var1--var4;

do over v;
    if v NE . then
    do;
    val = v;
    vnm = vname(v);
    output;
    end;
end;
keep val vnm;
run;

proc sql;
create table w2 as
select 
  a.vnm as v1,
  b.vnm as v2,
  a.val 
from
w1 as a
inner join
w1 as b
on a.val=b.val
having a.vnm <> b.vnm
order by 3,1,2
;
run;

data w3;
set w2;
call sortc(v1,v2);
run;

proc sort data=w3 nodupkey;
  by v1 v2 val;
run;

proc sql;
select v1, v2, count(distinct val) as common_values
from w3
group by v1, v2;
quit;

[EDIT:] it will give you number of "unique common values", if var1 is 3,3,3, and var2 is 3,3,3 you will get you 1.

 

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



yabwon
Onyx | Level 15

I did small test run with 40 by 10000 data, looks quite good (done on my laptop):

 

1
2    data have;
3    array var[40];
4    do _N_=1 to 10000;
5
6    var1=_N_;
7
8    do i=2 to 40; drop i;
9      if ranuni(123)>0.5 then var[i]=.;
10                        else var[i]=_N_;
11   end;
12
13   output;
14   end;
15   run;

NOTE: The data set WORK.HAVE has 10000 observations and 40 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


16
17   options fullstimer;
18   data w1;
19   set have;
20
21   array v var:;
22
23   do over v;
24       if v NE . then
25       do;
26       val = v;
27       vnm = vname(v);
28       output;
29       end;
30   end;
31   keep val vnm;
32   run;

NOTE: There were 10000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.W1 has 205049 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      user cpu time       0.01 seconds
      system cpu time     0.03 seconds
      memory              677.59k
      OS Memory           24056.00k
      Timestamp           09/25/2023 07:30:37 PM
      Step Count                        22  Switch Count  0


33
34   proc sql;
35   create table w2 as
36   select
37     a.vnm as v1,
38     b.vnm as v2,
39     a.val
40   from
41   w1 as a
42   inner join
43   w1 as b
44   on a.val=b.val
45   having a.vnm <> b.vnm
46   order by 3,1,2
47   ;
NOTE: The "<>" operator is interpreted as "not equals".
NOTE: Table WORK.W2 created, with 4098054 rows and 3 columns.

48   run;
NOTE: PROC SQL statements are executed immediately; The RUN statement has no effect.
49
NOTE: PROCEDURE SQL used (Total process time):
      real time           1.34 seconds
      user cpu time       1.85 seconds
      system cpu time     1.39 seconds
      memory              3387087.71k
      OS Memory           3409428.00k
      Timestamp           09/25/2023 07:30:39 PM
      Step Count                        23  Switch Count  0


50   data w3;
51   set w2;
52   call sortc(v1,v2);
53   run;

NOTE: There were 4098054 observations read from the data set WORK.W2.
NOTE: The data set WORK.W3 has 4098054 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.76 seconds
      user cpu time       0.23 seconds
      system cpu time     0.53 seconds
      memory              630.15k
      OS Memory           24056.00k
      Timestamp           09/25/2023 07:30:39 PM
      Step Count                        24  Switch Count  0


54
55   proc sort data=w3 nodupkey;
56     by v1 v2 val;
57   run;

NOTE: There were 4098054 observations read from the data set WORK.W3.
NOTE: 2049027 observations with duplicate key values were deleted.
NOTE: The data set WORK.W3 has 2049027 observations and 3 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           2.11 seconds
      user cpu time       2.79 seconds
      system cpu time     1.40 seconds
      memory              3409925.96k
      OS Memory           3433192.00k
      Timestamp           09/25/2023 07:30:42 PM
      Step Count                        25  Switch Count  0


58
59   proc sql;
60   select v1, v2, count(distinct val) as common_values
61   from w3
62   group by v1, v2;
63   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.79 seconds
      user cpu time       0.67 seconds
      system cpu time     0.11 seconds
      memory              5412.62k
      OS Memory           29180.00k
      Timestamp           09/25/2023 07:30:42 PM
      Step Count                        26  Switch Count  0

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



Jean-Jacques
Obsidian | Level 7

This seems to work as expected, and fast.
I am glad to see that it is a lot less simple than what I had been trying.

Next I need to figure out how to generate triplets. Your suggestions are appreciated !

 

yabwon
Onyx | Level 15

The same way as pairs (but now space requirements grows faster - read comments in the code)

 

data w1;
set have;

array v var1--var40;

do over v;
    if v NE . then
    do;
    val = v;
    length vnm $ 5;  /* if you know that variables names are shorter than make it shorter for space saving,
                        especially for next steps
                        with length 32 w1 is 7.9GB, w3 is 1.3GB
                        with length  5 w1 is only 3.3MB!! w3 is 307MB
                      */
    vnm = vname(v);
    output;
    end;
end;
keep val vnm;
run;

proc sql;
create table temp1 as
  select 
    a.vnm as v1,
    b.vnm as v2,
    a.val 
  from
  w1 as a
  inner join
  w1 as b
  on a.val=b.val
  having a.vnm <> b.vnm

;

create table w2 as
select 
  x.v1,
  x.v2,
  y.vnm as v3,
  x.val 
from
temp1 as x
inner join
w1 as y
on x.val=y.val
having y.vnm <> x.v1 and y.vnm <> x.v2

order by 4,1,2,3
;
run;

proc delete data=w1;
run;

data w3;
set w2;
call sortc(v1,v2,v3);
run;

proc delete data=w2;
run;

proc sort data=w3 nodupkey;
  by v1 v2 v3 val;
run;

proc sql;
select v1, v2, v3, count(distinct val) as common_values
from w3
group by v1, v2, v3;
quit;

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



Tom
Super User Tom
Super User
NOTE: The "<>" operator is interpreted as "not equals".
yabwon
Onyx | Level 15

yes, that's the intention, I want: "a.var<>b.var" (not equal) to avoid rows like : "var1", "var1", 5

 

B.

 

P.S.:

Code:

resetline;
data have;
input x y;
cards;
1 0
1 1
0 1
0 0
;
run;

data want;
set have;
where x <> y;
z = x<>y;
run;

Log:

1    data have;
2    input x y;
3    cards;

NOTE: The data set WORK.HAVE has 4 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              408.90k
      OS Memory           36604.00k

8    ;
9    run;
10
11   data want;
12   set have;
13   where x <> y;
NOTE: The "<>" operator is interpreted as "not equals".
14   z = x<>y;
NOTE: The "<>" operator is interpreted as "MAX".
15   run;

NOTE: There were 2 observations read from the data set WORK.HAVE.
      WHERE x not = y;
NOTE: The data set WORK.WANT has 2 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              575.65k
      OS Memory           36604.00k

#SASlife ... 😄 😄 😄

 

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



Tom
Super User Tom
Super User

PROC SUMMARY might do it. Works fine for your simple case.  But with 40 * 10,0000 values it might run out of memory.

proc summary data=have chartype;
 class var1-var4 / missing;
 ways 2;
 output out=want;
run;

Which will show all combinations of 2 variables.  The _TYPE_ variable will indicate which variables are included.

I am not sure how you got the counts you show.   If I count home many observations have two non-missing values of each pair.

proc freq data=step1;
  where 2=n(var1,var2,var3,var4);
  tables _type_ / out=want noprint;
run;

I get:

Obs    _TYPE_    COUNT    PERCENT

 1      0011       10     18.1818
 2      0101        7     12.7273
 3      0110        7     12.7273
 4      1001       10     18.1818
 5      1010       14     25.4545
 6      1100        7     12.7273

So 7 combinations of VAR1 and VAR2.

 72      1       .       .       .      1100        1
 73      2       .       .       .      1100        1
 74      5       .       .       .      1100        1
 75      6      19       .       .      1100        1
 76      7       6       .       .      1100        1
 77      8       .       .       .      1100        1
 78     12       .       .       .      1100        1
 79     15      26       .       .      1100        1
 80     21       .       .       .      1100        1
 81     22       .       .       .      1100        1
 82     23      25       .       .      1100        1
 83     25       5       .       .      1100        1
 84     25      28       .       .      1100        1
 85     29       .       .       .      1100        1
 86     30      20       .       .      1100        1


 

Are you considering VAR=1 and VAR=2 the same as VAR1=2 and VAR2=1 ?

 

Jean-Jacques
Obsidian | Level 7

Apologies! 
As Tom noticed, some of my counts are incorrect.
I mentioned in the post that there are no duplicates within each variable, and my "have" data had some.
Here is the corrected data.

As Tom pointed out, PROC SUMMARY gets overwhelmed when I try it with the full datasets.

data have;
input var1 var2 var3 var4;
datalines;
1 5 2 1
2 6 3 2
5 19 6 4
6 20 10 6
7 25 13 7
8 26 14 8
12 28 17 12
15 . 21 13
17 . 22 28
21 . 23 30
22 . 25 .
23 . 27 .
25 . 28 .
29 . 30 .
30 . . .

;
run;

data want;
input set1 $ set2 $ shared_count;
datalines;
var1 var2 3
var1 var3 8
var1 var4 7
var2 var3 3
var2 var4 2
var3 var4 5
;
run;





Reeza
Super User
Are your counts here correct? How is shared calculated?
Jean-Jacques
Obsidian | Level 7

Hope I am not getting them wrong again. I am just doing a manual count.
the shared values between var1 and var2 are 5, 6, 25, thus count=3

Tom
Super User Tom
Super User

I do not understand what your data structure means.    You seem to be treating this data as 4 independent one variable datasets.

 

What is the meaning the first observation.

var1 var2 var3 var4;
1 5 2 1

 

Jean-Jacques
Obsidian | Level 7

>
You seem to be treating this data as 4 independent one variable datasets
<
In a way, yes, that's correct. See post.

The observations don't really "matter". All I want to know is how many points are in the intersection of any two, three, or four of those sets.
Thus in the one observation you show, the intersection of var 1 and var2 is empty, but the intersection of var1 and var4 contains 1 point (whose value happens to be 1, so that's confusing)

Tom
Super User Tom
Super User

So you have 40 lists of IDS.

Might be easier if made it into a single list with two variables.

If you have variable that indicates each original observation you could use PROC TRANSPOSE.

If you don't have one then make one:

data have;
  row+1;
  set have;
run;

If you transpose you can then sort by the actual ID value.

proc transpose data=have out=tall(rename=(col1=ID) where=(not missing(id)));
  by row;
  var var: ;
run;

proc sort;
  by id _name_;
run;

A compact why to store them might be in a series of BOOLEAN variables instead.  

 

If the IDs are contiguous (like in your example they run from 1 to 30) then an array might be a good way to store them.

data boolean;
  length id 8 ;
  array flag [30,4] _temporary_ (%eval(30*4)*0);
  array var [4] ;
  set have end=eof;
  do i=1 to 4; 
    id = var[i];
    if id then flag[id,i]=1;
  end;
  if eof then do id=1 to 30;
    do i=1 to 4;
      var[i]=flag[id,i];
    end;
    if max(of var[*]) then output;
    keep id var1-var4;
  end;
run;

Result

Obs    id    var1    var2    var3    var4

  1     1      1       0       0       1
  2     2      1       0       1       1
  3     3      0       0       1       0
  4     4      0       0       0       1
  5     5      1       1       0       0
  6     6      1       1       1       1
  7     7      1       0       0       1
  8     8      1       0       0       1
  9    10      0       0       1       0
 10    12      1       0       0       1
 11    13      0       0       1       1
 12    14      0       0       1       0
 13    15      1       0       0       0
 14    17      1       0       1       0
 15    19      0       1       0       0
 16    20      0       1       0       0
 17    21      1       0       1       0
 18    22      1       0       1       0
 19    23      1       0       1       0
 20    25      1       1       1       0
 21    26      0       1       0       0
 22    27      0       0       1       0
 23    28      0       1       1       1
 24    29      1       0       0       0
 25    30      1       0       1       1

Now it is simple to get counts for combinations.

proc sql;
create tables counts as
select sum(var1 and var2) as var1_var2
        , sum(var1 and var3) as var1_var3
from boolean
;
quit;
       var1_    var1_
Obs     var2     var3

 1       3        8
Jean-Jacques
Obsidian | Level 7

@Tom wrote:

So you have 40 lists of IDS.

Might be easier if made it into a single list with two variables.

Yes, that is a much better way to describe the data.

I like the simplicity of this approach a lot.
Two questions:
1. In the final proc sql step, I would have to spell out all the combinations I want. Suggestions on how to generate all pairwise combinations? All triplets, etc...
2. Unfortunately, my variables are not named as conveniently as var1-varn. Suggestions on handling that as economically as possible?
3. Likewise, my IDs are numerical, but not as friendly as 1-30. Should I pull the min and max as macro variables and use them in the array? 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 28 replies
  • 1969 views
  • 15 likes
  • 7 in conversation