Data mavens,
I thought this was going to be simple, but I am stuck.
I have datasets with 4 to 40 variables and up to 10,000 observations.
All variables are numerical. There are no duplicate values within any one variable, but there are many duplicates across variables.
Those shared values between variables are expected, and I do not want to eliminate them.
Some variables have many missings.
For any of those datasets, I want a count of the number of values that are identical across combinations of variables.
In other words, looking at each variable as a set, I want the count of points in the intersection of combinations of variables.
First all pairwise combinations, then all triplets, and up to four-way combinations, but not necessarily all possible combinations when the number of variables grows beyond five.
The "sets" to which each variable corresponds are used for planning work tasks. The purpose of looking at intersections is to help users decide whether to merge some of those tasks to avoid duplication of effort. If the intersection is large, then they would merge the two sets of tasks.
So, starting with this "have" data, the "want" data for all pairwise combinations looks like the following. The exact shape of that "want" could be different.
data have;
input var1 var2 var3 var4;
datalines;
15 26 3 13
25 28 28 1
30 20 27 12
25 5 10 4
7 6 22 28
6 19 17 7
23 25 6 2
12 . 25 30
2 . 23 8
5 . 30 6
21 . 14 .
8 . 13 .
22 . 2 .
29 . 21 .
1 . . .
;
run;
data want;
input set1 $ set2 $ shared_count;
datalines;
var1 var2 4
var1 var3 9
var1 var4 8
var2 var3 3
var2 var4 2
var3 var4 5
;
run;
Is it really simple, and I am not seeing it?
Thank you all.
You could learn SAS/IML from Rick Wicklin's blog:
https://communities.sas.com/t5/SAS-IML-Software-and-Matrix/bd-p/sas_iml
data have;
input var1-var6;
datalines;
1 5 2 1 1 1
2 6 3 2 2 2
5 19 6 4 4 4
6 20 10 6 6 6
7 25 13 7 7 7
8 26 14 8 8 8
12 28 17 12 12 12
15 . 21 13 13 13
17 . 22 28 28 28
21 . 23 30 30 30
22 . 25 . . .
23 . 27 . . .
25 . 28 . . .
29 . 30 . . .
30 . . . . .
;
run;
/*For combinations of FOUR variables*/
proc iml;
use have;
read all var _all_ into x[c=vname];
close;
n=ncol(vname);
do i=1 to n-3;
do j=i+1 to n-2;
do k=j+1 to n-1;
do l=k+1 to n;
set1=set1//vname[i];
set2=set2//vname[j];
set3=set3//vname[k];
set4=set4//vname[l];
a=x[,i];a=a[loc(a^=.)];
b=x[,j];b=b[loc(b^=.)];
c=x[,k];c=c[loc(c^=.)];
d=x[,l];d=d[loc(d^=.)];
shared_count=shared_count//sum(element(a,b) & element(a,c) & element(a,d));
end;
end;
end;
end;
create want_four var {set1 set2 set3 set4 shared_count};
append;
close;
quit;
/*For combinations of FIVE variables*/
proc iml;
use have;
read all var _all_ into x[c=vname];
close;
n=ncol(vname);
do i=1 to n-4;
do j=i+1 to n-3;
do k=j+1 to n-2;
do l=k+1 to n-1;
do m=l+1 to n;
set1=set1//vname[i];
set2=set2//vname[j];
set3=set3//vname[k];
set4=set4//vname[l];
set5=set5//vname[m];
a=x[,i];a=a[loc(a^=.)];
b=x[,j];b=b[loc(b^=.)];
c=x[,k];c=c[loc(c^=.)];
d=x[,l];d=d[loc(d^=.)];
e=x[,m];e=e[loc(e^=.)];
shared_count=shared_count//sum(element(a,b) & element(a,c) & element(a,d) & element(a,e));
end;
end;
end;
end;
end;
create want_five var {set1 set2 set3 set4 set5 shared_count};
append;
close;
quit;
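The same pattern reduces to pairwise combinations as well; a minimal sketch that simply follows the structure of the code above (the want_two dataset name is mine):
/*For combinations of TWO variables*/
proc iml;
use have;
read all var _all_ into x[c=vname];
close;
n=ncol(vname);
do i=1 to n-1;
do j=i+1 to n;
set1=set1//vname[i];
set2=set2//vname[j];
a=x[,i];a=a[loc(a^=.)];
b=x[,j];b=b[loc(b^=.)];
/* count of nonmissing values of column i that also appear in column j */
shared_count=shared_count//sum(element(a,b));
end;
end;
create want_two var {set1 set2 shared_count};
append;
close;
quit;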
Possible approach:
data have;
input var1 var2 var3 var4;
datalines;
15 26 3 13
25 28 28 1
30 20 27 12
25 5 10 4
7 6 22 28
6 19 17 7
23 25 6 2
12 . 25 30
2 . 23 8
5 . 30 6
21 . 14 .
8 . 13 .
22 . 2 .
29 . 21 .
1 . . .
;
run;
data w1;
set have;
array v var1--var4;
do over v;
if v NE . then
do;
val = v;
vnm = vname(v);
output;
end;
end;
keep val vnm;
run;
proc sql;
create table w2 as
select
a.vnm as v1,
b.vnm as v2,
a.val
from
w1 as a
inner join
w1 as b
on a.val=b.val
having a.vnm <> b.vnm
order by 3,1,2
;
run;
data w3;
set w2;
call sortc(v1,v2);
run;
proc sort data=w3 nodupkey;
by v1 v2 val;
run;
proc sql;
select v1, v2, count(distinct val) as common_values
from w3
group by v1, v2;
quit;
[EDIT:] It will give you the number of "unique common values": if var1 is 3,3,3 and var2 is 3,3,3, you will get 1.
Bart
I did a small test run with 40-by-10,000 data; it looks quite good (done on my laptop):
1
2 data have;
3 array var[40];
4 do _N_=1 to 10000;
5
6 var1=_N_;
7
8 do i=2 to 40; drop i;
9 if ranuni(123)>0.5 then var[i]=.;
10 else var[i]=_N_;
11 end;
12
13 output;
14 end;
15 run;
NOTE: The data set WORK.HAVE has 10000 observations and 40 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
16
17 options fullstimer;
18 data w1;
19 set have;
20
21 array v var:;
22
23 do over v;
24 if v NE . then
25 do;
26 val = v;
27 vnm = vname(v);
28 output;
29 end;
30 end;
31 keep val vnm;
32 run;
NOTE: There were 10000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.W1 has 205049 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
user cpu time 0.01 seconds
system cpu time 0.03 seconds
memory 677.59k
OS Memory 24056.00k
Timestamp 09/25/2023 07:30:37 PM
Step Count 22 Switch Count 0
33
34 proc sql;
35 create table w2 as
36 select
37 a.vnm as v1,
38 b.vnm as v2,
39 a.val
40 from
41 w1 as a
42 inner join
43 w1 as b
44 on a.val=b.val
45 having a.vnm <> b.vnm
46 order by 3,1,2
47 ;
NOTE: The "<>" operator is interpreted as "not equals".
NOTE: Table WORK.W2 created, with 4098054 rows and 3 columns.
48 run;
NOTE: PROC SQL statements are executed immediately; The RUN statement has no effect.
49
NOTE: PROCEDURE SQL used (Total process time):
real time 1.34 seconds
user cpu time 1.85 seconds
system cpu time 1.39 seconds
memory 3387087.71k
OS Memory 3409428.00k
Timestamp 09/25/2023 07:30:39 PM
Step Count 23 Switch Count 0
50 data w3;
51 set w2;
52 call sortc(v1,v2);
53 run;
NOTE: There were 4098054 observations read from the data set WORK.W2.
NOTE: The data set WORK.W3 has 4098054 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.76 seconds
user cpu time 0.23 seconds
system cpu time 0.53 seconds
memory 630.15k
OS Memory 24056.00k
Timestamp 09/25/2023 07:30:39 PM
Step Count 24 Switch Count 0
54
55 proc sort data=w3 nodupkey;
56 by v1 v2 val;
57 run;
NOTE: There were 4098054 observations read from the data set WORK.W3.
NOTE: 2049027 observations with duplicate key values were deleted.
NOTE: The data set WORK.W3 has 2049027 observations and 3 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 2.11 seconds
user cpu time 2.79 seconds
system cpu time 1.40 seconds
memory 3409925.96k
OS Memory 3433192.00k
Timestamp 09/25/2023 07:30:42 PM
Step Count 25 Switch Count 0
58
59 proc sql;
60 select v1, v2, count(distinct val) as common_values
61 from w3
62 group by v1, v2;
63 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.79 seconds
user cpu time 0.67 seconds
system cpu time 0.11 seconds
memory 5412.62k
OS Memory 29180.00k
Timestamp 09/25/2023 07:30:42 PM
Step Count 26 Switch Count 0
Bart
This seems to work as expected, and fast.
I am glad to see that it is a lot less simple than what I had been trying.
Next I need to figure out how to generate triplets. Your suggestions are appreciated!
The same way as pairs (but now the space requirement grows faster; read the comments in the code).
data w1;
set have;
array v var1--var40;
do over v;
if v NE . then
do;
val = v;
length vnm $ 5; /* if you know that variable names are shorter than 32, make the length shorter to save space,
especially for the next steps:
with length 32, w1 is 7.9GB and w3 is 1.3GB;
with length 5, w1 is only 3.3MB!! and w3 is 307MB
*/
vnm = vname(v);
output;
end;
end;
keep val vnm;
run;
proc sql;
create table temp1 as
select
a.vnm as v1,
b.vnm as v2,
a.val
from
w1 as a
inner join
w1 as b
on a.val=b.val
having a.vnm <> b.vnm
;
create table w2 as
select
x.v1,
x.v2,
y.vnm as v3,
x.val
from
temp1 as x
inner join
w1 as y
on x.val=y.val
having y.vnm <> x.v1 and y.vnm <> x.v2
order by 4,1,2,3
;
run;
proc delete data=w1;
run;
data w3;
set w2;
call sortc(v1,v2,v3);
run;
proc delete data=w2;
run;
proc sort data=w3 nodupkey;
by v1 v2 v3 val;
run;
proc sql;
select v1, v2, v3, count(distinct val) as common_values
from w3
group by v1, v2, v3;
quit;
Bart
NOTE: The "<>" operator is interpreted as "not equals".
Yes, that's the intention. I want "a.vnm <> b.vnm" (not equal) to avoid rows like: "var1", "var1", 5.
B.
P.S.:
Code:
resetline;
data have;
input x y;
cards;
1 0
1 1
0 1
0 0
;
run;
data want;
set have;
where x <> y;
z = x<>y;
run;
Log:
1    data have;
2    input x y;
3    cards;
NOTE: The data set WORK.HAVE has 4 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              408.90k
      OS Memory           36604.00k
8    ;
9    run;
10
11   data want;
12   set have;
13   where x <> y;
NOTE: The "<>" operator is interpreted as "not equals".
14   z = x<>y;
NOTE: The "<>" operator is interpreted as "MAX".
15   run;
NOTE: There were 2 observations read from the data set WORK.HAVE.
      WHERE x not = y;
NOTE: The data set WORK.WANT has 2 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              575.65k
      OS Memory           36604.00k
#SASlife ... 😄 😄 😄
PROC SUMMARY might do it. It works fine for your simple case, but with 40 * 10,000 values it might run out of memory.
proc summary data=have chartype;
class var1-var4 / missing;
ways 2;
output out=step1;
run;
Which will show all combinations of 2 variables. The _TYPE_ variable will indicate which variables are included.
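If you want the pair of variable names rather than the _TYPE_ pattern, the character _TYPE_ string (one 0/1 per CLASS variable, in CLASS-statement order because of CHARTYPE) can be decoded back to names. A minimal sketch; the type_pairs name and the hard-coded name list are mine:
data type_pairs;
set step1;
array cls [4] $32 _temporary_ ('var1' 'var2' 'var3' 'var4');
length set1 set2 $32;
/* pick up the two variables flagged by the _TYPE_ string */
do i=1 to 4;
if substr(_type_,i,1)='1' then do;
if set1=' ' then set1=cls[i];
else set2=cls[i];
end;
end;
drop i;
run;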
I am not sure how you got the counts you show. If I count how many observations have two non-missing values for each pair:
proc freq data=step1;
where 2=n(var1,var2,var3,var4);
tables _type_ / out=want noprint;
run;
I get:
Obs    _TYPE_    COUNT    PERCENT
  1     0011        10    18.1818
  2     0101         7    12.7273
  3     0110         7    12.7273
  4     1001        10    18.1818
  5     1010        14    25.4545
  6     1100         7    12.7273
So 7 combinations of VAR1 and VAR2.
Obs    var1    var2    var3    var4    _TYPE_    _FREQ_
 72       1       .       .       .      1100         1
 73       2       .       .       .      1100         1
 74       5       .       .       .      1100         1
 75       6      19       .       .      1100         1
 76       7       6       .       .      1100         1
 77       8       .       .       .      1100         1
 78      12       .       .       .      1100         1
 79      15      26       .       .      1100         1
 80      21       .       .       .      1100         1
 81      22       .       .       .      1100         1
 82      23      25       .       .      1100         1
 83      25       5       .       .      1100         1
 84      25      28       .       .      1100         1
 85      29       .       .       .      1100         1
 86      30      20       .       .      1100         1
Are you considering VAR1=1 and VAR2=2 the same as VAR1=2 and VAR2=1?
Apologies!
As Tom noticed, some of my counts are incorrect.
I mentioned in the post that there are no duplicates within each variable, but my "have" data had some.
Here is the corrected data.
As Tom pointed out, PROC SUMMARY gets overwhelmed when I try it with the full datasets.
data have;
input var1 var2 var3 var4;
datalines;
1 5 2 1
2 6 3 2
5 19 6 4
6 20 10 6
7 25 13 7
8 26 14 8
12 28 17 12
15 . 21 13
17 . 22 28
21 . 23 30
22 . 25 .
23 . 27 .
25 . 28 .
29 . 30 .
30 . . .
;
run;
data want;
input set1 $ set2 $ shared_count;
datalines;
var1 var2 3
var1 var3 8
var1 var4 7
var2 var3 3
var2 var4 2
var3 var4 5
;
run;
Hope I am not getting them wrong again. I am just doing a manual count.
The shared values between var1 and var2 are 5, 6, and 25, thus count=3.
I do not understand what your data structure means. You seem to be treating this data as 4 independent one-variable datasets.
What is the meaning of the first observation?
var1 var2 var3 var4;
1 5 2 1
@Tom wrote:
You seem to be treating this data as 4 independent one-variable datasets
In a way, yes, that's correct. See post.
The observations don't really "matter". All I want to know is how many points are in the intersection of any two, three, or four of those sets.
Thus in the one observation you show, the intersection of var1 and var2 is empty, but the intersection of var1 and var4 contains 1 point (whose value happens to be 1, so that's confusing).
So you have 40 lists of IDs.
It might be easier if you made it into a single list with two variables.
If you have a variable that indicates each original observation, you could use PROC TRANSPOSE.
If you don't have one, then make one:
data have;
row+1;
set have;
run;
If you transpose you can then sort by the actual ID value.
proc transpose data=have out=tall(rename=(col1=ID) where=(not missing(id)));
by row;
var var: ;
run;
proc sort;
by id _name_;
run;
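From that tall, sorted layout, pairwise shared counts can then be obtained with a self-join on the ID value (the same idea as the earlier SQL approach). A minimal sketch; the pair_counts name is mine:
proc sql;
create table pair_counts as
select a._name_ as set1, b._name_ as set2,
count(distinct a.id) as shared_count
from tall as a inner join tall as b
on a.id = b.id
where a._name_ < b._name_ /* keep each unordered pair once */
group by a._name_, b._name_;
quit;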
A compact way to store them might be a series of BOOLEAN variables instead.
If the IDs are contiguous (like in your example they run from 1 to 30) then an array might be a good way to store them.
data boolean;
length id 8 ;
array flag [30,4] _temporary_ (%eval(30*4)*0);
array var [4] ;
set have end=eof;
do i=1 to 4;
id = var[i];
if id then flag[id,i]=1;
end;
if eof then do id=1 to 30;
do i=1 to 4;
var[i]=flag[id,i];
end;
if max(of var[*]) then output;
keep id var1-var4;
end;
run;
Result
Obs    id    var1    var2    var3    var4
  1     1       1       0       0       1
  2     2       1       0       1       1
  3     3       0       0       1       0
  4     4       0       0       0       1
  5     5       1       1       0       0
  6     6       1       1       1       1
  7     7       1       0       0       1
  8     8       1       0       0       1
  9    10       0       0       1       0
 10    12       1       0       0       1
 11    13       0       0       1       1
 12    14       0       0       1       0
 13    15       1       0       0       0
 14    17       1       0       1       0
 15    19       0       1       0       0
 16    20       0       1       0       0
 17    21       1       0       1       0
 18    22       1       0       1       0
 19    23       1       0       1       0
 20    25       1       1       1       0
 21    26       0       1       0       0
 22    27       0       0       1       0
 23    28       0       1       1       1
 24    29       1       0       0       0
 25    30       1       0       1       1
Now it is simple to get counts for combinations.
proc sql;
create table counts as
select sum(var1 and var2) as var1_var2
, sum(var1 and var3) as var1_var3
from boolean
;
quit;
Obs    var1_var2    var1_var3
  1            3            8
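Triplets and four-way counts follow the same pattern from the boolean table; a minimal sketch extending the query above (the counts3 name and the column aliases are mine):
proc sql;
create table counts3 as
select sum(var1 and var2 and var3) as var1_var2_var3
, sum(var1 and var2 and var4) as var1_var2_var4
, sum(var1 and var3 and var4) as var1_var3_var4
, sum(var2 and var3 and var4) as var2_var3_var4
from boolean
;
quit;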
@Tom wrote:
So you have 40 lists of IDs.
It might be easier if you made it into a single list with two variables.
Yes, that is a much better way to describe the data.
I like the simplicity of this approach a lot.
Three questions:
1. In the final proc sql step, I would have to spell out all the combinations I want. Suggestions on how to generate all pairwise combinations? All triplets, etc.? (See the sketch after this list.)
2. Unfortunately, my variables are not named as conveniently as var1-varn. Suggestions on handling that as economically as possible?
3. Likewise, my IDs are numerical, but not as friendly as 1-30. Should I pull the min and max as macro variables and use them in the array?
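On question 1, one possible direction (a sketch, not from the thread): read the flag variable names from DICTIONARY.COLUMNS and let PROC SQL build the SUM expressions for every pair. The pairwise_counts table name is mine, and it assumes the boolean table is WORK.BOOLEAN, the ID column is named ID, and each generated alias fits in 32 characters.
proc sql noprint;
/* build one "sum(a and b) as a_b" term per unordered pair of flag columns */
select catx(' ', 'sum(', a.name, 'and', b.name, ') as', cats(a.name, '_', b.name))
into :pairlist separated by ' , '
from dictionary.columns as a, dictionary.columns as b
where a.libname='WORK' and a.memname='BOOLEAN' and upcase(a.name) ne 'ID'
and b.libname='WORK' and b.memname='BOOLEAN' and upcase(b.name) ne 'ID'
and a.name < b.name;
quit;
proc sql;
create table pairwise_counts as
select &pairlist
from boolean;
quit;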