Data mavens, I thought this was going to be simple, but I am stuck.

I have datasets with 4-40 variables and up to 10,000 observations. All variables are numeric. There are no duplicates within any variable, but many duplicates across variables. Those shared values between variables are expected, and I do not want to eliminate them. Some variables have many missing values.

For any of those datasets, I want a count of the values that are identical across combinations of variables. In other words, treating each variable as a set, I want the count of points in the intersection of each combination of variables: first all pairwise combinations, then all triplets, and up to four-way combinations, but not necessarily all possible combinations when the number of variables gets larger than five.

The "sets" to which each variable corresponds are used for planning work tasks. The purpose of looking at intersections is to help users decide whether to merge some of those tasks to avoid duplication of effort. If the intersection is large, then they would merge the two sets of tasks.

So, starting with this "have" data, the "want" data for all pairwise combinations looks like the following. The exact shape of that "want" could be different.

data have;
input var1 var2 var3 var4;
datalines;
15 26 3 13
25 28 28 1
30 20 27 12
25 5 10 4
7 6 22 28
6 19 17 7
23 25 6 2
12 . 25 30
2 . 23 8
5 . 30 6
21 . 14 .
8 . 13 .
22 . 2 .
29 . 21 .
1 . . .
;
run;
data want;
input set1 $ set2 $ shared_count;
datalines;
var1 var2 4
var1 var3 9
var1 var4 8
var2 var3 3
var2 var4 2
var3 var4 5
;
run;

Is it really simple, and I am not seeing it? Thank you all.
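One possible approach, offered as a minimal sketch rather than a definitive answer: reshape "have" to a long table with one row per (variable name, value), drop missing values and any within-variable repeats, then self-join on the value in PROC SQL. The names long, setname, value, want_pairs, and want_triplets are made up for illustration, and the array is hard-coded to var1-var4 to match the sample data. Because this counts distinct shared values, the results may not match the illustrative numbers in the "want" table, and pairs with no shared values will simply be absent from the output.

```sas
/* Reshape to long: one row per (variable name, value), skipping missings. */
/* Assumption: the variables are var1-var4, as in the sample "have" data.  */
data long;
  set have;
  array v{*} var1-var4;
  length setname $32;
  do i = 1 to dim(v);
    if not missing(v{i}) then do;
      setname = vname(v{i});
      value = v{i};
      output;
    end;
  end;
  keep setname value;
run;

/* Treat each variable as a set: keep each value once per variable. */
proc sort data=long nodupkey;
  by setname value;
run;

proc sql;
  /* Pairwise intersections: a.setname < b.setname keeps each pair once. */
  create table want_pairs as
  select a.setname as set1,
         b.setname as set2,
         count(*)  as shared_count
  from long as a
       inner join long as b
       on a.value = b.value
      and a.setname < b.setname
  group by a.setname, b.setname
  order by set1, set2;

  /* Triplets: the same idea with a third copy of the long table. */
  create table want_triplets as
  select a.setname as set1,
         b.setname as set2,
         c.setname as set3,
         count(*)  as shared_count
  from long as a, long as b, long as c
  where a.value = b.value
    and b.value = c.value
    and a.setname < b.setname
    and b.setname < c.setname
  group by a.setname, b.setname, c.setname
  order by set1, set2, set3;
quit;
```

Four-way combinations would follow the same pattern with a fourth copy of the long table. To limit the combinations when there are many variables, one option is to subset long to the variables of interest before the join. Note that the comparison on setname is alphabetical, so with more than nine variables "var10" sorts before "var2"; each combination still appears exactly once, just in that order.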