I suspect, for a large dataset, with many GROUPID*INDID values, this is most efficiently generated by (1) sorting by INDID/GROUPID, and (2) for each INDID, find all GROUPID combinations for that INDID and add to a running total for all INDID's.
For each INDID, the second step is basically done by
reading in a single group record
with the current groupid in hand, iterate through a list of all prior groupid's and add to the count of each current/historical group pairs.
add the current group id to the historical list.
Go back to step 1.
Using @ballardw 's dataset HAVE:
data have;
input groupid $ indid $;
datalines;
1 A
1 B
1 C
1 D
2 A
2 B
2 C
3 D
3 E
3 F
4 A
4 C
4 E
4 F
5 B
5 D
5 F
run;
proc sort data=have out=need; by indid groupid;
run;
data _null_;
set need end=end_of_need;
by indid;
if _n_=1 then do;
length groupid2 $8 pair_count 8;
declare hash hist ();
hist.definekey('indid','groupid2');
hist.definedata('indid','groupid2');
hist.definedone();
declare hiter hi ('hist');
declare hash pairs(ordered:'A');
pairs.definekey('groupid','groupid2');
pairs.definedata('groupid','groupid2','pair_count');
pairs.definedone();
end;
/* Now iterate through all the previous GROUPIDs for this INDID */
if not first.indid then do rc=hi.first() by 0 until (hi.next()^=0);
pair_count=0;
rc1=pairs.find(); /* Retrieve running total for this GROUPID,GROUPID2 pair*/
pair_count=pair_count+1;
rc2=pairs.replace();
end;
groupid2=groupid; /* Current groupid becomes part of group history*/
rc3=hist.add(); /* Add current groupid to hash H */
if last.indid then hist.clear(); /*Done with this ID? empty the hash object */
if end_of_need then pairs.output(dataset:'want');
run;
This code keeps memory requirements down, since it needs only enough memory to hold (1) a list off all groupid's for a single indid (in hash object hist), and (2) a table of frequency for all encountered groupid pairs (no more than one-half of N of groups squared) - in hash object pairs.
By contrast, the SQL below, while much less code, is probably a far greater consumer of memory:
proc sql noprint;
create table want as select
catx('-',a.groupid,b.groupid) as group_pair,
count(a.indid) as intersection_count
from have as a left join have as b
on a.indid = b.indid where a.groupid<b.groupid
group by calculated group_pair;
quit;
... View more