About dr91

mkeintz · ‎02-22-2022

I suspect, for a large dataset, with many GROUPID*INDID values, this is most efficiently generated by (1) sorting by INDID/GROUPID, and (2) for each INDID, find all GROUPID combinations for that INDID and add to a running total for all INDID's. For each INDID, the second step is basically done by reading in a single group record with the current groupid in hand, iterate through a list of all prior groupid's and add to the count of each current/historical group pairs. add the current group id to the historical list. Go back to step 1. Using @ballardw 's dataset HAVE: data have; input groupid $ indid $; datalines; 1 A 1 B 1 C 1 D 2 A 2 B 2 C 3 D 3 E 3 F 4 A 4 C 4 E 4 F 5 B 5 D 5 F run; proc sort data=have out=need; by indid groupid; run; data _null_; set need end=end_of_need; by indid; if _n_=1 then do; length groupid2 $8 pair_count 8; declare hash hist (); hist.definekey('indid','groupid2'); hist.definedata('indid','groupid2'); hist.definedone(); declare hiter hi ('hist'); declare hash pairs(ordered:'A'); pairs.definekey('groupid','groupid2'); pairs.definedata('groupid','groupid2','pair_count'); pairs.definedone(); end; /* Now iterate through all the previous GROUPIDs for this INDID */ if not first.indid then do rc=hi.first() by 0 until (hi.next()^=0); pair_count=0; rc1=pairs.find(); /* Retrieve running total for this GROUPID,GROUPID2 pair*/ pair_count=pair_count+1; rc2=pairs.replace(); end; groupid2=groupid; /* Current groupid becomes part of group history*/ rc3=hist.add(); /* Add current groupid to hash H */ if last.indid then hist.clear(); /*Done with this ID? empty the hash object */ if end_of_need then pairs.output(dataset:'want'); run; This code keeps memory requirements down, since it needs only enough memory to hold (1) a list off all groupid's for a single indid (in hash object hist), and (2) a table of frequency for all encountered groupid pairs (no more than one-half of N of groups squared) - in hash object pairs. By contrast, the SQL below, while much less code, is probably a far greater consumer of memory: proc sql noprint; create table want as select catx('-',a.groupid,b.groupid) as group_pair, count(a.indid) as intersection_count from have as a left join have as b on a.indid = b.indid where a.groupid<b.groupid group by calculated group_pair; quit;

Online Status	Offline
Date Last Visited	‎02-22-2022 06:37 PM

Re: Identify crossover between groups

Identify crossover between groups

Re: Identify crossover between groups