dr91
Calcite | Level 5

Hi,

 

I'm trying to find a way to identify the level of crossover between groups of individuals in my data.

 

For a dataset like:

Group ID   Individual ID
1          A
1          B
1          C
1          D
2          A
2          B
2          C
3          D
3          E
3          F

 

I'd like to be able to do something that would highlight any groups that share a certain percentage of their members. So here I'd like to flag that, for Group 1, 75% of its members are also in Group 2, so there's a connection between these groups, but only 25% of its members are in Group 3, so there isn't a connection there. Obviously the real data is much larger than the example above, and I'd want every group to be compared against every other one (so a link between Group 2 and Group 1 would be flagged, since Group 1 contains 100% of Group 2's members, and no link between Group 2 and Group 3 would be flagged, since Group 3 contains 0% of Group 2's members).
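(In other words, for an ordered pair of groups (X, Y) the quantity of interest is |members(X) ∩ members(Y)| / |members(X)|: overlap(1 → 2) = 3/4 = 75%, while overlap(2 → 1) = 3/3 = 100%.)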

 

I'm stumped as to how to do it, though, or whether it's even possible. Any help would be much appreciated.

8 REPLIES
HB
Barite | Level 11

I think it is the Cartesian element (needing to compare everything to everything) that is the killer here.

 

There is probably a built-in function that does this, and I'm sure one of the SuperUser geniuses around here can tell you about it. I don't know it.

 

As a lesser mortal, I would take a no doubt incorrect line:

1. Convert your data to a list of groups, each with an ordered string of its members:

1 ABCD

2 ABC

3 DEF

 

2. Use a loop to compare two copies of the group list to each other using COMPGED. I am envisioning a matrix with the groups as row and column headings and a COMPGED score in each cell representing the difference between the strings for that cell's row and column headings.

 

3.  If I knew how to write it, I would have done so. 

 

 

Edit:

 

I still don't know how to make a list of strings representing the individuals in each group, but my Cartesian comment made me realize a simple join would do it.

 

So:

data orig;
	input GroupID Indstring :$10.;
	datalines;
1 ABCD
2 ABC
3 DEF
;
run;

proc sql;
	/* the cross join compares every group's string to every other's */
	select a.indstring, b.indstring, compged(a.indstring, b.indstring)
	from orig a, orig b;
quit;

gives me 

 

Indstring  Indstring  COMPGED
ABCD       ABCD       0
ABCD       ABC        50
ABCD       DEF        450
ABC        ABCD       10
ABC        ABC        0
ABC        DEF        400
DEF        ABCD       410
DEF        ABC        400
DEF        DEF        0

 

A low score means high overlap; a high score means low overlap. Bring along other data elements as needed. Name things better. Delete the rows where the strings are the same. Sort in ascending or descending order. Scale the score to whatever flag threshold you like.
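A minimal sketch of those cleanup steps, building on the query above (SCORED, GROUP_A, and GROUP_B are just names made up for the sketch; the WHERE drops the self-comparisons and the ORDER BY puts the highest-overlap pairs first). One caveat: COMPGED measures edit distance on the ordered strings, so it is only a proxy for set overlap, which is why the member strings from step 1 need a consistent ordering.

proc sql;
	create table scored as
	select a.GroupID as group_a, b.GroupID as group_b,
	       compged(a.indstring, b.indstring) as score
	from orig a, orig b
	where a.GroupID ne b.GroupID  /* delete rows comparing a group to itself */
	order by score;               /* low score = high overlap */
quit;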

 

 

Edit #2:

A Cartesian match with 4.8 million groups?  I'm out.

Reeza
Super User

How many unique groups do you have?

Untested, but something like this should work; it may not scale well if you have a lot of groups, though. I think this gives you counts of the overlaps, and you'll have to convert them to percentages somehow; I'm not sure what the denominator should be (see the sketch after the code).

 

data long;
set have;   /* HAVE: one row per group/individual pair */
value=1;    /* membership indicator */
run;

proc sort data=long;
by id;
run;

/* one row per individual, one 0/1 column per group */
proc transpose data=long out=wide prefix=GROUP;
by id;
id group;
var value;
run;

*fill in missing 0's;
data wide_filled;
set wide;
array _g(*) group:;
do i=1 to dim(_g);
if _g(i) = . then _g(i) = 0;
end;
run;

/* SSCP of the 0/1 matrix: off-diagonal cells are co-occurrence counts,
   diagonal cells are group sizes */
ods output sscp=coocs;
proc corr data=wide_filled sscp;
var group:;
run;

proc print data=coocs;
run;
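For the percentage conversion, the diagonal of the SSCP matrix is each group's own size, which supplies the denominator. An untested sketch in the same spirit (PCT and DENOM are just names for the sketch); it assumes the SSCP output carries a character column VARIABLE holding GROUP1, GROUP2, ..., and that the groups are numbered so that GROUPn lands at array position n:

data pct;
set coocs;
array _g(*) group:;
/* diagonal cell = this row's group size = the denominator */
denom = _g(input(substr(variable, 6), 8.));
do i=1 to dim(_g);
_g(i) = 100 * _g(i) / denom;  /* % of the row group's members also in the column group */
end;
drop i denom;
run;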

@dr91 wrote:

I'm trying to find a way to identify the level of crossover between groups of individuals in my data. [...] I'm stumped as to how to do it though, or if this is even something that is possible.

PeterClemmensen
Tourmaline | Level 20

How much 'much larger' is your actual data? 🙂

dr91
Calcite | Level 5
Ah, yeah, that's probably quite important, and the biggest reason this is causing me problems.

About 4.8m groups.
PeterClemmensen
Tourmaline | Level 20

And how much memory do you have available?

ballardw
Super User

@dr91 wrote:
Ah, yeah, that's probably quite important, and the biggest reason this is causing me problems.

About 4.8m groups.

Is that after reducing to distinct groups, i.e., no duplicates of group and individual ID?

And how many unique individual id values?

ballardw
Super User

Here is an approach that compares pairs of group IDs. There are multiple count steps; the "final" data set has the percent of common elements and, currently, a single variable that contains the two group IDs.

Cautions: your number of group IDs is going to be too large to fit into a single macro variable if there are indeed 4.8 million individual group ID values. If your comment of 4.8m groups meant combinations of group and individual ID, then this might work. Below is one way to get the values of GROUPID into a single macro variable that can be used in the call to the not-very-imaginatively-named COMPLICATED macro.

If your data has duplicates of GROUPID and INDID, then you need to reduce your data set so that there are no duplicates for this to work (potentially significantly reducing the 4.8M number, since not much detail has been provided on your data).

Proc sql noprint;
   select distinct groupid into :grouplist separated by ' '
   from have
   ;
quit;

And a final caution: this may take quite a while to run, depending on the actual number of groups and unique records.

Assuming there are actually 10,000 group ID values, on the order of 50,000,000 pairwise comparisons will run (10,000 × 9,999 / 2 = 49,995,000). At 1 second each (possibly optimistic), that will take roughly one year and seven months to complete.

 

Suggestion would be to possibly reduce your analysis to a limited group of either Group or individual ids.

 

 

Obviously you need to use your own variable and data set names everywhere I reference GROUPID, INDID, or HAVE.

I have arbitrarily created a single variable for the pair. You could use two variables instead, but either approach has a potential issue with lengths. The PROC SQL above could be used instead of the macro parameter, but it still needs to be watched for length.

 

I might suggest testing the macro with no more than 5 or so values of your actual GROUPID variable for which you have some idea of the expected overlap.

 

You can filter on the resulting Percent value in the final Pairs data set (see the sketch after the code).

 

data have;
   input groupid $ indid $;
datalines;
1	A
1	B
1	C
1	D
2	A
2	B
2	C
3	D
3	E
3	F
4  A
4  C
4  E
4  F
5  B
5  D
5  F
;
%macro complicated(grouplist=);

%do i=1 %to %eval(%sysfunc(countw(&grouplist.)) - 1);
   %do j=%eval(&i. + 1) %to %sysfunc(countw(&grouplist.));
       %let firstgroup= %scan(&grouplist.,&i.);
       %let secondgroup= %scan(&grouplist.,&j.);
       %put &firstgroup. &secondgroup.;
       /* count each individual's appearances within the two groups */
       Proc freq data=have noprint;
          where groupid in ("&firstgroup." "&secondgroup.");
          tables indid*groupid/out=temp;
      run;
      /* how many of the two groups does each individual appear in (1 or 2)? */
      proc freq data=temp order=freq noprint;
         tables indid/out=temp2 outcum;
      run;
      /* distribution of that count; the PERCENT for tocount=2 is the share of
         individuals (out of the two groups' union) found in both groups */
      proc freq data=temp2 (rename=(count=tocount)) order=freq noprint;
         tables tocount/out=temp3;
      run;
      data temp4;
         set temp3;
         length group_pair $ 50;
         where tocount=2;
         group_pair = "&firstgroup. &secondgroup.";
         drop count tocount;
      run;
      proc append base=pairs data= temp4;
      run;
   %end;
%end;
%mend;

%complicated(grouplist=1 2 3 4 5 );
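The flagging itself is then just a filter on the PAIRS data set; PERCENT is the variable PROC FREQ writes to its OUT= data set, and FLAGGED and the 50 below are an arbitrary name and threshold for the sketch:

data flagged;
   set pairs;
   where percent ge 50;  /* keep pairs sharing at least half of their combined members */
run;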


 

 

 

mkeintz
PROC Star

I suspect, for a large dataset with many GROUPID*INDID values, this is most efficiently generated by (1) sorting by INDID/GROUPID, and (2) for each INDID, finding all GROUPID combinations for that INDID and adding them to a running total across all INDIDs.

 

For each INDID, the second step is basically done by:

  1. Read in a single group record.
  2. With the current GROUPID in hand, iterate through the list of all prior GROUPIDs and increment the count of each current/historical group pair.
  3. Add the current GROUPID to the historical list.
  4. Go back to step 1.

 

Using @ballardw 's dataset HAVE:

 

data have;
   input groupid $ indid $;
datalines;
1	A
1	B
1	C
1	D
2	A
2	B
2	C
3	D
3	E
3	F
4  A
4  C
4  E
4  F
5  B
5  D
5  F
;
run;

proc sort data=have out=need; by indid groupid; 
run;

data _null_;
  set need end=end_of_need;
  by indid;

  if _n_=1 then do;
    length groupid2 $8   pair_count 8;
    declare hash hist ();
      hist.definekey('indid','groupid2');
      hist.definedata('indid','groupid2');
      hist.definedone();
    declare hiter hi ('hist');

    declare hash pairs(ordered:'A');
      pairs.definekey('groupid','groupid2');
      pairs.definedata('groupid','groupid2','pair_count');
      pairs.definedone();
  end;

  /* Now iterate through all the previous GROUPIDs for this INDID */
  if not first.indid then do rc=hi.first() by 0 until (hi.next()^=0);
    pair_count=0;
    rc1=pairs.find();         /* Retrieve running total for this GROUPID,GROUPID2 pair*/
    pair_count=pair_count+1;
    rc2=pairs.replace();
  end;

  groupid2=groupid;                 /* Current groupid becomes part of group history */
  rc3=hist.add();                   /* Add current groupid to hash HIST */
  if last.indid then hist.clear();  /* Done with this ID? Empty the hash object */

  if end_of_need then pairs.output(dataset:'want');
run;

This code keeps memory requirements down, since it needs only enough memory to hold (1) a list of all GROUPIDs for a single INDID (in hash object HIST) and (2) a frequency table of all encountered GROUPID pairs (no more than one-half of the number of groups squared), in hash object PAIRS.

 

By contrast, the SQL below, while much less code, is probably a far greater consumer of memory:

 

proc sql noprint;
  create table want as select
  catx('-',a.groupid,b.groupid) as group_pair,
  count(a.indid) as intersection_count
  from have as a   left join have as b
  on a.indid = b.indid  where a.groupid<b.groupid
  group by calculated group_pair;
quit;
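That query returns raw intersection counts; to get the percentages asked for originally (the share of one group's members that also appear in another), one more aggregation for group sizes would do it. A sketch, again assuming HAVE holds no duplicate GROUPID/INDID rows (the names SIZES, PCT_OVERLAP, N_MEMBERS, N_COMMON, and PCT_OF_A are just illustrative):

proc sql;
  /* size of each group, for the denominator */
  create table sizes as
  select groupid, count(distinct indid) as n_members
  from have
  group by groupid;

  /* percent of group_a's members that also appear in group_b */
  create table pct_overlap as
  select a.groupid as group_a,
         b.groupid as group_b,
         count(*) as n_common,
         calculated n_common / s.n_members * 100 as pct_of_a
  from have as a
       inner join have as b
         on a.indid = b.indid and a.groupid ne b.groupid
       inner join sizes as s
         on a.groupid = s.groupid
  group by a.groupid, b.groupid, s.n_members;
quit;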

 


