Solved: Re: Finding and grouping items related by many to many

tingnz · Posted 08-12-2015 12:20 AM

I have data that is related via a many to many relationship and I want to find all items that are related to each other and give them a unique id.

My example data is listed below:

PolicyId PersonId

1 1000

1 1001

2 1001

3 1000

4 1002

2 1003

5 1003

4 1004

My desired output would either be

PolicyId PersonId GroupId

1 1000 1

1 1001 1

2 1001 1

3 1000 1

4 1002 2

2 1003 1

5 1003 1

4 1004 2

or

PolicyID GroupId

1 1

2 1

3 1

4 2

5 1

PersonId GroupId

1000 1

1001 1

1002 2

1003 1

1004 2

Do you have any suggestions on how to do this efficiently?

I have managed to code this using a Macro, which for each policy, finds all related persons, then for each person, recursively calls the macro again for all related policies (that have not been assigned a GroupId already). This works ok for a small number of values, but I have 1.9m observations in my dataset, so it performs very slowly. There a some performance gains by calculating the cases where they is only one-to-many relationship between Policy and Person in a separate step, i.e. Policy 4 above would not be passed to the macro. This means the number I'm checking reduces to about 174K.

Ksharp · Posted 08-18-2015 09:46 AM

OK. I will get these three tables for you . ( use my original code )

Code: Program

data have;
input from $  to $ ;
cards;
1 1000
1 1001
2 1001
3 1000
4 1002
2 1003
5 1003
4 1004
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
   h.definekey('node');
   h.definedata('node');
   h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node; output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 16);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
   key=to;ry=_ha.check();
   if ry ne 0 then do;
   node=to;output;rr=no.remove(key:node);
   key=to;_ha.add();
   count+1;
   last=to;ha.add();
   end;
   rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

data want1 ;
 if _n_ eq 1 then do;
  if 0 then set want;
  declare hash ha(dataset:'want');
  ha.definekey('node');
  ha.definedata('household');
  ha.definedone();
 end;
set have;
ha.find(key:from);
run;
proc sort data=want1(drop=to) out=want2 nodupkey;by from household;run;
proc sort data=want1(drop=from) out=want3 nodupkey;by to household;run;

Xia Keshan

View solution in original post

andreas_lds · Posted 08-13-2015 07:32 AM

You should have posted your sas code.

The following code uses two hash objects:

data have;

input PolicyId PersonId;

datalines;

1 1000

1 1001

2 1001

3 1000

4 1002

2 1003

5 1003

4 1004

;

run;

data _null_;

set work.have end=last;

length GroupId NextGroupId 8;

retain NextGroupId 1;

if _n_ = 1 then do;

declare hash policies(ordered: 'yes');

policies.defineKey('PolicyId');

policies.defineData('PolicyId', 'GroupId');

policies.defineDone();

declare hash persons(ordered: 'yes');

persons.defineKey('PersonId');

persons.defineData('PersonId', 'GroupId');

persons.defineDone();

end;

if policies.find() = 0 then do;

/* PolicyID is already mapped to a Group */

if persons.check() ^= 0 then do;

/* but the personId is not mapped to the group */

persons.add();

end;

else do;

if persons.find() ^= 0 then do;

/* no mappings exist */

GroupId = NextGroupId;

NextGroupId = NextGroupId + 1;

persons.add();

policies.add();

end;

else do;

/* person-group mapping exists, but policy-group-mapping is missing */

policies.add();

end;

if last then do;

policies.output(dataset: 'work.PolicyGroupMapping');

persons.output(dataset: 'work.PersonGroupMapping');

end;

run;

Ksharp · Posted 08-13-2015 10:41 AM

Sorry, @andreas_lds . I am not tended to pick you up . Just curious your simple code can do such complicated task.

But your code would take a look forwards. See the following code , they all should be together.

data have;

input PolicyId PersonId;

datalines;

0 1009

0 1005

1 1000

1 1001

1 1002

2 1001

3 1000

4 1002

2 1003

2 1004

5 1003

4 1005

;

run;

andreas_lds · Posted 08-14-2015 02:54 AM

You are right, my code does not provide the correct result for your example.

tingnz · Posted 08-13-2015 08:29 PM

Yes, you're correct that I should've posted my code, but I was trying to come up with an example that wasn't using our company specific column names. I will try your code a test on my full dataset and let you know. I will also try add my data to the this discussion, (if I can work out how to quickly change the values to mask, as I'm using company sensitive data and also size permitting. In fact our compliance area might not allow me to post)

tingnz · Posted 08-14-2015 02:10 AM

This works beautifully. My code went from taking 16 hours to run to 14 seconds!

Ksharp · Posted 08-13-2015 10:13 AM

Check

Code: Program

data have;
input from $  to $ ;
cards;
1 1000
1 1001
2 1001
3 1000
4 1002
2 1003
5 1003
4 1004
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
   h.definekey('node');
   h.definedata('node');
   h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node; output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;




data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp:20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node',hashexp:20);
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
/***********/
if 0 then set have;
declare hash k(dataset:'have');
k.definekey('from');
k.definedone();
/***********/

do while(hi_no.next()=0);
 household+1; if k.check(key:node) ne 0 then output; /*********/
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
   key=to;ry=_ha.check();
   if ry ne 0 then do;
   node=to;if k.check(key:node) ne 0 then output; /*********/
   rr=no.remove(key:node);
   key=to;_ha.add();
   count+1;
   last=to;ha.add();
   end;
   rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

tingnz · Posted 08-14-2015 03:11 AM

Xia code works for his example, but only produces one of the datasets I'm looking for.

Ksharp · Posted 08-15-2015 07:24 AM

Can't you use that table to get another two tables ? That is easy I think.

tingnz · Posted 08-18-2015 01:48 AM

Xia's code does not work if the 'from' has the same values as 'to'

e.g.

data have;

input from $ to $;

cards;

10 10

10 11

1 10

2 20

;

run;

Ksharp · Posted 08-18-2015 07:45 AM

Are you sure ? OP is asking three tables. I just gave him one of them.

Check

Finde out who belong to the same household

Ksharp · Posted 08-18-2015 09:46 AM

OK. I will get these three tables for you . ( use my original code )

Code: Program

data have;
input from $  to $ ;
cards;
1 1000
1 1001
2 1001
3 1000
4 1002
2 1003
5 1003
4 1004
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
   h.definekey('node');
   h.definedata('node');
   h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node; output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 16);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
   key=to;ry=_ha.check();
   if ry ne 0 then do;
   node=to;output;rr=no.remove(key:node);
   key=to;_ha.add();
   count+1;
   last=to;ha.add();
   end;
   rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

data want1 ;
 if _n_ eq 1 then do;
  if 0 then set want;
  declare hash ha(dataset:'want');
  ha.definekey('node');
  ha.definedata('household');
  ha.definedone();
 end;
set have;
ha.find(key:from);
run;
proc sort data=want1(drop=to) out=want2 nodupkey;by from household;run;
proc sort data=want1(drop=from) out=want3 nodupkey;by to household;run;

Xia Keshan

tingnz · Posted 08-20-2015 01:42 AM

This amended code works and a whole lot faster than the original answer given in comment 2. On my data, using code from comment 2, it took 1 hour 20 minutes and was dropping records. The amended code above took 9 minutes and all policies and persons were accounted for.

tingnz · Posted 09-16-2015 01:39 AM

The code still doesn't work if we have "from" values the same as "to" values, e.g.

data have;

input from $ to $ ;

cards;

1 0

1 1

2 1

3 0

4 2

2 3

5 3

4 4

;

These are all getting assigned the same household, instead of separate households.

tingnz · Posted 09-16-2015 03:49 AM

I find that if I prefix either the 'From' or the 'To' with a letter then code works

Code: Program

Code: Program

Code: Program

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away