BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
tingnz
Calcite | Level 5


I have data that is related via a many to many relationship and I want to find all items that are related to each other and give them a unique id.

My example data is listed below:

PolicyId   PersonId

1          1000

1          1001

2          1001

3          1000

4          1002

2          1003

5          1003

4          1004

My desired output would either be

PolicyId   PersonId     GroupId

1          1000          1

1          1001          1

2          1001          1

3          1000          1

4          1002          2

2          1003          1

5          1003          1

4          1004          2

or

PolicyID     GroupId

1               1

2               1

3               1

4               2

5               1

PersonId     GroupId

1000          1

1001          1

1002          2

1003          1

1004          2

Do you have any suggestions on how to do this efficiently? 

I have managed to code this using a Macro, which for each policy, finds all related persons, then for each person, recursively calls the macro again for all related policies (that have not been assigned a GroupId already).  This works ok for a small number of values, but I have 1.9m observations in my dataset, so it performs very slowly.  There a some performance gains by calculating the cases where they is only one-to-many relationship between Policy and Person in a separate step, i.e. Policy 4 above would not be passed to the macro.  This means the number I'm checking reduces to about 174K. 

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

OK. I will get these three tables for you . ( use my original code )

Code: Program

data have;
input from $  to $ ;
cards;
1 1000
1 1001
2 1001
3 1000
4 1002
2 1003
5 1003
4 1004
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
   h.definekey('node');
   h.definedata('node');
   h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node; output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 16);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();


do while(hi_no.next()=0);
household+1; output;
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
   key=to;ry=_ha.check();
   if ry ne 0 then do;
   node=to;output;rr=no.remove(key:node);
   key=to;_ha.add();
   count+1;
   last=to;ha.add();
   end;
   rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

data want1 ;
if _n_ eq 1 then do;
  if 0 then set want;
  declare hash ha(dataset:'want');
  ha.definekey('node');
  ha.definedata('household');
  ha.definedone();
end;
set have;
ha.find(key:from);
run;
proc sort data=want1(drop=to) out=want2 nodupkey;by from household;run;
proc sort data=want1(drop=from) out=want3 nodupkey;by to household;run;

Xia Keshan

View solution in original post

14 REPLIES 14
andreas_lds
Jade | Level 19

You should have posted your sas code.

The following code uses two hash objects:

data have;

input PolicyId   PersonId;

datalines;

1 1000

1 1001

2 1001

3 1000

4 1002

2 1003

5 1003

4 1004

;

run;

data _null_;

   set work.have end=last;

   length GroupId NextGroupId 8;

   retain NextGroupId 1;

   if _n_ = 1 then do;

      declare hash policies(ordered: 'yes');

      policies.defineKey('PolicyId');

      policies.defineData('PolicyId', 'GroupId');

      policies.defineDone();

      declare hash persons(ordered: 'yes');

      persons.defineKey('PersonId');

      persons.defineData('PersonId', 'GroupId');

      persons.defineDone();

   end;

   if policies.find() = 0 then do;

      /* PolicyID is already mapped to a Group */

      if persons.check() ^= 0 then do;

         /* but the personId is not mapped to the group */

         persons.add();

      end;

   end;

   else do;

      if persons.find() ^= 0 then do;

         /* no mappings exist */

         GroupId = NextGroupId;

         NextGroupId = NextGroupId + 1;

         persons.add();

         policies.add();

      end;

      else do;

         /* person-group mapping exists, but policy-group-mapping is missing */

         policies.add();

      end;

   end;

   if last then do;

      policies.output(dataset: 'work.PolicyGroupMapping');

      persons.output(dataset: 'work.PersonGroupMapping');

   end;

run;

Ksharp
Super User

Sorry, @andreas_lds . I am not tended to pick you up . Just curious your simple code can do such complicated task.

But your code would take a look forwards. See the following code , they all should be together.

data have;

input PolicyId   PersonId;

datalines;

0 1009

0 1005

1 1000

1 1001

1 1002

2 1001

3 1000

4 1002

2 1003

2 1004

5 1003

4 1005

;

run;

andreas_lds
Jade | Level 19

You are right, my code does not provide the correct result for your example.

tingnz
Calcite | Level 5

Yes, you're correct that I should've posted my code, but I was trying to come up with an example that wasn't using our company specific column names.  I will try your code a test on my full dataset and let you know.  I will also try add my data to the this discussion, (if I can work out how to quickly change the values to mask, as I'm using company sensitive data and also size permitting.  In fact our compliance area might not allow me to post)

tingnz
Calcite | Level 5

This works beautifully.  My code went from taking 16 hours to run to 14 seconds!

Ksharp
Super User

Check

Code: Program

data have;
input from $  to $ ;
cards;
1 1000
1 1001
2 1001
3 1000
4 1002
2 1003
5 1003
4 1004
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
   h.definekey('node');
   h.definedata('node');
   h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node; output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;




data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp:20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node',hashexp:20);
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
/***********/
if 0 then set have;
declare hash k(dataset:'have');
k.definekey('from');
k.definedone();
/***********/

do while(hi_no.next()=0);
household+1; if k.check(key:node) ne 0 then output; /*********/
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
   key=to;ry=_ha.check();
   if ry ne 0 then do;
   node=to;if k.check(key:node) ne 0 then output; /*********/
   rr=no.remove(key:node);
   key=to;_ha.add();
   count+1;
   last=to;ha.add();
   end;
   rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
tingnz
Calcite | Level 5

Xia code works for his example, but only produces one of the datasets I'm looking for.

Ksharp
Super User

Can't you use that table to get another two tables ? That is easy I think.

tingnz
Calcite | Level 5

Xia's code does not work if the 'from' has the same values as 'to'

e.g.


data have;

input from $ to $;

cards;

10 10

10 11

1  10

2  20

;

run;


Ksharp
Super User

Are you sure ? OP is asking three tables. I just gave him one of them.

Check

Finde out who belong to the same household

Ksharp
Super User

OK. I will get these three tables for you . ( use my original code )

Code: Program

data have;
input from $  to $ ;
cards;
1 1000
1 1001
2 1001
3 1000
4 1002
2 1003
5 1003
4 1004
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
   h.definekey('node');
   h.definedata('node');
   h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node; output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 16);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();


do while(hi_no.next()=0);
household+1; output;
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
   key=to;ry=_ha.check();
   if ry ne 0 then do;
   node=to;output;rr=no.remove(key:node);
   key=to;_ha.add();
   count+1;
   last=to;ha.add();
   end;
   rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

data want1 ;
if _n_ eq 1 then do;
  if 0 then set want;
  declare hash ha(dataset:'want');
  ha.definekey('node');
  ha.definedata('household');
  ha.definedone();
end;
set have;
ha.find(key:from);
run;
proc sort data=want1(drop=to) out=want2 nodupkey;by from household;run;
proc sort data=want1(drop=from) out=want3 nodupkey;by to household;run;

Xia Keshan

tingnz
Calcite | Level 5

This amended code works and a whole lot faster than the original answer given in comment 2.  On my data, using code from comment 2, it took 1 hour 20 minutes and was dropping records.  The amended code above took 9 minutes and all policies and persons were accounted for.

tingnz
Calcite | Level 5

The code still doesn't work if we have "from" values the same as "to" values, e.g.

 

data have;

input from $ to $ ;

cards;

1 0

1 1

2 1

3 0

4 2

2 3

5 3

4 4

;

 

These are all getting assigned the same household, instead of separate households.

tingnz
Calcite | Level 5

I find that if I prefix either the 'From' or the 'To' with a letter then code works

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 1750 views
  • 0 likes
  • 3 in conversation