I have data that is related via a many to many relationship and I want to find all items that are related to each other and give them a unique id.
My example data is listed below:
PolicyId PersonId
1 1000
1 1001
2 1001
3 1000
4 1002
2 1003
5 1003
4 1004
My desired output would either be
PolicyId PersonId GroupId
1 1000 1
1 1001 1
2 1001 1
3 1000 1
4 1002 2
2 1003 1
5 1003 1
4 1004 2
or
PolicyID GroupId
1 1
2 1
3 1
4 2
5 1
PersonId GroupId
1000 1
1001 1
1002 2
1003 1
1004 2
Do you have any suggestions on how to do this efficiently?
I have managed to code this using a Macro, which for each policy, finds all related persons, then for each person, recursively calls the macro again for all related policies (that have not been assigned a GroupId already). This works ok for a small number of values, but I have 1.9m observations in my dataset, so it performs very slowly. There a some performance gains by calculating the cases where they is only one-to-many relationship between Policy and Person in a separate step, i.e. Policy 4 above would not be passed to the macro. This means the number I'm checking reduces to about 174K.
OK. I will get these three tables for you . ( use my original code )
data have;
input from $ to $ ;
cards;
1 1000
1 1001
2 1001
3 1000
4 1002
2 1003
5 1003
4 1004
;
run;
data full;
set have end=last;
if _n_ eq 1 then do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from; h.replace();
from=to; to=node; output;
node=from; h.replace();
if last then h.output(dataset:'node');
drop node;
run;
data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 16);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
do while(hi_no.next()=0);
household+1; output;
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
from=last;rx=from_to.find();
do while(rx=0);
key=to;ry=_ha.check();
if ry ne 0 then do;
node=to;output;rr=no.remove(key:node);
key=to;_ha.add();
count+1;
last=to;ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
data want1 ;
if _n_ eq 1 then do;
if 0 then set want;
declare hash ha(dataset:'want');
ha.definekey('node');
ha.definedata('household');
ha.definedone();
end;
set have;
ha.find(key:from);
run;
proc sort data=want1(drop=to) out=want2 nodupkey;by from household;run;
proc sort data=want1(drop=from) out=want3 nodupkey;by to household;run;
Xia Keshan
You should have posted your sas code.
The following code uses two hash objects:
data have;
input PolicyId PersonId;
datalines;
1 1000
1 1001
2 1001
3 1000
4 1002
2 1003
5 1003
4 1004
;
run;
data _null_;
set work.have end=last;
length GroupId NextGroupId 8;
retain NextGroupId 1;
if _n_ = 1 then do;
declare hash policies(ordered: 'yes');
policies.defineKey('PolicyId');
policies.defineData('PolicyId', 'GroupId');
policies.defineDone();
declare hash persons(ordered: 'yes');
persons.defineKey('PersonId');
persons.defineData('PersonId', 'GroupId');
persons.defineDone();
end;
if policies.find() = 0 then do;
/* PolicyID is already mapped to a Group */
if persons.check() ^= 0 then do;
/* but the personId is not mapped to the group */
persons.add();
end;
end;
else do;
if persons.find() ^= 0 then do;
/* no mappings exist */
GroupId = NextGroupId;
NextGroupId = NextGroupId + 1;
persons.add();
policies.add();
end;
else do;
/* person-group mapping exists, but policy-group-mapping is missing */
policies.add();
end;
end;
if last then do;
policies.output(dataset: 'work.PolicyGroupMapping');
persons.output(dataset: 'work.PersonGroupMapping');
end;
run;
Sorry, @andreas_lds . I am not tended to pick you up . Just curious your simple code can do such complicated task.
But your code would take a look forwards. See the following code , they all should be together.
data have;
input PolicyId PersonId;
datalines;
0 1009
0 1005
1 1000
1 1001
1 1002
2 1001
3 1000
4 1002
2 1003
2 1004
5 1003
4 1005
;
run;
You are right, my code does not provide the correct result for your example.
Yes, you're correct that I should've posted my code, but I was trying to come up with an example that wasn't using our company specific column names. I will try your code a test on my full dataset and let you know. I will also try add my data to the this discussion, (if I can work out how to quickly change the values to mask, as I'm using company sensitive data and also size permitting. In fact our compliance area might not allow me to post)
This works beautifully. My code went from taking 16 hours to run to 14 seconds!
data have;
input from $ to $ ;
cards;
1 1000
1 1001
2 1001
3 1000
4 1002
2 1003
5 1003
4 1004
;
run;
data full;
set have end=last;
if _n_ eq 1 then do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from; h.replace();
from=to; to=node; output;
node=from; h.replace();
if last then h.output(dataset:'node');
drop node;
run;
data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp:20);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node',hashexp:20);
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
/***********/
if 0 then set have;
declare hash k(dataset:'have');
k.definekey('from');
k.definedone();
/***********/
do while(hi_no.next()=0);
household+1; if k.check(key:node) ne 0 then output; /*********/
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
from=last;rx=from_to.find();
do while(rx=0);
key=to;ry=_ha.check();
if ry ne 0 then do;
node=to;if k.check(key:node) ne 0 then output; /*********/
rr=no.remove(key:node);
key=to;_ha.add();
count+1;
last=to;ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
Xia code works for his example, but only produces one of the datasets I'm looking for.
Can't you use that table to get another two tables ? That is easy I think.
Xia's code does not work if the 'from' has the same values as 'to'
e.g.
data have;
input from $ to $;
cards;
10 10
10 11
1 10
2 20
;
run;
Are you sure ? OP is asking three tables. I just gave him one of them.
Check
OK. I will get these three tables for you . ( use my original code )
data have;
input from $ to $ ;
cards;
1 1000
1 1001
2 1001
3 1000
4 1002
2 1003
5 1003
4 1004
;
run;
data full;
set have end=last;
if _n_ eq 1 then do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from; h.replace();
from=to; to=node; output;
node=from; h.replace();
if last then h.output(dataset:'node');
drop node;
run;
data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 16);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
do while(hi_no.next()=0);
household+1; output;
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
from=last;rx=from_to.find();
do while(rx=0);
key=to;ry=_ha.check();
if ry ne 0 then do;
node=to;output;rr=no.remove(key:node);
key=to;_ha.add();
count+1;
last=to;ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
data want1 ;
if _n_ eq 1 then do;
if 0 then set want;
declare hash ha(dataset:'want');
ha.definekey('node');
ha.definedata('household');
ha.definedone();
end;
set have;
ha.find(key:from);
run;
proc sort data=want1(drop=to) out=want2 nodupkey;by from household;run;
proc sort data=want1(drop=from) out=want3 nodupkey;by to household;run;
Xia Keshan
This amended code works and a whole lot faster than the original answer given in comment 2. On my data, using code from comment 2, it took 1 hour 20 minutes and was dropping records. The amended code above took 9 minutes and all policies and persons were accounted for.
The code still doesn't work if we have "from" values the same as "to" values, e.g.
data have;
input from $ to $ ;
cards;
1 0
1 1
2 1
3 0
4 2
2 3
5 3
4 4
;
These are all getting assigned the same household, instead of separate households.
I find that if I prefix either the 'From' or the 'To' with a letter then code works
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.