I received participant data from 3 independent research sites. If an individual was seen at multiple sites, they will have received a different ID from each site. In order to identify which participants are the same individuals across sites, I ran their demographic info through a probabilistic matching software that compares two records at a time. The software outputs a file like this:
data have;
input subjectid_1 subjectid_2;
cards;
1 5
2 7
3 6
4 5
7 8
;
run;
*For example, subjects 1 and 5 are the same person.;
*Notes:
1) The sites do not use each other's IDs (e.g. SUBJECTID_1=7 is definitely the same person as SUBJECTID_2=7)
2) Assume the matching software results are correct.
;
I now want any given individual to only be counted in one row, with all their IDs listed in the columns:
data want;
input subjectid_1 subjectid_2 subjectid_3;
cards;
1 5 4
2 7 8
3 6
;
run;
Is there a way to do this? After it's organized this way, I can assign them my own standardized ID and know how many unique individuals there are.
After searching the forum a bit I found code @Ksharp proposed for a similar problem here that should return what you're after.
data have;
input from to;
cards;
1 5
2 7
3 6
4 5
7 8
;
data full;
set have end=last;
if _n_ eq 1 then
do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from;
h.replace();
from=to;
to=node;
output;
node=from;
h.replace();
if last then
h.output(dataset:'node');
drop node;
run;
data want(keep=node group_id);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
do while(hi_no.next()=0);
group_id+1;
output;
count=1;
key=node;
_ha.add();
last=node;
ha.add();
rc=hi.first();
do while(rc=0);
from=last;
rx=from_to.find();
do while(rx=0);
key=to;
ry=_ha.check();
if ry ne 0 then
do;
node=to;
output;
rr=no.remove(key:node);
key=to;
_ha.add();
count+1;
last=to;
ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();
_ha.clear();
end;
stop;
run;
proc print data=want;
run;
After searching the forum a bit I found code @Ksharp proposed for a similar problem here that should return what you're after.
data have;
input from to;
cards;
1 5
2 7
3 6
4 5
7 8
;
data full;
set have end=last;
if _n_ eq 1 then
do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from;
h.replace();
from=to;
to=node;
output;
node=from;
h.replace();
if last then
h.output(dataset:'node');
drop node;
run;
data want(keep=node group_id);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
do while(hi_no.next()=0);
group_id+1;
output;
count=1;
key=node;
_ha.add();
last=node;
ha.add();
rc=hi.first();
do while(rc=0);
from=last;
rx=from_to.find();
do while(rx=0);
key=to;
ry=_ha.check();
if ry ne 0 then
do;
node=to;
output;
rr=no.remove(key:node);
key=to;
_ha.add();
count+1;
last=to;
ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();
_ha.clear();
end;
stop;
run;
proc print data=want;
run;
This is awesome - thank you, @Patrick and @Ksharp!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.