I need to merge two files of patient data, each verified to have only one row per person. Each file contains name, SSN, and MedID, but each field contains many missings and patients that don't overlap both files. I'd like to combine like this:
if SSN matches, then merge,
else if MedID matches then merge,
else if name matches then merge.
I thought to do each step and save only the matches, but then when I combinded to three output datasets I'd have a lot of duplicate records, since some might match on more than one criteria.
Is there a way to push the match flag (in=x) back to the original datasets? Then I could embark on step 2 while omitting those records already matched.
I'm not new to SAS, but this really has me stumped.
Greatly appreciate any help that's out there.
-r
You could use SQL:
proc sql;
create table merged as
select ....
from
set1 inner join
set2 on set1.SSN=set2.SSN or set1.MedID=set2.MedID or set1.name=set2.name;
quit;
You could use SQL:
proc sql;
create table merged as
select ....
from
set1 inner join
set2 on set1.SSN=set2.SSN or set1.MedID=set2.MedID or set1.name=set2.name;
quit;
Sorry to be slow with my gratitude here but, THANKS!
This was just the solution I needed.
I expanded the "ON" portion to accomodate missing values, like this:
proc sql ;
create table m3 as
select * from h full join c
on (
(h.ssn=c.ssn and c.ssn ne "")
or
(h.cin =c.cin and c.cin ne "")or
(h.name =c.name and c.name ne "")
);
quit;
I'm including a WORD Attachment of code that includes sample datasets in case anyone finds this later and wants to explore. Sorry I didn't include code to begin with, would have made post easier to follow.
Thanks again.
Best,
rob
One last thought.
My attached code (both the inner of full joins) does result in duplicates, so beware if you try it. I'm currently trying to sort this out: learning about fuzzy matching and match scores. The big takeaway for me is that this can't be solved in a data step but the SQL join offers lots of possibilities to match on several vars sequentially while saving a measure (match scores) that can be used to rank the matches and trim down to the final result. Code by KSharp might get to this result too.
-r
The difference between a SQL solution and @Ksharp hash solution will most likely come down to speed.
The hash solution should be faster, but you trade it for code complexity.
If you're doing this once stick with the SQL. If you need to set up a process, explore the hash solution.
thanks for these: they look very comprehensive and will probably be useful in future.
-r
https://communities.sas.com/t5/General-SAS-Programming/How-to-find-link-between-nodes/td-p/341819
Check the last post of mine at the first page.
data x;
input Application_ID (Email_ID IP_ID Address_ID phone_ID) ( :$20.);
cards;
1 email1 ip1 address1 phone1
2 email2 ip2 address1 phone2
3 email3 ip2 address2 phone5
4 email5 ip1 address3 phone13
5 email1 ip13 address13 phone13
11 email21 ip21 address21 phone21
12 email22 ip21 address22 phone22
13 email22 ip22 address23 phone23
;
run;
data have;
set x;
array x{*} $ Email_ID IP_ID Address_ID phone_ID;
length from to $ 100;
if cmiss(of x{*})=dim(x) then delete;
else if cmiss(of x{*})=1 then do;
from=coalescec(of x{*});to=uuidgen(0);output;
end;
else do;
do i=1 to dim(x)-1;
from=x{i};
do j=i+1 to dim(x);
to=x{j};
if not missing(from) and not missing(to) then output;
end;
end;
end;
keep from to;
run;
data full;
set have end=last;
if _n_ eq 1 then do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from; h.replace();
from=to; to=node;
output;
node=from; h.replace();
if last then h.output(dataset:'node');
drop node;
run;
data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
do while(hi_no.next()=0);
household+1; output;
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
from=last;rx=from_to.find();
do while(rx=0);
key=to;ry=_ha.check();
if ry ne 0 then do;
node=to;output;rr=no.remove(key:node);
key=to;_ha.add();
count+1;
last=to;ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
data final_want;
if _n_=1 then do;
if 0 then set want;
declare hash h(dataset:'want');
h.definekey('node');
h.definedata('household');
h.definedone();
end;
set x;
array x{*} $ Email_ID IP_ID Address_ID phone_ID;
do i=1 to dim(x);
node=x{i};call missing(household);
if h.find()=0 then leave;
end;
drop i node;
run;
proc print noobs;run;
Thank you so much for offering this up. It's kind of intimidating, so I went with the SQL solution offered up by PGStats. It seems to work. With much respect though I ask, do you think your solution offers capabilities that that simple SQL solution does not? I'm sure you would agree that simple is better, but not if it means giving up a lot.
thanks!
-r
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.