DATA Step, Macro, Functions and more

Merging with multiple, incomplet by variables

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 5
Accepted Solution

Merging with multiple, incomplet by variables

I need to merge two files of patient data, each verified to have only one row per person. Each file contains name, SSN, and MedID, but each field contains many missings and patients that don't overlap both files. I'd like to combine like this:

if SSN matches, then merge,

else if MedID matches then merge,

else if name matches then merge.

 

I thought to do each step and save only the matches, but then when I combinded to three output datasets I'd have a lot of duplicate records, since some might match on more than one criteria.

 

Is there a way to push the match flag (in=x) back to the original datasets? Then I could embark on step 2 while omitting those records already matched.

 

I'm not new to SAS, but this really has me stumped.

 

Greatly appreciate any help that's out there.

 -r


Accepted Solutions
Solution
‎04-05-2017 11:21 AM
Respected Advisor
Posts: 4,644

Re: Merging with multiple, incomplet by variables

You could use SQL:

 

proc sql;
create table merged as
select ....
from 
	set1 inner join
	set2 on set1.SSN=set2.SSN or set1.MedID=set2.MedID or set1.name=set2.name;
quit;
PG

View solution in original post


All Replies
Solution
‎04-05-2017 11:21 AM
Respected Advisor
Posts: 4,644

Re: Merging with multiple, incomplet by variables

You could use SQL:

 

proc sql;
create table merged as
select ....
from 
	set1 inner join
	set2 on set1.SSN=set2.SSN or set1.MedID=set2.MedID or set1.name=set2.name;
quit;
PG
Occasional Contributor
Posts: 5

Re: Merging with multiple, incomplet by variables

Sorry to be slow with my gratitude here but, THANKS!

This was just the solution I needed.

I expanded the "ON" portion to accomodate missing values, like this:

proc sql ;
create table m3 as
select * from h full join c 
	on (
		(h.ssn=c.ssn and c.ssn ne "")
		or
		(h.cin =c.cin and c.cin ne "")or
		(h.name =c.name and c.name ne "")
		);
quit;

I'm including a WORD Attachment of code that includes sample datasets in case anyone finds this later and wants to explore. Sorry I didn't include code to begin with, would have made post easier to follow.

Thanks again.

Best,

rob

 

Occasional Contributor
Posts: 5

Re: Merging with multiple, incomplet by variables

One last thought.

My attached code (both the inner of full joins) does result in duplicates, so beware if you try it. I'm currently trying to sort this out: learning about fuzzy matching and match scores. The big takeaway for me is that this can't be solved in a data step but the SQL join offers lots of possibilities to match on several vars sequentially while saving a measure (match scores) that can be used to rank the matches and trim down to the final result. Code by KSharp might get to this result too.

-r

Super User
Posts: 17,785

Re: Merging with multiple, incomplet by variables

The difference between a SQL solution and  @Ksharp hash solution will most likely come down to speed.

The hash solution should be faster, but you trade it for code complexity. 

 

If you're doing this once stick with the SQL. If you need to set up a process, explore the hash solution.

Super User
Posts: 17,785

Re: Merging with multiple, incomplet by variables

Occasional Contributor
Posts: 5

Re: Merging with multiple, incomplet by variables

thanks for these: they look very comprehensive and will probably be useful in future.

-r

Super User
Posts: 9,676

Re: Merging with multiple, incomplet by variables

Super User
Posts: 9,676

Re: Merging with multiple, incomplet by variables



data x;
input Application_ID   (Email_ID  IP_ID Address_ID phone_ID) ( :$20.);
cards;
1                            email1    ip1       address1     phone1    
2                            email2    ip2        address1    phone2
3                            email3   ip2         address2    phone5
4                            email5    ip1         address3     phone13
5                            email1  ip13       address13  phone13
11 email21 ip21 address21 phone21
12 email22 ip21 address22 phone22
13 email22 ip22 address23 phone23
;
run;

data have;
 set x;
 array x{*} $ Email_ID  IP_ID Address_ID phone_ID;
 length from to $ 100;

if cmiss(of x{*})=dim(x) then delete;
 else if cmiss(of x{*})=1 then do;
     from=coalescec(of x{*});to=uuidgen(0);output; 
 end;
  else do;

          do i=1 to dim(x)-1;
           from=x{i};
            do j=i+1 to dim(x);
             to=x{j};
             if not missing(from) and not missing(to) then output;
             end;
          end;

       end;


 keep from to;
run;




data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

data final_want;
if _n_=1 then do;
 if 0 then set want;
 declare hash h(dataset:'want');
 h.definekey('node');
 h.definedata('household');
 h.definedone();
end;
 set x;
 array x{*} $ Email_ID  IP_ID Address_ID phone_ID;
 do i=1 to dim(x);
  node=x{i};call missing(household);
  if h.find()=0 then leave;
 end;
 drop i node;
run;
proc print noobs;run;
Occasional Contributor
Posts: 5

Re: Merging with multiple, incomplet by variables

Thank you so much for offering this up. It's kind of intimidating, so I went with the SQL solution offered up by PGStats. It seems to work. With much respect though I ask, do you think your solution offers capabilities that that simple SQL solution does not? I'm sure you would agree that simple is better, but not if it means giving up a lot.

 

thanks!

-r

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 9 replies
  • 175 views
  • 3 likes
  • 4 in conversation