Solved: Re: Merging with multiple, incomplet by variables

RobVoss · Posted 03-30-2017 08:43 PM

I need to merge two files of patient data, each verified to have only one row per person. Each file contains name, SSN, and MedID, but each field contains many missings and patients that don't overlap both files. I'd like to combine like this:

if SSN matches, then merge,

else if MedID matches then merge,

else if name matches then merge.

I thought to do each step and save only the matches, but then when I combinded to three output datasets I'd have a lot of duplicate records, since some might match on more than one criteria.

Is there a way to push the match flag (in=x) back to the original datasets? Then I could embark on step 2 while omitting those records already matched.

I'm not new to SAS, but this really has me stumped.

Greatly appreciate any help that's out there.

-r

PGStats · Posted 03-30-2017 10:45 PM

You could use SQL:

proc sql;
create table merged as
select ....
from 
	set1 inner join
	set2 on set1.SSN=set2.SSN or set1.MedID=set2.MedID or set1.name=set2.name;
quit;

PG

View solution in original post

PGStats · Posted 03-30-2017 10:45 PM

You could use SQL:

proc sql;
create table merged as
select ....
from 
	set1 inner join
	set2 on set1.SSN=set2.SSN or set1.MedID=set2.MedID or set1.name=set2.name;
quit;

PG

RobVoss · Posted 04-05-2017 11:29 AM

Sorry to be slow with my gratitude here but, THANKS!

This was just the solution I needed.

I expanded the "ON" portion to accomodate missing values, like this:

proc sql ;
create table m3 as
select * from h full join c 
	on (
		(h.ssn=c.ssn and c.ssn ne "")
		or
		(h.cin =c.cin and c.cin ne "")or
		(h.name =c.name and c.name ne "")
		);
quit;

I'm including a WORD Attachment of code that includes sample datasets in case anyone finds this later and wants to explore. Sorry I didn't include code to begin with, would have made post easier to follow.

Thanks again.

Best,

rob

RobVoss · Posted 04-05-2017 03:28 PM

One last thought.

My attached code (both the inner of full joins) does result in duplicates, so beware if you try it. I'm currently trying to sort this out: learning about fuzzy matching and match scores. The big takeaway for me is that this can't be solved in a data step but the SQL join offers lots of possibilities to match on several vars sequentially while saving a measure (match scores) that can be used to rank the matches and trim down to the final result. Code by KSharp might get to this result too.

-r

Reeza · Posted 04-05-2017 03:48 PM

The difference between a SQL solution and @Ksharp hash solution will most likely come down to speed.

The hash solution should be faster, but you trade it for code complexity.

If you're doing this once stick with the SQL. If you need to set up a process, explore the hash solution.

Reeza · Posted 03-30-2017 11:46 PM

http://www.the-link-king.com

Link Plus from CDC

https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm

RobVoss · Posted 04-05-2017 11:32 AM

thanks for these: they look very comprehensive and will probably be useful in future.

-r

Ksharp · Posted 03-31-2017 10:57 AM

https://communities.sas.com/t5/General-SAS-Programming/How-to-find-link-between-nodes/td-p/341819

Check the last post of mine at the first page.

Ksharp · Posted 03-31-2017 11:16 AM



data x;
input Application_ID   (Email_ID  IP_ID Address_ID phone_ID) ( :$20.);
cards;
1                            email1    ip1       address1     phone1    
2                            email2    ip2        address1    phone2
3                            email3   ip2         address2    phone5
4                            email5    ip1         address3     phone13
5                            email1  ip13       address13  phone13
11 email21 ip21 address21 phone21
12 email22 ip21 address22 phone22
13 email22 ip22 address23 phone23
;
run;

data have;
 set x;
 array x{*} $ Email_ID  IP_ID Address_ID phone_ID;
 length from to $ 100;

if cmiss(of x{*})=dim(x) then delete;
 else if cmiss(of x{*})=1 then do;
     from=coalescec(of x{*});to=uuidgen(0);output; 
 end;
  else do;

          do i=1 to dim(x)-1;
           from=x{i};
            do j=i+1 to dim(x);
             to=x{j};
             if not missing(from) and not missing(to) then output;
             end;
          end;

       end;


 keep from to;
run;




data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

data final_want;
if _n_=1 then do;
 if 0 then set want;
 declare hash h(dataset:'want');
 h.definekey('node');
 h.definedata('household');
 h.definedone();
end;
 set x;
 array x{*} $ Email_ID  IP_ID Address_ID phone_ID;
 do i=1 to dim(x);
  node=x{i};call missing(household);
  if h.find()=0 then leave;
 end;
 drop i node;
run;
proc print noobs;run;

RobVoss · Posted 04-05-2017 11:37 AM

Thank you so much for offering this up. It's kind of intimidating, so I went with the SQL solution offered up by PGStats. It seems to work. With much respect though I ask, do you think your solution offers capabilities that that simple SQL solution does not? I'm sure you would agree that simple is better, but not if it means giving up a lot.

thanks!

-r

Classroom Training Available!