BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
RobVoss
Fluorite | Level 6

I need to merge two files of patient data, each verified to have only one row per person. Each file contains name, SSN, and MedID, but each field contains many missings and patients that don't overlap both files. I'd like to combine like this:

if SSN matches, then merge,

else if MedID matches then merge,

else if name matches then merge.

 

I thought to do each step and save only the matches, but then when I combinded to three output datasets I'd have a lot of duplicate records, since some might match on more than one criteria.

 

Is there a way to push the match flag (in=x) back to the original datasets? Then I could embark on step 2 while omitting those records already matched.

 

I'm not new to SAS, but this really has me stumped.

 

Greatly appreciate any help that's out there.

 -r

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

You could use SQL:

 

proc sql;
create table merged as
select ....
from 
	set1 inner join
	set2 on set1.SSN=set2.SSN or set1.MedID=set2.MedID or set1.name=set2.name;
quit;
PG

View solution in original post

9 REPLIES 9
PGStats
Opal | Level 21

You could use SQL:

 

proc sql;
create table merged as
select ....
from 
	set1 inner join
	set2 on set1.SSN=set2.SSN or set1.MedID=set2.MedID or set1.name=set2.name;
quit;
PG
RobVoss
Fluorite | Level 6

Sorry to be slow with my gratitude here but, THANKS!

This was just the solution I needed.

I expanded the "ON" portion to accomodate missing values, like this:

proc sql ;
create table m3 as
select * from h full join c 
	on (
		(h.ssn=c.ssn and c.ssn ne "")
		or
		(h.cin =c.cin and c.cin ne "")or
		(h.name =c.name and c.name ne "")
		);
quit;

I'm including a WORD Attachment of code that includes sample datasets in case anyone finds this later and wants to explore. Sorry I didn't include code to begin with, would have made post easier to follow.

Thanks again.

Best,

rob

 

RobVoss
Fluorite | Level 6

One last thought.

My attached code (both the inner of full joins) does result in duplicates, so beware if you try it. I'm currently trying to sort this out: learning about fuzzy matching and match scores. The big takeaway for me is that this can't be solved in a data step but the SQL join offers lots of possibilities to match on several vars sequentially while saving a measure (match scores) that can be used to rank the matches and trim down to the final result. Code by KSharp might get to this result too.

-r

Reeza
Super User

The difference between a SQL solution and  @Ksharp hash solution will most likely come down to speed.

The hash solution should be faster, but you trade it for code complexity. 

 

If you're doing this once stick with the SQL. If you need to set up a process, explore the hash solution.

RobVoss
Fluorite | Level 6

thanks for these: they look very comprehensive and will probably be useful in future.

-r

Ksharp
Super User


data x;
input Application_ID   (Email_ID  IP_ID Address_ID phone_ID) ( :$20.);
cards;
1                            email1    ip1       address1     phone1    
2                            email2    ip2        address1    phone2
3                            email3   ip2         address2    phone5
4                            email5    ip1         address3     phone13
5                            email1  ip13       address13  phone13
11 email21 ip21 address21 phone21
12 email22 ip21 address22 phone22
13 email22 ip22 address23 phone23
;
run;

data have;
 set x;
 array x{*} $ Email_ID  IP_ID Address_ID phone_ID;
 length from to $ 100;

if cmiss(of x{*})=dim(x) then delete;
 else if cmiss(of x{*})=1 then do;
     from=coalescec(of x{*});to=uuidgen(0);output; 
 end;
  else do;

          do i=1 to dim(x)-1;
           from=x{i};
            do j=i+1 to dim(x);
             to=x{j};
             if not missing(from) and not missing(to) then output;
             end;
          end;

       end;


 keep from to;
run;




data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

data final_want;
if _n_=1 then do;
 if 0 then set want;
 declare hash h(dataset:'want');
 h.definekey('node');
 h.definedata('household');
 h.definedone();
end;
 set x;
 array x{*} $ Email_ID  IP_ID Address_ID phone_ID;
 do i=1 to dim(x);
  node=x{i};call missing(household);
  if h.find()=0 then leave;
 end;
 drop i node;
run;
proc print noobs;run;
RobVoss
Fluorite | Level 6

Thank you so much for offering this up. It's kind of intimidating, so I went with the SQL solution offered up by PGStats. It seems to work. With much respect though I ask, do you think your solution offers capabilities that that simple SQL solution does not? I'm sure you would agree that simple is better, but not if it means giving up a lot.

 

thanks!

-r

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 989 views
  • 3 likes
  • 4 in conversation