Solved: Re: Matching data

Krysia24 · Posted 07-24-2018 06:35 PM

Does anyone have insight as to how to match/link two datasets in multiple steps? Essentially, I would like to try a "deterministic match" where first I try matching two datasets by the SSN varialbe and if that works, those matched records output, but for the remaining records that do not match I then to try matching by a different unique identifier such as FIRSTNAME_LASTNAME and then again for anything that matches, those records are outputted, and then finally for those records that still did not match in the previous two steps, I try a different unique identifier to match on and output matches.

Trying to gauge what the best approach would be for something like this...

hashman · Posted 07-25-2018 03:59 AM

Suppose you have data akin to the sample data sets below. In other words, 2 records from D1 match on all fields, 2 records match on (FN,LN) and ID only, and 2 records match on ID only.

data d1 ;                          
  input SSN1 (FN1 LN1) (:$8.) ID1 ;
  cards ;                          
111111111   Samuel  Smith     1    
111111110   Samuel  Smith     1    
111111110   Sam     Smith     1    
111111110   Sam     Smith     0    
333333333   Sarah   Jones     3    
333333330   Sarah   Jones     3    
333333330   Sara    Jones     3    
333333330   Sara    Jones     0    
run ;                              
                                   
data d2 ;                          
  input SSN2 (FN2 LN2) (:$8.) ID2 ;
  cards ;                          
111111111   Samuel   Smith     1   
222222222   David    Wilson    2   
333333333   Sarah    Jones     3   
run ;

A brute force approach could be as follows. The matching DS1 records will have Match=1 and not matching will have Match=0.

proc sql ;                        
  create table match as           
  select *                        
       , ID2 is not null as Match 
  from   d1 left join d2          
  on     SSN1 = SSN2              
  or     (FN1 = FN2 and LN1 = LN2)
  or     ID1 = ID2                
  ;                               
quit ;

One problem with this frontal SQL attack (and it can be a big problem if you have sizable data) is that SQL forms a Cartesian product before the ON clause can be satisfied. Luckily, we have the hash object, by means of which we can search for extra keys only if matching on the previous keys has failed:

data match ;                                              
  if _n_ = 1 then do ;                                    
    dcl hash x () ;                                       
    x.definekey ("SSN2") ;                                
    x.definedata ("p") ;                                  
    x.definedone () ;                                     
                                                          
    dcl hash y () ;                                       
    y.definekey ("FN2", "LN2") ;                          
    y.definedata ("p") ;                                  
    y.definedone () ;                                     
                                                          
    dcl hash z () ;                                       
    z.definekey ("ID2") ;                                 
    z.definedata ("p") ;                                  
    z.definedone () ;                                     
                                                          
    if 0 then set d1 ;                                    
    do p = 1 by 1 until (z2) ;                            
      set d2 end = z2 ;                                   
      x.ref() ;                                           
      y.ref() ;                                           
      z.ref() ;                                           
    end ;                                                 
  end ;                                                   
                                                          
  call missing (of _all_) ;                             
  set d1 ;                                               
  Match = x.find(key:SSN1) eq 0 ;                         
  if not Match then Match = y.find(key:FN1,key:LN1) eq 0 ;
  if not Match then Match = z.find(key:ID1) eq 0 ;        
  if Match then set d2 point = p ;                        
run ;

It may seem a bit verbose but there's nothing really complex about it: We merely load different keys from D2 into different hashes and, for every record from D1, first search the table for SSN. If we haven't got a match, we look at the table with the first and last name, and so on. At the end, we have either Match=1 or Match=0. If we have a match, we retrieve the satellite variables from D2 by the obs number. (Instead, we could have them in the data portions of the hashes, but I find it cleaner this way.) IF 0 THEN SET D1 is not mandatory: It's there only to put the variables from D1 in front of the variables from D2 in the output.

HTH

Paul D.

View solution in original post

ballardw · Posted 07-24-2018 07:05 PM

Try searching the CDC.gov web site for "Link Plus". That is a program that does a probabilistic match on multiple variables simultaneous. This a FREE program that you already paid for with your taxes if a US citizen.

The input is two text files not difficult to create from SAS.

Krysia24 · Posted 07-24-2018 07:07 PM

Thank you for the suggestion. I do want to try the probabilistic for comparison, but in this case I want to employ deterministic as well just to see how the two linkage methods would compare.

hashman · Posted 07-25-2018 03:59 AM

Suppose you have data akin to the sample data sets below. In other words, 2 records from D1 match on all fields, 2 records match on (FN,LN) and ID only, and 2 records match on ID only.

data d1 ;                          
  input SSN1 (FN1 LN1) (:$8.) ID1 ;
  cards ;                          
111111111   Samuel  Smith     1    
111111110   Samuel  Smith     1    
111111110   Sam     Smith     1    
111111110   Sam     Smith     0    
333333333   Sarah   Jones     3    
333333330   Sarah   Jones     3    
333333330   Sara    Jones     3    
333333330   Sara    Jones     0    
run ;                              
                                   
data d2 ;                          
  input SSN2 (FN2 LN2) (:$8.) ID2 ;
  cards ;                          
111111111   Samuel   Smith     1   
222222222   David    Wilson    2   
333333333   Sarah    Jones     3   
run ;

A brute force approach could be as follows. The matching DS1 records will have Match=1 and not matching will have Match=0.

proc sql ;                        
  create table match as           
  select *                        
       , ID2 is not null as Match 
  from   d1 left join d2          
  on     SSN1 = SSN2              
  or     (FN1 = FN2 and LN1 = LN2)
  or     ID1 = ID2                
  ;                               
quit ;

One problem with this frontal SQL attack (and it can be a big problem if you have sizable data) is that SQL forms a Cartesian product before the ON clause can be satisfied. Luckily, we have the hash object, by means of which we can search for extra keys only if matching on the previous keys has failed:

data match ;                                              
  if _n_ = 1 then do ;                                    
    dcl hash x () ;                                       
    x.definekey ("SSN2") ;                                
    x.definedata ("p") ;                                  
    x.definedone () ;                                     
                                                          
    dcl hash y () ;                                       
    y.definekey ("FN2", "LN2") ;                          
    y.definedata ("p") ;                                  
    y.definedone () ;                                     
                                                          
    dcl hash z () ;                                       
    z.definekey ("ID2") ;                                 
    z.definedata ("p") ;                                  
    z.definedone () ;                                     
                                                          
    if 0 then set d1 ;                                    
    do p = 1 by 1 until (z2) ;                            
      set d2 end = z2 ;                                   
      x.ref() ;                                           
      y.ref() ;                                           
      z.ref() ;                                           
    end ;                                                 
  end ;                                                   
                                                          
  call missing (of _all_) ;                             
  set d1 ;                                               
  Match = x.find(key:SSN1) eq 0 ;                         
  if not Match then Match = y.find(key:FN1,key:LN1) eq 0 ;
  if not Match then Match = z.find(key:ID1) eq 0 ;        
  if Match then set d2 point = p ;                        
run ;

It may seem a bit verbose but there's nothing really complex about it: We merely load different keys from D2 into different hashes and, for every record from D1, first search the table for SSN. If we haven't got a match, we look at the table with the first and last name, and so on. At the end, we have either Match=1 or Match=0. If we have a match, we retrieve the satellite variables from D2 by the obs number. (Instead, we could have them in the data portions of the hashes, but I find it cleaner this way.) IF 0 THEN SET D1 is not mandatory: It's there only to put the variables from D1 in front of the variables from D2 in the output.

HTH

Paul D.

Krysia24 · Posted 07-25-2018 07:02 PM

Wow! The hash method works great. Just as you warned, because I have a lot of records, the SQL option was not as ideal.

I am running into one problem with the hash method, though, and that is that records with SSN's that are missing in both datasets are matching (despite not having other variables in common). Is there room to incorporate if/then logic into this code? So essentially, I do not want missing SSN's to match in the first step - it should move onto the next identifier step instead (first and last name) and only match if there's a commonality there. Thanks!

hashman · Posted 07-25-2018 07:26 PM

This is DATA step logic, so everything is adjustable. Just replace the line:

  Match = x.find(key:SSN1) eq 0 ;

with:

  Match = x.find(key:SSN1) eq 0 and not missing (SSN1) ;

and you should be fine. Alternatively, you can replace it with 2 lines:

  if missing (SSN1) then Match = 0 ;  
  else Match = x.find(key:SSN1) eq 0 ;

You get the picture.

Best

Paul D.

Krysia24 · Posted 07-27-2018 01:50 PM

Thank you! This worked out great! I appreciate learning about the hash tables.

Krysia24 · Posted 07-30-2018 03:46 PM

Hi again,

Apologies for resurfacing this but upon some QA I am realizing I have an issue with the code that I'm not sure what the cause of it is.

Here is the code that I ran based on what you provided:

data match ;
if _n_ = 1 then do ;
    dcl hash x () ;
    x.definekey ("MOMSSN") ;
    x.definedata ("p") ;
    x.definedone () ;

    dcl hash y () ;
    y.definekey ("BMOMFIRSTNAME", "BMOMLASTNAME", "momyeardob") ;
    y.definedata ("p") ;
    y.definedone () ;


    if 0 then set deaths2 ;
    do p = 1 by 1 until (y2) ;
      set births2 end = y2 ;
      x.ref() ;
      y.ref() ;
    end ;
end ;

call missing (of _all_) ;
set deaths2 ;
if (SSN) = . then Match = 0 ; /*first trying to match on SSN alone and do not want matches on missing SSN's alone*/
else Match = x.find(key:SSN) eq 0 ;
if not Match then Match = y.find(key:firstname,key:lastname,key:dbirthyear) eq 0 ; /*if there was not a match on SSN, trying to match on first name.
last name, and year of birth*/
if Match then set births2 point = p ;
run ;

Now, in the output dataset MATCH - there is a key variable (datedeath) that I need for further analysis - that went missing for a lot of
observations. The source dataset for this variable is deaths2 and in deaths2, none of the observations have missing data for this
variable. It appears that most of the observations that have missing data for datedeath in the matched dataset MATCH now are the variables that have match =1. (And again - given that the source data does not have any observations missing this information, this is odd. Is there something in the match step that would have caused this, even though I do not cite that specific variable at all?

Also, one thing of note (not sure if it's relevant) - there should only be one death record for each individual BUT they can potentially
be in the matched dataset (births2) multiple times if they've had multiple births...

Thank you for your help! I thought I was grasping this but this error is leaving me a bit confused.

Classroom Training Available!