BookmarkSubscribeRSS Feed
BaileyY
Obsidian | Level 7

Hi. 

 

Looking for some assistance.  I have two tables that I need combine to create a third.   In one table I have three column of keys- (key1-key3).  This table is automatically updated to consolidate and match using some fuzzy match resulting in a person's key being changed throughout the year from [oldkey1] to [newkey2] to a final key named [finalkey3].   This can happen when an individual visits different locations and are assigned different keys.   (see row 3 and row 4 #10082782) which resulted in a final key determination of 36678 for this one individual.)   

 

I have another table that I need to match against this three column key table.  This other table contains a [key] which can be matched against the first table to arrive at a finalkey which will be the official assignment for each person.  In this second table I have isolated just one person, but this person's data was entered under two different keys 10082792 & 71595)- but is ultimately person id#36678.  Does anyone know how I can hit this second table against the first table to run and check each column (oldkey1, newkey2) to arrive at a final key?   Any help would be appreciated.  Thank you in advance

 

 

table 1    
OLDKEY1 NEWKEY2 FINALKEY3
8172456 5230020 5230020
43015 8016157 8016157
2208 10082792 36678
71595 10082792 36678
85165 6572623 6572623
98048 7377116 7377116
102813 1405523 1405523
115732 2129398 2129398
     

 

table 2    
KEY PID PID2
10082792 2 8
71595 9 0
     

 

results table      
KEY PID PID2 finalkey
10082792 2 8 36678
71595 9 0 36678

 

5 REPLIES 5
lorenzo_hg
Fluorite | Level 6

 It's not the prettiest, but you may try the following:

 

PROC SQL;
    CREATE TABLE table3 AS
    SELECT t1.FINALKEY3 AS finalkey 
    FROM table1 AS t1
    LEFT JOIN table2 AS t2 WHERE (t2.KEY = t1.OLDKEY1 OR t2.KEY = t1.NEWKEY2);
QUIT;

 

mkeintz
PROC Star

Matching one var in a table against multiple vars in another table is a good situation for applying the WHICHN function (or WHICHC for character variables):

 

data table1     ;
  input OLDKEY1 NEWKEY2 FINALKEY3 ;
datalines;
8172456 5230020 5230020 
43015 8016157 8016157 
2208 10082792 36678 
71595 10082792 36678 
85165 6572623 6572623 
98048 7377116 7377116 
102813 1405523 1405523 
115732 2129398 2129398 
run;

data table2;     
  input KEY PID PID2 ;
datalines;
10082792 2 8 
71595 9 0 
run;

proc sql;
  create table want as 
    select distinct t2.*,t1.finalkey3
    from 
      table2 as t2
    left join
      table1 as t1
    on whichn(t2.key,t1.finalkey3,t1.oldkey1,t1.newkey2);
quit;

The select expression needs the DISTINCT to avoid duplicate results.  Otherwise the first record in table 2 matches 2 instances in table 1, both having the same FINALKEY3.

 

 

The WHICHN function compares the first argument to the list of subsequent arguments and returns the position of the first match.  If there are no matches, it returns 0.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
BaileyY
Obsidian | Level 7
Hi. Thank you so much for taking time to offer me help. I read up on this function and it sounds like it should do what I am expecting to be done. I have a quick question- Appears that this proc sql will be performing cartesian join and it has taken over 30 minutes already to run- and its still running. Each of the tables contain about 4+ million records. Do you know of a way to speed up the process?
mkeintz
PROC Star

Ah yes.  It occurs to me that using WHICHN might have convinced proc sql not to index table1 for lookup, which would result in using a cartesian join..  In that case, you can build your own lookup table incorporating all three variables as a single column (lookupkey):

 

proc sql;
  create table need as select oldkey1 as lookupkey,finalkey3 from table1 
       union           select newkey2 as lookupkey,finalkey3 from table1
       union           select finalkey3 as lookupkey,finalkey3 from table1;
  create table want as select t2.*, n.finalkey3 from table2 as t2 left join need as n on t2.key=n.lookupkey;
quit;

 

I think that PROC SQL might decide to make an index of need based on lookupkey.   See if that's faster.

 

Also, the nice thing about the UNION operator is that it removes all duplicate   LOOKUPKEY/FINALKEY combinations.

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mkeintz
PROC Star

This data step code builds its own lookup table as a hash object.  It might be faster than the SQL:

 

data want;
  set table1 (in=int1)  table2;
  if _n_=1 then do;
    declare hash h ();
      h.definekey('lookupkey');
      h.definedata('finalkey3');
      h.definedone();
  end;
  ** If this is a table 1 record, add lookup keys to the hash object **;
  if int1 then do lookupkey=oldkey1,newkey2,finalkey3;
    rc=h.add();
  end;

  ** Now continue processing only table2 records **;
  if int1=0;
  rc=h.find(key:key);
  drop oldkey1--finalkey3 lookupkey rc;
run;

 

  

The RC (return code) variable yields a zero only when the FIND method or ADD method is successful.  I generate it not for testing purposes, but only to avoid error messages.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 798 views
  • 1 like
  • 3 in conversation