Hi.
Looking for some assistance. I have two tables that I need combine to create a third. In one table I have three column of keys- (key1-key3). This table is automatically updated to consolidate and match using some fuzzy match resulting in a person's key being changed throughout the year from [oldkey1] to [newkey2] to a final key named [finalkey3]. This can happen when an individual visits different locations and are assigned different keys. (see row 3 and row 4 #10082782) which resulted in a final key determination of 36678 for this one individual.)
I have another table that I need to match against this three column key table. This other table contains a [key] which can be matched against the first table to arrive at a finalkey which will be the official assignment for each person. In this second table I have isolated just one person, but this person's data was entered under two different keys 10082792 & 71595)- but is ultimately person id#36678. Does anyone know how I can hit this second table against the first table to run and check each column (oldkey1, newkey2) to arrive at a final key? Any help would be appreciated. Thank you in advance
table 1 | ||
OLDKEY1 | NEWKEY2 | FINALKEY3 |
8172456 | 5230020 | 5230020 |
43015 | 8016157 | 8016157 |
2208 | 10082792 | 36678 |
71595 | 10082792 | 36678 |
85165 | 6572623 | 6572623 |
98048 | 7377116 | 7377116 |
102813 | 1405523 | 1405523 |
115732 | 2129398 | 2129398 |
table 2 | ||
KEY | PID | PID2 |
10082792 | 2 | 8 |
71595 | 9 | 0 |
results table | |||
KEY | PID | PID2 | finalkey |
10082792 | 2 | 8 | 36678 |
71595 | 9 | 0 | 36678 |
It's not the prettiest, but you may try the following:
PROC SQL; CREATE TABLE table3 AS SELECT t1.FINALKEY3 AS finalkey FROM table1 AS t1 LEFT JOIN table2 AS t2 WHERE (t2.KEY = t1.OLDKEY1 OR t2.KEY = t1.NEWKEY2); QUIT;
Matching one var in a table against multiple vars in another table is a good situation for applying the WHICHN function (or WHICHC for character variables):
data table1 ;
input OLDKEY1 NEWKEY2 FINALKEY3 ;
datalines;
8172456 5230020 5230020
43015 8016157 8016157
2208 10082792 36678
71595 10082792 36678
85165 6572623 6572623
98048 7377116 7377116
102813 1405523 1405523
115732 2129398 2129398
run;
data table2;
input KEY PID PID2 ;
datalines;
10082792 2 8
71595 9 0
run;
proc sql;
create table want as
select distinct t2.*,t1.finalkey3
from
table2 as t2
left join
table1 as t1
on whichn(t2.key,t1.finalkey3,t1.oldkey1,t1.newkey2);
quit;
The select expression needs the DISTINCT to avoid duplicate results. Otherwise the first record in table 2 matches 2 instances in table 1, both having the same FINALKEY3.
The WHICHN function compares the first argument to the list of subsequent arguments and returns the position of the first match. If there are no matches, it returns 0.
Ah yes. It occurs to me that using WHICHN might have convinced proc sql not to index table1 for lookup, which would result in using a cartesian join.. In that case, you can build your own lookup table incorporating all three variables as a single column (lookupkey):
proc sql;
create table need as select oldkey1 as lookupkey,finalkey3 from table1
union select newkey2 as lookupkey,finalkey3 from table1
union select finalkey3 as lookupkey,finalkey3 from table1;
create table want as select t2.*, n.finalkey3 from table2 as t2 left join need as n on t2.key=n.lookupkey;
quit;
I think that PROC SQL might decide to make an index of need based on lookupkey. See if that's faster.
Also, the nice thing about the UNION operator is that it removes all duplicate LOOKUPKEY/FINALKEY combinations.
This data step code builds its own lookup table as a hash object. It might be faster than the SQL:
data want;
set table1 (in=int1) table2;
if _n_=1 then do;
declare hash h ();
h.definekey('lookupkey');
h.definedata('finalkey3');
h.definedone();
end;
** If this is a table 1 record, add lookup keys to the hash object **;
if int1 then do lookupkey=oldkey1,newkey2,finalkey3;
rc=h.add();
end;
** Now continue processing only table2 records **;
if int1=0;
rc=h.find(key:key);
drop oldkey1--finalkey3 lookupkey rc;
run;
The RC (return code) variable yields a zero only when the FIND method or ADD method is successful. I generate it not for testing purposes, but only to avoid error messages.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.