Re: Column search to produce final column

BaileyY · Posted 06-26-2018 08:04 PM

Hi.

Looking for some assistance. I have two tables that I need combine to create a third. In one table I have three column of keys- (key1-key3). This table is automatically updated to consolidate and match using some fuzzy match resulting in a person's key being changed throughout the year from [oldkey1] to [newkey2] to a final key named [finalkey3]. This can happen when an individual visits different locations and are assigned different keys. (see row 3 and row 4 #10082782) which resulted in a final key determination of 36678 for this one individual.)

I have another table that I need to match against this three column key table. This other table contains a [key] which can be matched against the first table to arrive at a finalkey which will be the official assignment for each person. In this second table I have isolated just one person, but this person's data was entered under two different keys 10082792 & 71595)- but is ultimately person id#36678. Does anyone know how I can hit this second table against the first table to run and check each column (oldkey1, newkey2) to arrive at a final key? Any help would be appreciated. Thank you in advance

table 1
OLDKEY1	NEWKEY2	FINALKEY3
8172456	5230020	5230020
43015	8016157	8016157
2208	10082792	36678
71595	10082792	36678
85165	6572623	6572623
98048	7377116	7377116
102813	1405523	1405523
115732	2129398	2129398

table 2
KEY	PID	PID2
10082792	2	8
71595	9	0

results table
KEY	PID	PID2	finalkey
10082792	2	8	36678
71595	9	0	36678

lorenzo_hg · Posted 06-27-2018 07:28 AM

It's not the prettiest, but you may try the following:

PROC SQL;
    CREATE TABLE table3 AS
    SELECT t1.FINALKEY3 AS finalkey 
    FROM table1 AS t1
    LEFT JOIN table2 AS t2 WHERE (t2.KEY = t1.OLDKEY1 OR t2.KEY = t1.NEWKEY2);
QUIT;

mkeintz · Posted 06-27-2018 11:24 AM

Matching one var in a table against multiple vars in another table is a good situation for applying the WHICHN function (or WHICHC for character variables):

data table1     ;
  input OLDKEY1 NEWKEY2 FINALKEY3 ;
datalines;
8172456 5230020 5230020 
43015 8016157 8016157 
2208 10082792 36678 
71595 10082792 36678 
85165 6572623 6572623 
98048 7377116 7377116 
102813 1405523 1405523 
115732 2129398 2129398 
run;

data table2;     
  input KEY PID PID2 ;
datalines;
10082792 2 8 
71595 9 0 
run;

proc sql;
  create table want as 
    select distinct t2.*,t1.finalkey3
    from 
      table2 as t2
    left join
      table1 as t1
    on whichn(t2.key,t1.finalkey3,t1.oldkey1,t1.newkey2);
quit;

The select expression needs the DISTINCT to avoid duplicate results. Otherwise the first record in table 2 matches 2 instances in table 1, both having the same FINALKEY3.

The WHICHN function compares the first argument to the list of subsequent arguments and returns the position of the first match. If there are no matches, it returns 0.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

BaileyY · Posted 06-27-2018 12:23 PM

Hi. Thank you so much for taking time to offer me help. I read up on this function and it sounds like it should do what I am expecting to be done. I have a quick question- Appears that this proc sql will be performing cartesian join and it has taken over 30 minutes already to run- and its still running. Each of the tables contain about 4+ million records. Do you know of a way to speed up the process?

mkeintz · Posted 06-27-2018 02:12 PM

Ah yes. It occurs to me that using WHICHN might have convinced proc sql not to index table1 for lookup, which would result in using a cartesian join.. In that case, you can build your own lookup table incorporating all three variables as a single column (lookupkey):

proc sql;
  create table need as select oldkey1 as lookupkey,finalkey3 from table1 
       union           select newkey2 as lookupkey,finalkey3 from table1
       union           select finalkey3 as lookupkey,finalkey3 from table1;
  create table want as select t2.*, n.finalkey3 from table2 as t2 left join need as n on t2.key=n.lookupkey;
quit;

I think that PROC SQL might decide to make an index of need based on lookupkey. See if that's faster.

Also, the nice thing about the UNION operator is that it removes all duplicate LOOKUPKEY/FINALKEY combinations.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mkeintz · Posted 06-27-2018 02:37 PM

This data step code builds its own lookup table as a hash object. It might be faster than the SQL:

data want;
  set table1 (in=int1)  table2;
  if _n_=1 then do;
    declare hash h ();
      h.definekey('lookupkey');
      h.definedata('finalkey3');
      h.definedone();
  end;
  ** If this is a table 1 record, add lookup keys to the hash object **;
  if int1 then do lookupkey=oldkey1,newkey2,finalkey3;
    rc=h.add();
  end;

  ** Now continue processing only table2 records **;
  if int1=0;
  rc=h.find(key:key);
  drop oldkey1--finalkey3 lookupkey rc;
run;

The RC (return code) variable yields a zero only when the FIND method or ADD method is successful. I generate it not for testing purposes, but only to avoid error messages.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Column search to produce final column