Joining similar data and arbitrarily forcing it to be 1:1 with no dupl...

lawatkey · Posted 11-03-2021 05:39 PM

DATA WORK.TABLE_1;
INPUT KEY AMOUNT AMOUNT_NUM $ DATE :DDMMYY10.;
FORMAT DATE DDMMYY10.;
CARDS;
1 10 000123 01/10/2021
2 10 000123 01/10/2021
3 10 000123 01/10/2021
;
RUN;

DATA WORK.TABLE_2;
INPUT CRIT AMOUNT AMOUNT_NUM $ DATE :DDMMYY10.;
FORMAT DATE DDMMYY10.;
CARDS;
1001 10 000123 01/10/2021
1002 10 000123 01/10/2021
1003 10 000123 01/10/2021
;
RUN;

DATA WORK.HASH_EXAMPLE;
SET WORK.TABLE_1;
CRIT=.;
IF _N_ = 1
THEN DO;
DECLARE HASH B (DATASET:"WORK.TABLE_2 (RENAME=(DATE=B_DATE))");
B.DEFINEKEY("AMOUNT_NUM","AMOUNT","B_DATE");
B.DEFINEDATA("CRIT");
B.DEFINEDONE();

END;
B_DATE = DATE;
RC = B.FIND();

DO WHILE (RC NE 0 AND B_DATE < DATE + 12);
B_DATE + 1;
RC = B.FIND();
END;
IF RC=0
THEN DO;
RC=B.REMOVE();
END;
DROP RC B_DATE;
RUN;

DATA WORK.TABLE_3;
INPUT CRIT AMOUNT AMOUNT_NUM $ DATE :DDMMYY10. CRIT;
FORMAT DATE DDMMYY10.;
CARDS;
1 10 000123 01/10/2021 1001
2 10 000123 01/10/2021 1002
3 10 000123 01/10/2021 1003
;
RUN;

So with the above program, I am starting with Table_1 as my source set of data, Table_2 is the table I am trying to left join into. The hash example is what I thought was the best way to accomplish the end output I wanted, which is table 3.

Wants: To take Table_1, and based on 3 criteria (one of them being a date that I want to join with a 12 day range), to get a matching "Crit". Anytime a Crit is matched to a given Key, I no longer want to evaluate the matched Key or Crit values. In theory, I would be fine if Key 1 matched with Crit 1003, per the example, as long as Keys 2 and 3 matched with Crits 1002 and 1001 in someway. Any thoughts on the best way of accomplishing this?

s_lassen · Posted 11-05-2021 07:20 AM

Note: I edited this reply so show the complete solution instead of just showing a suggested change (my second post).

You data are not very representative, as all records match. But here is a solution that may work:

First, create a table of all possible joins:

proc sql;                                       
  create table join as select                   
    table_1.key,table_2.crit as crit_2,count(*) as N      
  from table_1 join table_2                
  on table_1.amount=table_2.amount and          
     table_1.amount_num=table_2.amount_num and  
     abs(table_1.date-table_2.date) between 0 and 12
  group by table_2.crit                          
  ;                                             
quit;

Then sort by CRIT_2 and number of KEYs for that CRIT:

proc sort;             
  by crit_2 n;
run;

Create the output table, with an index:

data want(index=(key));
  set table_1;         
  crit=.;              
run;

Finally, update the output with CRIT values from the join:

data want;                                                        
  found=0;                                                        
  do until(last.crit_2);                                          
    set join;                                                     
    by crit_2;                                                    
    modify want key=key/unique; /* Changed: use /UNIQUE to be able to read the same key twice */                                          
    if found then continue; /* this CRIT has already been used */ 
    if crit=. then do; /* This KEY has not been matched yet */    
      crit=crit_2;                                                
      replace;                                                    
      found=1;  /* This CRIT has now been used */                                                  
      end;                                                        
    end;                                                          
run;

This solution is not guaranteed to get the best match, but because we use the joins with fewest possible matches first, it is a good approximation.

lawatkey · Posted 11-05-2021 08:39 AM

@s_lassen I like the program you posted, thank you! Unfortunately when applying to my actual datasets, I get "ERROR: No matching observation was found in MASTER data set." Not sure how it's possible based on the initial join criteria for table WANT. I even tried to make table WANT from Table_1 where the key was filtered based solely on the keys that matched in the JOIN table, no dice. I imagine the issue is if a Crit is matched to a Key and then another Key is evaluated where its only match was that last Crit that was matched, then it would have no remaining matches left. Is there a way to modify the IF statement in the last step to just leave the crit as blank or a . if that scenario happens?

s_lassen · Posted 11-08-2021 03:27 AM

I think I know what the problem is: you are looking for the same key twice, which may happen with real life data.

I think you should change

modify want key=key;

to

modify want key=key/unique;

That may help. I edited my first reply to show the complete solution after the change.

lawatkey · Posted 11-10-2021 09:00 AM

Thank you very much! The code works quite well, that's amazing 🙂 I might try to make it into a macro to start with a 0 day range to "prioritize" closer matches then waterfall down and expand the date range by 1 on each end per loop, unless you know of a simpler way, like an order by statement in the join, but if not what you've provided is still great, I really appreciate it.

s_lassen · Posted 11-10-2021 09:30 AM

I am not quite sure what you want to do with the "waterfall", but is it prioritizing the matches by the time interval?

Then the solution might be to order by the date difference as well:

proc sql;                                       
  create table join as select                   
    table_1.key,table_2.crit as crit_2,count(*) as N,
    abs(table_1.date-table_2.date) as date_diff     
  from table_1 join table_2                
  on table_1.amount=table_2.amount and          
     table_1.amount_num=table_2.amount_num and  
     abs(table_1.date-table_2.date) between 0 and 12
  group by table_2.crit                          
  ;                                             
quit;

Then use that for sorting as well:

proc sort;             
  by crit_2 date_diff n;
run;

And do the rest as in my original post.

Hope this is something like what you want!

Tom · Posted 11-10-2021 09:34 AM

I didn't read your whole thread but have you checked out Propensity Score matching algorithms? SAS even has a PSMATCH procedure now.

Joining similar data and arbitrarily forcing it to be 1:1 with no duplicates

Re: Joining similar data and arbitrarily forcing it to be 1:1 with no duplicates

Re: Joining similar data and arbitrarily forcing it to be 1:1 with no duplicates

Re: Joining similar data and arbitrarily forcing it to be 1:1 with no duplicates

Re: Joining similar data and arbitrarily forcing it to be 1:1 with no duplicates

Re: Joining similar data and arbitrarily forcing it to be 1:1 with no duplicates

Re: Joining similar data and arbitrarily forcing it to be 1:1 with no duplicates

Classroom Training Available!