Re: Table Joins without a Specific Key

DipuRahman · Posted 11-29-2021 03:56 PM

I have two tables: TABLE_A and TABLE_B

Both tables TABLE_A and TABLE_B are updated daily. I’d like to determine which observations in TABLE_A are “affected” by updates in TABLE_B. This is very specific business requirement. What makes it challenging is that there is no key that can be used to join the tables directly. The PROC SQL I have takes more than half hour to complete. See the code below. Wondering if anyone has a better solution.

Note on the columns. Let’s say TABLE_B has a column called code_1, while TABLE_A has a corresponding column code_1_list. Values in code_1 column are alpha-numerical, while code_1_list values are quoted, and comma separated. For example:

TABLE_B

code_1

code1_A

TABLE_A:

code_1_list

‘code1_A’, ‘code1_B’, ‘code1_C’

I need to determine: if *any* observation in TABLE_B is in the list of corresponding column in TABLE_A, I need to include it (or keep/identify it).

By the way, both tables TABLE_A and TABLE_B have 25 of these corresponding columns. Both tables TABLE_A and TABLE_B have ~15k observations on any given day.

Here is a limited example (added notes for clarification):

TABLE_B

ID	code_1	code_2	code_3	code_4	Note (corresponding to TABLE_A)
123	code1_A	code2_B	code3_W	code4_D	In ID: A1, code_1_list and A3, code_2_list
124	code1_E	code2_B	code3_P	code4_R	In ID: A3, code_2_list
125	code1_H	code2_J	code3_K	code4_E
126	code1_A	code2_P	code3_A	code4_K	In ID: A1, code_1_list also ID: A3 code_2_list

TABLE_A

ID	code_1_list	code_2_list	code_3_list	code_4_list	note
A1	'code1_A', 'code1_C'	'code2_A', 'code2_K', 'code2_G'	'code3_F'	'code4_A', 'code4_L'	keep: ID 123, code_1 in TABLE_B
A2	'code1_B', 'code1_C', 'code1_J'	'code2_A', 'code2_G'	'code3_L'	'code4_I', 'code4_R'
A3	'code1_A'	'code2_B', 'code2_P'	'code3_F'	'code4_A', 'code4_M'	keep: ID 123, code_1 and ID 123 & 124, code_2

From the example above, only observations A1 and A3 are kept in TABLE_A.

Your thoughts? It seems there might be a much simpler solution that is escaping me...

SQL:

proc sql;

create table

work.table_ab as

select distinct

A.*

from

work.TABLE_A A,

work.TABLE_B B

where

(

A.code_1_list like (cats("'%",B.code_1,"%'"))

or

A.code_2_list like (cats("'%",B.code_2,"%'"))

or

A.code_3_list like (cats("'%",B.code_3,"%'"))

or

A.code_4_list like (cats("'%",B.code_4,"%'"))

)

;

quit;

ChrisNZ · Posted 11-30-2021 02:05 AM

This could be faster:

where index(A.code_1_list ,B.code_1)
   or index(A.code_2_list ,B.code_2)
   or index(A.code_3_list ,B.code_3)
   or index(A.code_4_list ,B.code_4)

You might need to trim B values if the field is too long for its value

High-Performance SAS Coding - Third Edition

s_lassen · Posted 11-30-2021 07:51 AM

A data step solution using keyed lookup may be faster.

If your data looks something like this:

data b;                                                            
  input id (code_1-code_4) ($);                                    
cards;                                                             
123 code1_A code2_B code3_W code4_D                                
124 code1_E code2_B code3_P code4_R                                
125 code1_H code2_J code3_K code4_E                                
126 code1_A code2_P code3_A code4_K                                
;run;                                                              
data a;                                                            
  infile cards dsd delimiter=';' truncover;                        
  length ID $2 code_1_list code_2_list code_3_list $20;            
  input ID--code_3_list;                                           
cards4;                                                            
A1;'code1_A', 'code1_C';'code2_A', 'code2_K', 'code2_G';'code3_F'  
A2;'code1_B', 'code1_C', 'code1_J';code2_A', 'code2_G';'code3_L'   
A3;'code1_A';'code2_B', 'code2_P';'code3_F';                       
;;;;run;

You first transpose and index the codes in B:

data codes(index=(idx=(num code)) keep=num code);
  set b;                                         
  array codes(*) code_:;                         
  do num=1 to dim(codes);                        
    code=codes(num);                             
    output;                                      
    end;

run;

Then you use keyed lookup to check the A dataset:

data want;                             
  set a;                               
  array lists(*) code_:;               
  do num=1 to dim(lists);              
    do _N_=1 to countw(lists(num));    
      code=scan(lists(num),_N_,"', "); 
      set codes key=idx/unique;        
      if not _iorc_ then do;           
        output;                        
        _error_=0;                     
        delete;                        
        end;                           
      end;                             
    end;                               
  _error_=0;                           
  drop num code;                       
run;

The "_error_=0" statement comes in because whenever a code is not found, it generates an "error". Which you probably do not want to see in the log.

Table Joins without a Specific Key

Re: Table Joins without a Specific Key

Re: Table Joins without a Specific Key

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away