BookmarkSubscribeRSS Feed
DipuRahman
Calcite | Level 5

I have two tables: TABLE_A and TABLE_B

 

Both tables TABLE_A and TABLE_B are updated daily. I’d like to determine which observations in TABLE_A are “affected” by updates in TABLE_B. This is very specific business requirement. What makes it challenging is that there is no key that can be used to join the tables directly. The PROC SQL I have takes more than half hour to complete. See the code below. Wondering if anyone has a better solution.

 

Note on the columns. Let’s say TABLE_B has a column called code_1, while TABLE_A has a corresponding column code_1_list. Values in code_1 column are alpha-numerical, while code_1_list values are quoted, and comma separated. For example:

TABLE_B

code_1

code1_A

 

TABLE_A:

code_1_list

‘code1_A’, ‘code1_B’, ‘code1_C’

 

I need to determine: if *any* observation in TABLE_B is in the list of corresponding column in TABLE_A, I need to include it (or keep/identify it).

By the way, both tables TABLE_A and TABLE_B have 25 of these corresponding columns. Both tables TABLE_A and TABLE_B have ~15k observations on any given day.

Here is a limited example (added notes for clarification):

TABLE_B

ID

code_1

code_2

code_3

code_4

Note (corresponding to TABLE_A)

123

code1_A

code2_B

code3_W

code4_D

In ID: A1, code_1_list and

A3, code_2_list

124

code1_E

code2_B

code3_P

code4_R

In ID: A3, code_2_list

125

code1_H

code2_J

code3_K

code4_E

 

126

code1_A

code2_P

code3_A

code4_K

In ID: A1, code_1_list
also ID: A3 code_2_list

 

TABLE_A

ID

code_1_list

code_2_list

code_3_list

code_4_list

note

A1

'code1_A', 'code1_C'

'code2_A', 'code2_K', 'code2_G'

'code3_F'

'code4_A', 'code4_L'

keep: ID 123, code_1 in TABLE_B

A2

'code1_B', 'code1_C', 'code1_J'

'code2_A', 'code2_G'

'code3_L'

'code4_I', 'code4_R'

 

A3

'code1_A'

'code2_B', 'code2_P'

'code3_F'

'code4_A', 'code4_M'

keep: ID 123, code_1 and
ID 123 & 124, code_2

 

From the example above, only observations A1 and A3 are kept in TABLE_A.

Your thoughts? It seems there might be a much simpler solution that is escaping me...

 

SQL:

proc sql;

     create table

           work.table_ab as

     select distinct

           A.*

     from

           work.TABLE_A A,

           work.TABLE_B B

     where

           (

                A.code_1_list like (cats("'%",B.code_1,"%'"))

                or

                A.code_2_list like (cats("'%",B.code_2,"%'"))

                or

                A.code_3_list like (cats("'%",B.code_3,"%'"))

                or

                A.code_4_list like (cats("'%",B.code_4,"%'"))

           )

     ;

quit;

2 REPLIES 2
ChrisNZ
Tourmaline | Level 20

This could be faster:

where index(A.code_1_list ,B.code_1)
   or index(A.code_2_list ,B.code_2)
   or index(A.code_3_list ,B.code_3)
   or index(A.code_4_list ,B.code_4)

You might need to trim B values if the field is too long for its value

 

s_lassen
Meteorite | Level 14

A data step solution using keyed lookup may be faster.

 

If your data looks something like this:

data b;                                                            
  input id (code_1-code_4) ($);                                    
cards;                                                             
123 code1_A code2_B code3_W code4_D                                
124 code1_E code2_B code3_P code4_R                                
125 code1_H code2_J code3_K code4_E                                
126 code1_A code2_P code3_A code4_K                                
;run;                                                              
data a;                                                            
  infile cards dsd delimiter=';' truncover;                        
  length ID $2 code_1_list code_2_list code_3_list $20;            
  input ID--code_3_list;                                           
cards4;                                                            
A1;'code1_A', 'code1_C';'code2_A', 'code2_K', 'code2_G';'code3_F'  
A2;'code1_B', 'code1_C', 'code1_J';code2_A', 'code2_G';'code3_L'   
A3;'code1_A';'code2_B', 'code2_P';'code3_F';                       
;;;;run;

You first transpose and index the codes in B:

data codes(index=(idx=(num code)) keep=num code);
  set b;                                         
  array codes(*) code_:;                         
  do num=1 to dim(codes);                        
    code=codes(num);                             
    output;                                      
    end;

run;

Then you use keyed lookup to check the A dataset:

data want;                             
  set a;                               
  array lists(*) code_:;               
  do num=1 to dim(lists);              
    do _N_=1 to countw(lists(num));    
      code=scan(lists(num),_N_,"', "); 
      set codes key=idx/unique;        
      if not _iorc_ then do;           
        output;                        
        _error_=0;                     
        delete;                        
        end;                           
      end;                             
    end;                               
  _error_=0;                           
  drop num code;                       
run;

The "_error_=0" statement comes in because whenever a code is not found, it generates an "error". Which you probably do not want to see in the log.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 828 views
  • 0 likes
  • 3 in conversation