I have two tables: TABLE_A and TABLE_B
Both tables TABLE_A and TABLE_B are updated daily. I’d like to determine which observations in TABLE_A are “affected” by updates in TABLE_B. This is very specific business requirement. What makes it challenging is that there is no key that can be used to join the tables directly. The PROC SQL I have takes more than half hour to complete. See the code below. Wondering if anyone has a better solution.
Note on the columns. Let’s say TABLE_B has a column called code_1, while TABLE_A has a corresponding column code_1_list. Values in code_1 column are alpha-numerical, while code_1_list values are quoted, and comma separated. For example:
TABLE_B
code_1 |
code1_A |
TABLE_A:
code_1_list |
‘code1_A’, ‘code1_B’, ‘code1_C’ |
I need to determine: if *any* observation in TABLE_B is in the list of corresponding column in TABLE_A, I need to include it (or keep/identify it).
By the way, both tables TABLE_A and TABLE_B have 25 of these corresponding columns. Both tables TABLE_A and TABLE_B have ~15k observations on any given day.
Here is a limited example (added notes for clarification):
TABLE_B
ID | code_1 | code_2 | code_3 | code_4 | Note (corresponding to TABLE_A) |
123 | code1_A | code2_B | code3_W | code4_D | In ID: A1, code_1_list and A3, code_2_list |
124 | code1_E | code2_B | code3_P | code4_R | In ID: A3, code_2_list |
125 | code1_H | code2_J | code3_K | code4_E |
|
126 | code1_A | code2_P | code3_A | code4_K | In ID: A1, code_1_list |
TABLE_A
ID | code_1_list | code_2_list | code_3_list | code_4_list | note |
A1 | 'code1_A', 'code1_C' | 'code2_A', 'code2_K', 'code2_G' | 'code3_F' | 'code4_A', 'code4_L' | keep: ID 123, code_1 in TABLE_B |
A2 | 'code1_B', 'code1_C', 'code1_J' | 'code2_A', 'code2_G' | 'code3_L' | 'code4_I', 'code4_R' |
|
A3 | 'code1_A' | 'code2_B', 'code2_P' | 'code3_F' | 'code4_A', 'code4_M' | keep: ID 123, code_1 and |
From the example above, only observations A1 and A3 are kept in TABLE_A.
Your thoughts? It seems there might be a much simpler solution that is escaping me...
SQL:
proc sql;
create table
work.table_ab as
select distinct
A.*
from
work.TABLE_A A,
work.TABLE_B B
where
(
A.code_1_list like (cats("'%",B.code_1,"%'"))
or
A.code_2_list like (cats("'%",B.code_2,"%'"))
or
A.code_3_list like (cats("'%",B.code_3,"%'"))
or
A.code_4_list like (cats("'%",B.code_4,"%'"))
)
;
quit;
This could be faster:
where index(A.code_1_list ,B.code_1)
or index(A.code_2_list ,B.code_2)
or index(A.code_3_list ,B.code_3)
or index(A.code_4_list ,B.code_4)
You might need to trim B values if the field is too long for its value
A data step solution using keyed lookup may be faster.
If your data looks something like this:
data b;
input id (code_1-code_4) ($);
cards;
123 code1_A code2_B code3_W code4_D
124 code1_E code2_B code3_P code4_R
125 code1_H code2_J code3_K code4_E
126 code1_A code2_P code3_A code4_K
;run;
data a;
infile cards dsd delimiter=';' truncover;
length ID $2 code_1_list code_2_list code_3_list $20;
input ID--code_3_list;
cards4;
A1;'code1_A', 'code1_C';'code2_A', 'code2_K', 'code2_G';'code3_F'
A2;'code1_B', 'code1_C', 'code1_J';code2_A', 'code2_G';'code3_L'
A3;'code1_A';'code2_B', 'code2_P';'code3_F';
;;;;run;
You first transpose and index the codes in B:
data codes(index=(idx=(num code)) keep=num code);
set b;
array codes(*) code_:;
do num=1 to dim(codes);
code=codes(num);
output;
end;
run;
Then you use keyed lookup to check the A dataset:
data want;
set a;
array lists(*) code_:;
do num=1 to dim(lists);
do _N_=1 to countw(lists(num));
code=scan(lists(num),_N_,"', ");
set codes key=idx/unique;
if not _iorc_ then do;
output;
_error_=0;
delete;
end;
end;
end;
_error_=0;
drop num code;
run;
The "_error_=0" statement comes in because whenever a code is not found, it generates an "error". Which you probably do not want to see in the log.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.