As an alternative to PROC SQL for speed, I want to use hash tables to left join multiple tables.
I have 4 tables - A, B, C, D. Tables B, C, and D each have their own 'key' fields that table A possesses as well. I am trying to left join B, C and D onto A on the basis of this. The below is my code, it is based on this article: https://www.lexjansen.com/nesug/nesug06/dm/da07.pdf
DATA COMBINED;
IF 0 THEN SET A B C D;
IF _N_ = 1 THEN DO;
DECLARE HASH B_(dataset: 'B');
B_.defineKey('B_key');
B_.defineData('B_data');
B_.defineDone();
DECLARE HASH C_(dataset: 'C');
C_.defineKey('C_key');
C_.defineData('C_data');
C_.defineDone();
DECLARE HASH C_(dataset: 'D');
D_.defineKey('D_key');
D_.defineData('D_data');
D_.defineDone();
END;
DO UNTIL (EOF);
SET A END = EOF
IF B_.FIND(KEY:'A_KEY')=0
THEN OUTPUT;
ELSE IF C_.FIND(KEY:'A_KEY')=0
THEN OUTPUT;
ELSE IF D_.FIND(KEY:'A_KEY')=0
THEN OUTPUT;
ELSE DO;
CALL MISSING(OF 'B_data', 'C_data', 'D_data');
OUTPUT;
END;
END;
STOP;
RUN;
The trouble is at the end; the iteration through the table 'A' doesn't go the way I'd like, and it produces some strange results (like values being filled even though I wouldn't expect there to be a hash key match). Does anyone know of the correct way to iterate through a table and match multiple hash keys to simulate a SQL left join?
Much appreciated.
Getting sample data via a fully working SAS datastep would allow respondents to spend the time on your actual question. Done that now anyway for you - see below.
The important bit is the call missing() before the Set of the master table. This to always re-initiate the lookup columns as else the values for these get retained in case of an unsuccessful lookup (because not re-initiated via the table in the set statement where these columns don't exist).
data a;
input (A1 A2 A3) ($);
datalines;
A BB 53
B CC 24
C FF 34
D RR 33
;
data b;
input (B1 B2) ($);
datalines;
A 4
D 3
;
data c;
input (C1 C2) ($);
datalines;
BB 7
CC 9
;
data d;
input (D1 D2) ($);
datalines;
24 1
34 5
;
/**
Table B, joined on A1 = B1:
Table C, joined on A2 = C1
Table D, joined on A3 = D1
**/
data want(drop=b1 c1 d1 _rc);
if _n_=1 then
do;
/* map tables */
if 0 then set a b c d;
/* define and load hashes */
dcl hash h_b(dataset:'b');
h_b.defineKey('b1');
h_b.defineData('b2');
h_b.defineDone();
dcl hash h_c(dataset:'c');
h_c.defineKey('c1');
h_c.defineData('c2');
h_c.defineDone();
dcl hash h_d(dataset:'d');
h_d.defineKey('d1');
h_d.defineData('d2');
h_d.defineDone();
end;
call missing(of _all_);
set a;
_rc=h_b.find(key:a1);
_rc=h_c.find(key:a2);
_rc=h_d.find(key:a3);
run;
proc print data=want;
run;
In the "do until (eof)" do group, the only time you clear B_DATA, C_DATA, and D_DATA is when a given A_KEY in not in any of the hash objects.
Put another way, a series of consecutive successful FINDs to any sequence of the hash objects means the output data will always have the most recent successful find's from each of the objects. I.e. values are "filled" even when you don't have a successful match to a given hash object. You could have B_DATA from the current key, C_DATA from the prior key, and D_DATA from the key prior to that.
Is that what you want?
Please provide a working DATA step with sample input data, and the results you intend to generate from that sample.
Thank you - I think that is indeed the cause of the issue I was describing. I am very new to SAS and I'm not sure how I could fix it - any ideas?
The below is a simple illustration of what I'm trying to do:
Table A:
A1 A2 A3
A BB 53
B CC 24
C FF 34
D RR 33
Table B, joined on A1 = B1:
B1 B2
A 4
D 3
Table C, joined on A2 = C1
C1 C2
BB 7
CC 9
Table D, joined on A3 = D1
D1 D2
24 1
34 5
The final result of left joins would be:
A1 A2 A3 B1 C1 D1
A BB 53 4 7 .
B CC 24 . 9 1
C FF 34 . . 5
D RR 33 4 . .
Any help is much appreciated.
Getting sample data via a fully working SAS datastep would allow respondents to spend the time on your actual question. Done that now anyway for you - see below.
The important bit is the call missing() before the Set of the master table. This to always re-initiate the lookup columns as else the values for these get retained in case of an unsuccessful lookup (because not re-initiated via the table in the set statement where these columns don't exist).
data a;
input (A1 A2 A3) ($);
datalines;
A BB 53
B CC 24
C FF 34
D RR 33
;
data b;
input (B1 B2) ($);
datalines;
A 4
D 3
;
data c;
input (C1 C2) ($);
datalines;
BB 7
CC 9
;
data d;
input (D1 D2) ($);
datalines;
24 1
34 5
;
/**
Table B, joined on A1 = B1:
Table C, joined on A2 = C1
Table D, joined on A3 = D1
**/
data want(drop=b1 c1 d1 _rc);
if _n_=1 then
do;
/* map tables */
if 0 then set a b c d;
/* define and load hashes */
dcl hash h_b(dataset:'b');
h_b.defineKey('b1');
h_b.defineData('b2');
h_b.defineDone();
dcl hash h_c(dataset:'c');
h_c.defineKey('c1');
h_c.defineData('c2');
h_c.defineDone();
dcl hash h_d(dataset:'d');
h_d.defineKey('d1');
h_d.defineData('d2');
h_d.defineDone();
end;
call missing(of _all_);
set a;
_rc=h_b.find(key:a1);
_rc=h_c.find(key:a2);
_rc=h_d.find(key:a3);
run;
proc print data=want;
run;
Thank you! This works well. Sorry about the DATA statements, I am new to SAS and didn't know how to do that.
How are you able to set _rc to equal more than one thing?
Variable _rc in my code is just used to capture the return code from calling the hash find() method so it doesn't get written to the SAS log. As I'm not doing anything with _rc I also don't care if consecutive method calls overwrite what previous calls already populated.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.