BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
DAppelbaum83
Fluorite | Level 6

As an alternative to PROC SQL for speed, I want to use hash tables to left join multiple tables. 

I have 4 tables - A, B, C, D. Tables B, C, and D each have their own 'key' fields that table A possesses as well. I am trying to left join B, C and D onto A on the basis of this. The below is my code, it is based on this article: https://www.lexjansen.com/nesug/nesug06/dm/da07.pdf

 

DATA COMBINED;

IF 0 THEN SET A B C D;

IF _N_ = 1 THEN DO;

 

DECLARE HASH B_(dataset: 'B');

     B_.defineKey('B_key');

     B_.defineData('B_data');

     B_.defineDone();

DECLARE HASH C_(dataset: 'C');

     C_.defineKey('C_key');

     C_.defineData('C_data');

     C_.defineDone();

DECLARE HASH C_(dataset: 'D');

     D_.defineKey('D_key');

     D_.defineData('D_data');

     D_.defineDone();

END;

 

DO UNTIL (EOF);

    SET A END = EOF

    IF B_.FIND(KEY:'A_KEY')=0

       THEN OUTPUT;

    ELSE IF C_.FIND(KEY:'A_KEY')=0

       THEN OUTPUT;

    ELSE IF D_.FIND(KEY:'A_KEY')=0

       THEN OUTPUT;

    ELSE DO; 

        CALL MISSING(OF 'B_data', 'C_data', 'D_data');

        OUTPUT;

     END;

    END;

   STOP;

RUN;

 

The trouble is at the end; the iteration through the table 'A' doesn't go the way I'd like, and it produces some strange results (like values being filled even though I wouldn't expect there to be a hash key match). Does anyone know of the correct way to iterate through a table and match multiple hash keys to simulate a SQL left join?

 

Much appreciated.

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

Getting sample data via a fully working SAS datastep would allow respondents to spend the time on your actual question. Done that now anyway for you - see below.

The important bit is the call missing() before the Set of the master table. This to always re-initiate the lookup columns as else the values for these get retained in case of an unsuccessful lookup (because not re-initiated via the table in the set statement where these columns don't exist).

data a;
  input (A1 A2 A3) ($);
  datalines;
A BB 53 
B CC 24 
C FF 34 
D RR 33 
;
data b;
  input (B1 B2) ($);
  datalines;
A 4
D 3
;

data c;
  input (C1 C2) ($);
  datalines;
BB 7
CC 9
;
data d;
  input (D1 D2) ($);
  datalines;
24 1
34 5 
;

/**
Table B, joined on A1 = B1:
Table C, joined on A2 = C1
Table D, joined on A3 = D1
**/

data want(drop=b1 c1 d1 _rc);
  if _n_=1 then
    do;
      /* map tables */
      if 0 then set a b c d;
      /* define and load hashes */
      dcl hash h_b(dataset:'b');
      h_b.defineKey('b1');
      h_b.defineData('b2');
      h_b.defineDone();
      dcl hash h_c(dataset:'c');
      h_c.defineKey('c1');
      h_c.defineData('c2');
      h_c.defineDone();
      dcl hash h_d(dataset:'d');
      h_d.defineKey('d1');
      h_d.defineData('d2');
      h_d.defineDone();
    end;
  call missing(of _all_);

  set a;
  _rc=h_b.find(key:a1);
  _rc=h_c.find(key:a2);
  _rc=h_d.find(key:a3);
run;

proc print data=want;
run;

 

View solution in original post

5 REPLIES 5
mkeintz
PROC Star

In the "do until (eof)" do group, the only time you clear B_DATA, C_DATA, and D_DATA is when a given A_KEY in not in any of the hash objects.

 

Put another way, a series of consecutive successful FINDs to any sequence of the hash objects means the output data will always have the most recent successful find's from each of the objects.  I.e. values are "filled" even when you don't have a successful match to a given hash object.  You could have  B_DATA from the current key, C_DATA from the prior key, and D_DATA from the key prior to that.

 

Is that what you want?

 

Please provide a working DATA step with sample input data, and the results you intend to generate from that sample.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
DAppelbaum83
Fluorite | Level 6

Thank you - I think that is indeed the cause of the issue I was describing. I am very new to SAS and I'm not sure how I could fix it - any ideas?

 

The below is a simple illustration of what I'm trying to do:

 

Table A:

 

A1  A2   A3 

A    BB   53  

B    CC   24  

C    FF   34    

D    RR  33   

 

Table B, joined on A1 = B1:

 

B1 B2

A    4

D   3

 

Table C, joined on A2 = C1

 

C1   C2

BB   7

CC   9

 

Table D, joined on A3 = D1

 

D1   D2

24    1

34    5 

 

The final result of left joins would be:

 

A1  A2   A3  B1  C1 D1 

A    BB   53   4    7    .

B    CC   24   .    9    1

C    FF   34    .    .     5

D    RR  33    4    .     .

 

 Any help is much appreciated. 

Patrick
Opal | Level 21

Getting sample data via a fully working SAS datastep would allow respondents to spend the time on your actual question. Done that now anyway for you - see below.

The important bit is the call missing() before the Set of the master table. This to always re-initiate the lookup columns as else the values for these get retained in case of an unsuccessful lookup (because not re-initiated via the table in the set statement where these columns don't exist).

data a;
  input (A1 A2 A3) ($);
  datalines;
A BB 53 
B CC 24 
C FF 34 
D RR 33 
;
data b;
  input (B1 B2) ($);
  datalines;
A 4
D 3
;

data c;
  input (C1 C2) ($);
  datalines;
BB 7
CC 9
;
data d;
  input (D1 D2) ($);
  datalines;
24 1
34 5 
;

/**
Table B, joined on A1 = B1:
Table C, joined on A2 = C1
Table D, joined on A3 = D1
**/

data want(drop=b1 c1 d1 _rc);
  if _n_=1 then
    do;
      /* map tables */
      if 0 then set a b c d;
      /* define and load hashes */
      dcl hash h_b(dataset:'b');
      h_b.defineKey('b1');
      h_b.defineData('b2');
      h_b.defineDone();
      dcl hash h_c(dataset:'c');
      h_c.defineKey('c1');
      h_c.defineData('c2');
      h_c.defineDone();
      dcl hash h_d(dataset:'d');
      h_d.defineKey('d1');
      h_d.defineData('d2');
      h_d.defineDone();
    end;
  call missing(of _all_);

  set a;
  _rc=h_b.find(key:a1);
  _rc=h_c.find(key:a2);
  _rc=h_d.find(key:a3);
run;

proc print data=want;
run;

 

DAppelbaum83
Fluorite | Level 6

Thank you! This works well. Sorry about the DATA statements, I am new to SAS and didn't know how to do that.

 

How are you able to set _rc to equal more than one thing? 

Patrick
Opal | Level 21

Variable _rc in my code is just used to capture the return code from calling the hash find() method so it doesn't get written to the SAS log. As I'm not doing anything with _rc I also don't care if consecutive method calls overwrite what previous calls already populated.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 2829 views
  • 2 likes
  • 3 in conversation