Solved: Re: Multiple left join using SAS hash objects

DAppelbaum83 · Posted 04-25-2021 07:09 PM

As an alternative to PROC SQL for speed, I want to use hash tables to left join multiple tables.

I have 4 tables - A, B, C, D. Tables B, C, and D each have their own 'key' fields that table A possesses as well. I am trying to left join B, C and D onto A on the basis of this. The below is my code, it is based on this article: https://www.lexjansen.com/nesug/nesug06/dm/da07.pdf

DATA COMBINED;

IF 0 THEN SET A B C D;

IF _N_ = 1 THEN DO;

DECLARE HASH B_(dataset: 'B');

B_.defineKey('B_key');

B_.defineData('B_data');

B_.defineDone();

DECLARE HASH C_(dataset: 'C');

C_.defineKey('C_key');

C_.defineData('C_data');

C_.defineDone();

DECLARE HASH C_(dataset: 'D');

D_.defineKey('D_key');

D_.defineData('D_data');

D_.defineDone();

END;

DO UNTIL (EOF);

SET A END = EOF

IF B_.FIND(KEY:'A_KEY')=0

THEN OUTPUT;

ELSE IF C_.FIND(KEY:'A_KEY')=0

THEN OUTPUT;

ELSE IF D_.FIND(KEY:'A_KEY')=0

THEN OUTPUT;

ELSE DO;

CALL MISSING(OF 'B_data', 'C_data', 'D_data');

OUTPUT;

END;

STOP;

RUN;

The trouble is at the end; the iteration through the table 'A' doesn't go the way I'd like, and it produces some strange results (like values being filled even though I wouldn't expect there to be a hash key match). Does anyone know of the correct way to iterate through a table and match multiple hash keys to simulate a SQL left join?

Much appreciated.

Patrick · Posted 04-26-2021 02:54 AM

Getting sample data via a fully working SAS datastep would allow respondents to spend the time on your actual question. Done that now anyway for you - see below.

The important bit is the call missing() before the Set of the master table. This to always re-initiate the lookup columns as else the values for these get retained in case of an unsuccessful lookup (because not re-initiated via the table in the set statement where these columns don't exist).

data a;
  input (A1 A2 A3) ($);
  datalines;
A BB 53 
B CC 24 
C FF 34 
D RR 33 
;
data b;
  input (B1 B2) ($);
  datalines;
A 4
D 3
;

data c;
  input (C1 C2) ($);
  datalines;
BB 7
CC 9
;
data d;
  input (D1 D2) ($);
  datalines;
24 1
34 5 
;

/**
Table B, joined on A1 = B1:
Table C, joined on A2 = C1
Table D, joined on A3 = D1
**/

data want(drop=b1 c1 d1 _rc);
  if _n_=1 then
    do;
      /* map tables */
      if 0 then set a b c d;
      /* define and load hashes */
      dcl hash h_b(dataset:'b');
      h_b.defineKey('b1');
      h_b.defineData('b2');
      h_b.defineDone();
      dcl hash h_c(dataset:'c');
      h_c.defineKey('c1');
      h_c.defineData('c2');
      h_c.defineDone();
      dcl hash h_d(dataset:'d');
      h_d.defineKey('d1');
      h_d.defineData('d2');
      h_d.defineDone();
    end;
  call missing(of _all_);

  set a;
  _rc=h_b.find(key:a1);
  _rc=h_c.find(key:a2);
  _rc=h_d.find(key:a3);
run;

proc print data=want;
run;

View solution in original post

mkeintz · Posted 04-25-2021 11:06 PM

In the "do until (eof)" do group, the only time you clear B_DATA, C_DATA, and D_DATA is when a given A_KEY in not in any of the hash objects.

Put another way, a series of consecutive successful FINDs to any sequence of the hash objects means the output data will always have the most recent successful find's from each of the objects. I.e. values are "filled" even when you don't have a successful match to a given hash object. You could have B_DATA from the current key, C_DATA from the prior key, and D_DATA from the key prior to that.

Is that what you want?

Please provide a working DATA step with sample input data, and the results you intend to generate from that sample.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

DAppelbaum83 · Posted 04-26-2021 02:21 AM

Thank you - I think that is indeed the cause of the issue I was describing. I am very new to SAS and I'm not sure how I could fix it - any ideas?

The below is a simple illustration of what I'm trying to do:

Table A:

A1 A2 A3

A BB 53

B CC 24

C FF 34

D RR 33

Table B, joined on A1 = B1:

B1 B2

A 4

D 3

Table C, joined on A2 = C1

C1 C2

BB 7

CC 9

Table D, joined on A3 = D1

D1 D2

24 1

34 5

The final result of left joins would be:

A1 A2 A3 B1 C1 D1

A BB 53 4 7 .

B CC 24 . 9 1

C FF 34 . . 5

D RR 33 4 . .

Any help is much appreciated.

Patrick · Posted 04-26-2021 02:54 AM

Getting sample data via a fully working SAS datastep would allow respondents to spend the time on your actual question. Done that now anyway for you - see below.

The important bit is the call missing() before the Set of the master table. This to always re-initiate the lookup columns as else the values for these get retained in case of an unsuccessful lookup (because not re-initiated via the table in the set statement where these columns don't exist).

data a;
  input (A1 A2 A3) ($);
  datalines;
A BB 53 
B CC 24 
C FF 34 
D RR 33 
;
data b;
  input (B1 B2) ($);
  datalines;
A 4
D 3
;

data c;
  input (C1 C2) ($);
  datalines;
BB 7
CC 9
;
data d;
  input (D1 D2) ($);
  datalines;
24 1
34 5 
;

/**
Table B, joined on A1 = B1:
Table C, joined on A2 = C1
Table D, joined on A3 = D1
**/

data want(drop=b1 c1 d1 _rc);
  if _n_=1 then
    do;
      /* map tables */
      if 0 then set a b c d;
      /* define and load hashes */
      dcl hash h_b(dataset:'b');
      h_b.defineKey('b1');
      h_b.defineData('b2');
      h_b.defineDone();
      dcl hash h_c(dataset:'c');
      h_c.defineKey('c1');
      h_c.defineData('c2');
      h_c.defineDone();
      dcl hash h_d(dataset:'d');
      h_d.defineKey('d1');
      h_d.defineData('d2');
      h_d.defineDone();
    end;
  call missing(of _all_);

  set a;
  _rc=h_b.find(key:a1);
  _rc=h_c.find(key:a2);
  _rc=h_d.find(key:a3);
run;

proc print data=want;
run;

DAppelbaum83 · Posted 04-26-2021 08:09 PM

Thank you! This works well. Sorry about the DATA statements, I am new to SAS and didn't know how to do that.

How are you able to set _rc to equal more than one thing?

Patrick · Posted 04-27-2021 05:17 PM

Variable _rc in my code is just used to capture the return code from calling the hash find() method so it doesn't get written to the SAS log. As I'm not doing anything with _rc I also don't care if consecutive method calls overwrite what previous calls already populated.

Multiple left join using SAS hash objects

Re: Multiple left join using SAS hash objects

Re: Multiple left join using SAS hash objects

Re: Multiple left join using SAS hash objects

Re: Multiple left join using SAS hash objects

Re: Multiple left join using SAS hash objects

Re: Multiple left join using SAS hash objects

Registration is open

SAS Training: Just a Click Away