Solved: Re: HASH JOIN

GPatel · Posted 03-12-2020 07:28 PM

I have three data sets:

(1) S1 (n=441000 Unique records KEY (= ID) with 22 columns : ID and A1:A21)
(2) S2 (n=1,467,661,903 records, with duplicate Key (= ID) records with 5 columns : ID, B, C, D and Action)
(3) S3 (n=905 Unique records with 3 columns : X, Action and Action_Desc)

SQL query:
==========

Proc SQL; Select S1.*,S3.Action_Desc from
S1 Left join S2 on s1.id=s2.id
left join s2.action=s3.action;
quit;

HASH CODE:
==========

DATA HASH_ACTION;
IF 0 THEN SET s1;
if _N_ = 1 then do;
declare hash HASH_NAME(dataset: "s1", multidata: 'y');
HASH_NAME.defineKEY("ID");
HASH_NAME.defineData (ALL:'YES');
HASH_NAME.defineDone();
END;
set s2(keep=ID B C D ACTIOn);
IF HASH_NAME.FIND(KEY:ID) = 0 THEN OUTPUT;
RUN;

Question:
(1) I got duplicate records HASH_ACTION dataset, how can I get only unique records.
(2) How can I embed : left join s2.action=s3.action into HASH Code.

Kurt_Bremser · Posted 03-13-2020 11:01 AM

So you need to prepare S2 first. Sort by id and action_date_in, and then in a data step keep only the last observation per id. Once you have 1:1 relationships, you'll get no further duplicates.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Patrick · Posted 03-12-2020 08:56 PM

If I understand your SQL code right then you want actually nothing else than adding column Action_Desc to your S1 table.

If so:

- You could load S2 into the hash (with default only loading first occurrence if duplicate keys but not throwing an error if duplicates).

Assumes that the number of distinct id's is not too high and that there is only one action per id

- Also load S3 into a hash

Something along the line of below (untested) code could work:

data want;
  if _n_=1 then
    do;
      if 0 then set S3(keep=action Action_Desc);
      dcl hash hS2(dataset:'s2(keep=id action)');
      hS2.defineKey('id');
      hS2.defineData('action');
      hS2.defineDone();
      dcl hash hS3(dataset:'S3(keep=action Action_Desc)');
      hS3.defineKey('action');
      hS3.defineData('Action_Desc');
      hS3.defineDone();
    end;
  call missing (of _all_);
  set s1;
  if hS2.find()=0 then hS3.find();
run;

GPatel · Posted 03-16-2020 09:51 AM

Thanks Patric, for your reply to my post.

Kurt_Bremser · Posted 03-13-2020 06:21 AM

In your dataset S2, do you have more than one distinct action per ID? If yes, you will need an iteration to get the same result you got in SQL.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

GPatel · Posted 03-13-2020 10:56 AM

Thanks for your response.

In my dataset S2, I do have more than one distinct action per ID.

I got duplicate records upon my Hash run.

I learned that, I need to eliminate duplicate records, hence, I need to group by ID and take maximum date for ACTION_DATE_IN and ACTION_DATE_OUT.

Kurt_Bremser · Posted 03-13-2020 11:01 AM

So you need to prepare S2 first. Sort by id and action_date_in, and then in a data step keep only the last observation per id. Once you have 1:1 relationships, you'll get no further duplicates.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

GPatel · Posted 03-16-2020 09:52 AM

Thanks Patric and KurtBremser, for your reply to my post. I am accepting both response as accepted solutions.

rajdeep · Posted 09-01-2020 04:27 PM

Hi KurtBremser,

I am actually encountering an error with below code for Left join with Hash objects.

I have total 2 columns in Rank_Base & 25 columns in Trans_Prod, both datasets are unsorted. Rank_Base is unique, but Trans_Prod is having the duplicate records of same party_number.

data c2 ;
if _n_ = 1 then do ;
if 0 then set Rank_Base ;
dcl hash Pr_Rank (dataset: "Rank_Base",multidata: "y") ;
Pr_Rank.definekey("party_number");
Pr_Rank.definedata(all:'Y');
Pr_Rank.definedone () ;
dcl hiter iter('Pr_Rank');
end ;
do _n_=1 by 1 until(last);
set Trans_Prod end=last;
array t(100) ; 
if Pr_Rank.find()= 0 then do;
if key not in t then t(_n_)=key;
output;
end;
if last then do;
do _n_=1 to dim(t);
if Pr_Rank.check(key:t(_n_))=0 then Pr_Rank.remove(key:t(_n_));
end;
rc = iter.first();
do while (rc = 0);
call missing(risk_rating);
output;
rc = iter.next();
end;
end;
end;
drop rc t:;
run;

Error Log:

25 data c2 ;
26 if _n_ = 1 then do ;
27 if 0 then set Rank_Base ;
28 dcl hash Pr_Rank (dataset: "Rank_Base") ;
29 Pr_Rank.definekey("party_number");
30 Pr_Rank.definedata(all:'Y');
31 Pr_Rank.definedone () ;
32 dcl hiter iter('Pr_Rank');
33 end ;
34 do _n_=1 by 1 until(last);
35 set Trans_Prod end=last;
36 array t(100) ; /*this is arbitrary subscript number*/
37 if Pr_Rank.find()= 0 then do;
38 if key not in t then t(_n_)=key;
39 output;
40 end;
41 if last then do;
42 do _n_=1 to dim(t);
43 if Pr_Rank.check(key:t(_n_))=0 then Pr_Rank.remove(key:t(_n_));
44 end;
45 rc = iter.first();
46 do while (rc = 0);
47 call missing(risk_rating);
48 output;
49 rc = iter.next();
50 end;
51 end;
52 end;
53 drop rc t:;
54 run;

NOTE: Variable key is uninitialized.
2 The SAS System 10:56 Tuesday, September 1, 2020

NOTE: There were 3990011 observations read from the data set RANK_BASE.
ERROR: Type mismatch for key variable party_number at line 43 column 5.
ERROR: Keys and data must be specified in the same order and have the same types as given in DefineKey/DefineData at line 43 column 
5.
ERROR: DATA STEP Component Object failure. Aborted during the EXECUTION phase.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 145185092 observations read from the data set TRANS_PROD.
WARNING: The data set C2 may be incomplete. When this step was stopped there were 105030825 observations and 26 variables.
WARNING: Data set C2 was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 5:11.08
cpu time 3:34.62

The SQL equivalent query for the above:

select distinct a.*, b.risk_rating
from Trans_Prod a left join Rank_Base b
on strip(a.party_number)=strip(b.party_number)

I am just trying to map the Risk_rating column, from Rank_Base table with hash objects, but I want the distinct result get fetched in the output with all columns from left table and risk_rating column from right table. Also I need 145185092 observations in the output, but the hashing is deleting some records I believe.

I don't want to sort as the Trans_Prod is having 140 Million data & Risk_Base is 39 Million data.

Please help.

Thanks

Rajdeep

Kurt_Bremser · Posted 09-01-2020 04:48 PM

Please post your question in a new thread, do not hijack other's.

Post the whole log of the step, with line numbers.

Supply example data in data steps with datalines.

Looks like your variables are of different type in the datasets.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

rajdeep · Posted 09-01-2020 08:24 PM

I think I was missing the call missing statement in my code.

data want;
set Trans_Prod;
if _n_ = 1 then do ;
if 0 then set Rank_Base ;
dcl hash Pr_Rank (dataset: "Rank_Base") ;
Pr_Rank.definekey("party_number");
Pr_Rank.definedata(all:'Y');
Pr_Rank.definedone () ;
end;
if Pr_Rank.find() ne 0 then call missing(RISK_RATING);
run;

This code is working as per the expectation like left join.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away