I have three data sets:
(1) S1 (n=441000 Unique records KEY (= ID) with 22 columns : ID and A1:A21)
(2) S2 (n=1,467,661,903 records, with duplicate Key (= ID) records with 5 columns : ID, B, C, D and Action)
(3) S3 (n=905 Unique records with 3 columns : X, Action and Action_Desc)
SQL query:
==========
Proc SQL; Select S1.*,S3.Action_Desc from
S1 Left join S2 on s1.id=s2.id
left join s2.action=s3.action;
quit;
HASH CODE:
==========
DATA HASH_ACTION;
IF 0 THEN SET s1;
if _N_ = 1 then do;
declare hash HASH_NAME(dataset: "s1", multidata: 'y');
HASH_NAME.defineKEY("ID");
HASH_NAME.defineData (ALL:'YES');
HASH_NAME.defineDone();
END;
set s2(keep=ID B C D ACTIOn);
IF HASH_NAME.FIND(KEY:ID) = 0 THEN OUTPUT;
RUN;
Question:
(1) I got duplicate records HASH_ACTION dataset, how can I get only unique records.
(2) How can I embed : left join s2.action=s3.action into HASH Code.
So you need to prepare S2 first. Sort by id and action_date_in, and then in a data step keep only the last observation per id. Once you have 1:1 relationships, you'll get no further duplicates.
If I understand your SQL code right then you want actually nothing else than adding column Action_Desc to your S1 table.
If so:
- You could load S2 into the hash (with default only loading first occurrence if duplicate keys but not throwing an error if duplicates).
Assumes that the number of distinct id's is not too high and that there is only one action per id
- Also load S3 into a hash
Something along the line of below (untested) code could work:
data want;
if _n_=1 then
do;
if 0 then set S3(keep=action Action_Desc);
dcl hash hS2(dataset:'s2(keep=id action)');
hS2.defineKey('id');
hS2.defineData('action');
hS2.defineDone();
dcl hash hS3(dataset:'S3(keep=action Action_Desc)');
hS3.defineKey('action');
hS3.defineData('Action_Desc');
hS3.defineDone();
end;
call missing (of _all_);
set s1;
if hS2.find()=0 then hS3.find();
run;
Thanks Patric, for your reply to my post.
In your dataset S2, do you have more than one distinct action per ID? If yes, you will need an iteration to get the same result you got in SQL.
Thanks for your response.
In my dataset S2, I do have more than one distinct action per ID.
I got duplicate records upon my Hash run.
I learned that, I need to eliminate duplicate records, hence, I need to group by ID and take maximum date for ACTION_DATE_IN and ACTION_DATE_OUT.
So you need to prepare S2 first. Sort by id and action_date_in, and then in a data step keep only the last observation per id. Once you have 1:1 relationships, you'll get no further duplicates.
Thanks Patric and KurtBremser, for your reply to my post. I am accepting both response as accepted solutions.
Hi KurtBremser,
I am actually encountering an error with below code for Left join with Hash objects.
I have total 2 columns in Rank_Base & 25 columns in Trans_Prod, both datasets are unsorted. Rank_Base is unique, but Trans_Prod is having the duplicate records of same party_number.
data c2 ;
if _n_ = 1 then do ;
if 0 then set Rank_Base ;
dcl hash Pr_Rank (dataset: "Rank_Base",multidata: "y") ;
Pr_Rank.definekey("party_number");
Pr_Rank.definedata(all:'Y');
Pr_Rank.definedone () ;
dcl hiter iter('Pr_Rank');
end ;
do _n_=1 by 1 until(last);
set Trans_Prod end=last;
array t(100) ;
if Pr_Rank.find()= 0 then do;
if key not in t then t(_n_)=key;
output;
end;
if last then do;
do _n_=1 to dim(t);
if Pr_Rank.check(key:t(_n_))=0 then Pr_Rank.remove(key:t(_n_));
end;
rc = iter.first();
do while (rc = 0);
call missing(risk_rating);
output;
rc = iter.next();
end;
end;
end;
drop rc t:;
run;
Error Log:
25 data c2 ; 26 if _n_ = 1 then do ; 27 if 0 then set Rank_Base ; 28 dcl hash Pr_Rank (dataset: "Rank_Base") ; 29 Pr_Rank.definekey("party_number"); 30 Pr_Rank.definedata(all:'Y'); 31 Pr_Rank.definedone () ; 32 dcl hiter iter('Pr_Rank'); 33 end ; 34 do _n_=1 by 1 until(last); 35 set Trans_Prod end=last; 36 array t(100) ; /*this is arbitrary subscript number*/ 37 if Pr_Rank.find()= 0 then do; 38 if key not in t then t(_n_)=key; 39 output; 40 end; 41 if last then do; 42 do _n_=1 to dim(t); 43 if Pr_Rank.check(key:t(_n_))=0 then Pr_Rank.remove(key:t(_n_)); 44 end; 45 rc = iter.first(); 46 do while (rc = 0); 47 call missing(risk_rating); 48 output; 49 rc = iter.next(); 50 end; 51 end; 52 end; 53 drop rc t:; 54 run; NOTE: Variable key is uninitialized. 2 The SAS System 10:56 Tuesday, September 1, 2020 NOTE: There were 3990011 observations read from the data set RANK_BASE. ERROR: Type mismatch for key variable party_number at line 43 column 5. ERROR: Keys and data must be specified in the same order and have the same types as given in DefineKey/DefineData at line 43 column 5. ERROR: DATA STEP Component Object failure. Aborted during the EXECUTION phase. NOTE: The SAS System stopped processing this step because of errors. NOTE: There were 145185092 observations read from the data set TRANS_PROD. WARNING: The data set C2 may be incomplete. When this step was stopped there were 105030825 observations and 26 variables. WARNING: Data set C2 was not replaced because this step was stopped. NOTE: DATA statement used (Total process time): real time 5:11.08 cpu time 3:34.62
The SQL equivalent query for the above:
select distinct a.*, b.risk_rating
from Trans_Prod a left join Rank_Base b
on strip(a.party_number)=strip(b.party_number)
I am just trying to map the Risk_rating column, from Rank_Base table with hash objects, but I want the distinct result get fetched in the output with all columns from left table and risk_rating column from right table. Also I need 145185092 observations in the output, but the hashing is deleting some records I believe.
I don't want to sort as the Trans_Prod is having 140 Million data & Risk_Base is 39 Million data.
Please help.
Thanks
Rajdeep
Please post your question in a new thread, do not hijack other's.
Post the whole log of the step, with line numbers.
Supply example data in data steps with datalines.
Looks like your variables are of different type in the datasets.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.