BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
GPatel
Pyrite | Level 9


I have three data sets:

(1) S1 (n=441000 Unique records KEY (= ID) with 22 columns : ID and A1:A21)
(2) S2 (n=1,467,661,903 records, with duplicate Key (= ID) records with 5 columns : ID, B, C, D and Action)
(3) S3 (n=905 Unique records with 3 columns : X, Action and Action_Desc)

 

SQL query:
==========

Proc SQL; Select S1.*,S3.Action_Desc from
S1 Left join S2 on s1.id=s2.id
left join s2.action=s3.action;
quit;

 

HASH CODE:
==========

DATA HASH_ACTION;
IF 0 THEN SET s1;
if _N_ = 1 then do;
declare hash HASH_NAME(dataset: "s1", multidata: 'y');
HASH_NAME.defineKEY("ID");
HASH_NAME.defineData (ALL:'YES');
HASH_NAME.defineDone();
END;
set s2(keep=ID B C D ACTIOn);
IF HASH_NAME.FIND(KEY:ID) = 0 THEN OUTPUT;
RUN;


Question:
(1) I got duplicate records HASH_ACTION dataset, how can I get only unique records.
(2) How can I embed : left join s2.action=s3.action into HASH Code.

 

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

So you need to prepare S2 first. Sort by id and action_date_in, and then in a data step keep only the last observation per id. Once you have 1:1 relationships, you'll get no further duplicates.

View solution in original post

9 REPLIES 9
Patrick
Opal | Level 21

If I understand your SQL code right then you want actually nothing else than adding column Action_Desc to your S1 table.

 

If so:

- You could load S2 into the hash (with default only loading first occurrence if duplicate keys but not throwing an error if duplicates). 

   Assumes that the number of distinct id's is not too high and that there is only one action per id

- Also load S3 into a hash

 

Something along the line of below (untested) code could work:

 

data want;
  if _n_=1 then
    do;
      if 0 then set S3(keep=action Action_Desc);
      dcl hash hS2(dataset:'s2(keep=id action)');
      hS2.defineKey('id');
      hS2.defineData('action');
      hS2.defineDone();
      dcl hash hS3(dataset:'S3(keep=action Action_Desc)');
      hS3.defineKey('action');
      hS3.defineData('Action_Desc');
      hS3.defineDone();
    end;
  call missing (of _all_);
  set s1;
  if hS2.find()=0 then hS3.find();
run;

 

GPatel
Pyrite | Level 9

Thanks Patric, for your reply to my post. 

Kurt_Bremser
Super User

In your dataset S2, do you have more than one distinct action per ID? If yes, you will need an iteration to get the same result you got in SQL.

GPatel
Pyrite | Level 9

Thanks for your response.

 

In my dataset S2, I do have more than one distinct action per ID. 

 

I got duplicate records upon my Hash run.

 

I learned that, I need to eliminate duplicate records, hence,  I need to group by ID and take maximum date for ACTION_DATE_IN  and ACTION_DATE_OUT. 

Kurt_Bremser
Super User

So you need to prepare S2 first. Sort by id and action_date_in, and then in a data step keep only the last observation per id. Once you have 1:1 relationships, you'll get no further duplicates.

GPatel
Pyrite | Level 9

Thanks Patric and KurtBremser, for your reply to my post. I am accepting both response as accepted solutions.

rajdeep
Pyrite | Level 9

Hi  KurtBremser,

I am actually encountering an error with below code for Left join with Hash objects.

I have total 2 columns in Rank_Base & 25 columns in Trans_Prod, both datasets are unsorted. Rank_Base  is unique, but Trans_Prod is having the duplicate records of same party_number.

 

data c2 ;
if _n_ = 1 then do ;
if 0 then set Rank_Base ;
dcl hash Pr_Rank (dataset: "Rank_Base",multidata: "y") ;
Pr_Rank.definekey("party_number");
Pr_Rank.definedata(all:'Y');
Pr_Rank.definedone () ;
dcl hiter iter('Pr_Rank');
end ;
do _n_=1 by 1 until(last);
set Trans_Prod end=last;
array t(100) ; 
if Pr_Rank.find()= 0 then do;
if key not in t then t(_n_)=key;
output;
end;
if last then do;
do _n_=1 to dim(t);
if Pr_Rank.check(key:t(_n_))=0 then Pr_Rank.remove(key:t(_n_));
end;
rc = iter.first();
do while (rc = 0);
call missing(risk_rating);
output;
rc = iter.next();
end;
end;
end;
drop rc t:;
run;

 

Error Log:

25 data c2 ;
26 if _n_ = 1 then do ;
27 if 0 then set Rank_Base ;
28 dcl hash Pr_Rank (dataset: "Rank_Base") ;
29 Pr_Rank.definekey("party_number");
30 Pr_Rank.definedata(all:'Y');
31 Pr_Rank.definedone () ;
32 dcl hiter iter('Pr_Rank');
33 end ;
34 do _n_=1 by 1 until(last);
35 set Trans_Prod end=last;
36 array t(100) ; /*this is arbitrary subscript number*/
37 if Pr_Rank.find()= 0 then do;
38 if key not in t then t(_n_)=key;
39 output;
40 end;
41 if last then do;
42 do _n_=1 to dim(t);
43 if Pr_Rank.check(key:t(_n_))=0 then Pr_Rank.remove(key:t(_n_));
44 end;
45 rc = iter.first();
46 do while (rc = 0);
47 call missing(risk_rating);
48 output;
49 rc = iter.next();
50 end;
51 end;
52 end;
53 drop rc t:;
54 run;

NOTE: Variable key is uninitialized.
2 The SAS System 10:56 Tuesday, September 1, 2020

NOTE: There were 3990011 observations read from the data set RANK_BASE.
ERROR: Type mismatch for key variable party_number at line 43 column 5.
ERROR: Keys and data must be specified in the same order and have the same types as given in DefineKey/DefineData at line 43 column 
5.
ERROR: DATA STEP Component Object failure. Aborted during the EXECUTION phase.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 145185092 observations read from the data set TRANS_PROD.
WARNING: The data set C2 may be incomplete. When this step was stopped there were 105030825 observations and 26 variables.
WARNING: Data set C2 was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 5:11.08
cpu time 3:34.62

 

The SQL equivalent query for the above:

select distinct a.*, b.risk_rating
from Trans_Prod a left join Rank_Base b
on strip(a.party_number)=strip(b.party_number)

 

I am just trying to map the Risk_rating column, from Rank_Base table with hash objects, but I want the distinct result get fetched in the output with all columns from left table and risk_rating column from right table. Also I need 145185092 observations in the output, but the hashing is deleting some records I believe.

I don't want to sort as the Trans_Prod is having 140 Million data & Risk_Base is 39 Million data.

 

Please help.

 

Thanks

Rajdeep

Kurt_Bremser
Super User

Please post your question in a new thread, do not hijack other's.

Post the whole log of the step, with line numbers.

Supply example data in data steps with datalines.

 

Looks like your variables are of different type in the datasets.

rajdeep
Pyrite | Level 9
I think I was missing the call missing statement in my code.

data want;
set Trans_Prod;
if _n_ = 1 then do ;
if 0 then set Rank_Base ;
dcl hash Pr_Rank (dataset: "Rank_Base") ;
Pr_Rank.definekey("party_number");
Pr_Rank.definedata(all:'Y');
Pr_Rank.definedone () ;
end;
if Pr_Rank.find() ne 0 then call missing(RISK_RATING);
run;

This code is working as per the expectation like left join.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1383 views
  • 0 likes
  • 4 in conversation