BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.

I would like to do a join using sas hash object, and get the same result as 4 lines of proc sql follows:

data dataset_A;
input key:$4.;
datalines;
AAA
BBB
CCC
DDD
run;
data dataset_B;
input key:$4.;
infile datalines dlm = ",";
datalines;
AAa
B BB
CcC
DDD
run;
proc sql;
create table sqljoin_dataset as
select A.*
from dataset_A A
join dataset_B B 
on compress(upcase(A.key)) = compress(upcase(B.key));
quit;

As in the clause "ON" are present the functions "compress" and “upcase", the following hash procedure does not allow to obtain the same result (1 row rather than 4):

 

data  hashjoin_dataset ;
set dataset_A;key=compress(upcase(key));
if  _N_ = 1 then do;
	declare hash T(dataset: 'dataset_B');
		    T.definekey(compress(upcase('key')));
		    T.definedata (all:'yes');
		    T.definedone();
			 key=compress(upcase(key));
	end;
	key=compress(upcase(key));
if T.find() = 0; /* join */
drop rc;
rc = t.find(); 
run;

Does anyone know if it is possible, through a single data step, to obtain with hash object the same result of proc sql? Thank you

1 ACCEPTED SOLUTION

Accepted Solutions
FriedEgg
SAS Employee
data dataset_A;
input key:$4.;
datalines;
AAA
BBB
CCC
DDD
;
run;
data dataset_B;
input key:$4. val1;
infile datalines dlm = ",";
datalines;
AAa,10
B BB,20
CcC,30
;
run;

proc ds2;

data hashjoin_dataset(overwrite=yes);
dcl double val1;
dcl package hash h ([key],[key,val1],0,'{select upcase(compress(key)) as key,val1 from dataset_b}');
method run();
set dataset_A;
key = upcase(compress(key));
if h.find()=0 then output;
end;
enddata;
run;

quit;

View solution in original post

15 REPLIES 15
LinusH
Tourmaline | Level 20

May I ask what requirement drives this inquiry?

Just for learning and fun? Smiley Very Happy

 

Since I can't see any other reason to switch from the simplicity of SQL unless you have severe performance problems.

Data never sleeps
mariopellegrini
Quartz | Level 8
The reason is of course the improvement in processing time on large datasets
FriedEgg
SAS Employee
* modify key for hash below with view;

data v_dataset_B/view=v_dataset_B;
  set dataset_B;
  key=compress(upcase(key));
run;

data  hashjoin_dataset ;
if 0 then set v_dataset_B;
if _N_ = 1 then do; 
  declare hash T(dataset: 'v_dataset_B'); 
    T.definekey('key'); 
    T.definedata (all:'yes'); 
    T.definedone(); 
end; 
set dataset_A;
key=compress(upcase(key));
if T.find() = 0; /* join */
drop rc;
rc = t.find(); 
run;
jklaverstijn
Rhodochrosite | Level 12

Assuming dataset_B is the smaller I would preprocess it and replace key with conpress(uocase(key)). This way it gets the proper value in the has table without issue.

 

Then find() will find it as you do the same with key from dataset_A.

data dataset_b2;
set dataset_b;
key=compress(upcase(key));
run;
data hashjoin_dataset ; set dataset_A;key=compress(upcase(key)); if _N_ = 1 then do; declare hash T(dataset: 'dataset_B'); T.definekey(compress(upcase('key'))); T.definedata (all:'yes'); T.definedone(); /* key=compress(upcase(key)); */ end; key=compress(upcase(key)); if T.find() = 0; /* join */ drop rc; rc = t.find(); run;

The compress(upcase()) doesn't do anything when loading the hash table so you can take that out.

 

Hope this helps,

- Jan.

 

** edited for completenesss of code **

 

mariopellegrini
Quartz | Level 8

Thanks, but I have the need to use only a data step

jklaverstijn
Rhodochrosite | Level 12

@mario_pellegrini wrote:

Thanks, but I have the need to use only a data step


I am curious as to why that need exists. Can you explain? People are coding for you and this way you present a moving target.

 

Regards Jan

Ksharp
Super User
Why do you have to use a single data step? Does FriedEgg's code make some sense ?


data dataset_A;
input key:$4.;
datalines;
AAA
BBB
CCC
DDD
EEe
;
run;
data dataset_B;
input key:$4.;
infile datalines dlm = ",";
datalines;
AAa
B BB
CcC
DDD
;
run;
data want;
 if _n_=1 then do;
  if 0 then set dataset_B;
  declare hash h();
  h.definekey('key');
  h.definedone();
  
  do until(last);
   set dataset_B end=last;
   key=compress(upcase(key));
   h.replace();
  end;
 
 end;
set dataset_B;
key=compress(upcase(key));
if h.check()=0;
run;


mario_pellegrin
Calcite | Level 5

But it is never used dataset A

jklaverstijn
Rhodochrosite | Level 12

@mario_pellegrin wrote:

But it is never used dataset A


That's undoubtedly a typo.

LinusH
Tourmaline | Level 20
In my experience, things seldom "has" to be in a certain way. Most resolutions are compromises. If you are worried about disk rewrite in the extra data step, just define it as a view. Should pose very little overhead if you got memory enough.

If you worry about the performance of your database, you should concern more about how you store your data. Like having join keys that never needs a function call to work. That should have been taken care of during data load.
Data never sleeps
jklaverstijn
Rhodochrosite | Level 12

@LinusH wrote:
In my experience, things seldom "has" to be in a certain way. Most resolutions are compromises. If you are worried about disk rewrite in the extra data step, just define it as a view. Should pose very little overhead if you got memory enough.

If you worry about the performance of your database, you should concern more about how you store your data. Like having join keys that never needs a function call to work. That should have been taken care of during data load.

I totally agree @LinusH. But we can only guess. It is utlimately up to @mariopellegrini to explain.

 

- Jan.

mariopellegrini
Quartz | Level 8

I think they are two different problems:
1) performance of your database
2) how you store your data
in this thread I'm looking at the first point

FriedEgg
SAS Employee
data dataset_A;
input key:$4.;
datalines;
AAA
BBB
CCC
DDD
;
run;
data dataset_B;
input key:$4.;
infile datalines dlm = ",";
datalines;
AAa
B BB
CcC
DDD
;
run;

proc ds2;

data hashjoin_dataset(overwrite=yes);
dcl package hash h ([key],[key],0,'{select upcase(compress(key)) as key from dataset_b}');
method run();
set dataset_A;
key = upcase(compress(key));
if h.find()=0 then output;
end;
enddata;
run;

quit;
mariopellegrini
Quartz | Level 8

Thank you for this excellent solution. A deepening wonder: if I wanted to add a field to join?

I would get this result:

data dataset_A;
input key:$4.;
datalines;
AAA
BBB
CCC
DDD
run;
data dataset_B;
input key:$4. val1;
infile datalines dlm = ",";
datalines;
AAa,10
B BB,20
CcC,30
run;
proc sql;
create table sqljoin_dataset as
select A.*, B.val1
from dataset_A A
join dataset_B B 
on compress(upcase(A.key)) = compress(upcase(B.key));
quit;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 15 replies
  • 2859 views
  • 2 likes
  • 6 in conversation