DATA Step, Macro, Functions and more

Hash join with compress(upcase) function

Accepted Solution Solved
Reply
Contributor
Posts: 37
Accepted Solution

Hash join with compress(upcase) function

I would like to do a join using sas hash object, and get the same result as 4 lines of proc sql follows:

data dataset_A;
input key:$4.;
datalines;
AAA
BBB
CCC
DDD
run;
data dataset_B;
input key:$4.;
infile datalines dlm = ",";
datalines;
AAa
B BB
CcC
DDD
run;
proc sql;
create table sqljoin_dataset as
select A.*
from dataset_A A
join dataset_B B 
on compress(upcase(A.key)) = compress(upcase(B.key));
quit;

As in the clause "ON" are present the functions "compress" and “upcase", the following hash procedure does not allow to obtain the same result (1 row rather than 4):

 

data  hashjoin_dataset ;
set dataset_A;key=compress(upcase(key));
if  _N_ = 1 then do;
	declare hash T(dataset: 'dataset_B');
		    T.definekey(compress(upcase('key')));
		    T.definedata (all:'yes');
		    T.definedone();
			 key=compress(upcase(key));
	end;
	key=compress(upcase(key));
if T.find() = 0; /* join */
drop rc;
rc = t.find(); 
run;

Does anyone know if it is possible, through a single data step, to obtain with hash object the same result of proc sql? Thank you


Accepted Solutions
Solution
‎04-14-2016 12:00 PM
Trusted Advisor
Posts: 1,301

Re: Hash join with compress(upcase) function

Posted in reply to mario_pellegrini
data dataset_A;
input key:$4.;
datalines;
AAA
BBB
CCC
DDD
;
run;
data dataset_B;
input key:$4. val1;
infile datalines dlm = ",";
datalines;
AAa,10
B BB,20
CcC,30
;
run;

proc ds2;

data hashjoin_dataset(overwrite=yes);
dcl double val1;
dcl package hash h ([key],[key,val1],0,'{select upcase(compress(key)) as key,val1 from dataset_b}');
method run();
set dataset_A;
key = upcase(compress(key));
if h.find()=0 then output;
end;
enddata;
run;

quit;

View solution in original post


All Replies
Super User
Posts: 5,434

Re: Hash join with compress(upcase) function

Posted in reply to mario_pellegrini

May I ask what requirement drives this inquiry?

Just for learning and fun? Smiley Very Happy

 

Since I can't see any other reason to switch from the simplicity of SQL unless you have severe performance problems.

Data never sleeps
Contributor
Posts: 37

Re: Hash join with compress(upcase) function

Posted in reply to mario_pellegrini
The reason is of course the improvement in processing time on large datasets
Trusted Advisor
Posts: 1,301

Re: Hash join with compress(upcase) function

[ Edited ]
Posted in reply to mario_pellegrini
* modify key for hash below with view;

data v_dataset_B/view=v_dataset_B;
  set dataset_B;
  key=compress(upcase(key));
run;

data  hashjoin_dataset ;
if 0 then set v_dataset_B;
if _N_ = 1 then do; 
  declare hash T(dataset: 'v_dataset_B'); 
    T.definekey('key'); 
    T.definedata (all:'yes'); 
    T.definedone(); 
end; 
set dataset_A;
key=compress(upcase(key));
if T.find() = 0; /* join */
drop rc;
rc = t.find(); 
run;
Super Contributor
Posts: 441

Re: Hash join with compress(upcase) function

[ Edited ]
Posted in reply to mario_pellegrini

Assuming dataset_B is the smaller I would preprocess it and replace key with conpress(uocase(key)). This way it gets the proper value in the has table without issue.

 

Then find() will find it as you do the same with key from dataset_A.

data dataset_b2;
set dataset_b;
key=compress(upcase(key));
run;
data hashjoin_dataset ; set dataset_A;key=compress(upcase(key)); if _N_ = 1 then do; declare hash T(dataset: 'dataset_B'); T.definekey(compress(upcase('key'))); T.definedata (all:'yes'); T.definedone(); /* key=compress(upcase(key)); */ end; key=compress(upcase(key)); if T.find() = 0; /* join */ drop rc; rc = t.find(); run;

The compress(upcase()) doesn't do anything when loading the hash table so you can take that out.

 

Hope this helps,

- Jan.

 

** edited for completenesss of code **

 

Contributor
Posts: 37

Re: Hash join with compress(upcase) function

Posted in reply to jklaverstijn

Thanks, but I have the need to use only a data step

Super Contributor
Posts: 441

Re: Hash join with compress(upcase) function

Posted in reply to mario_pellegrini

mario_pellegrini wrote:

Thanks, but I have the need to use only a data step


I am curious as to why that need exists. Can you explain? People are coding for you and this way you present a moving target.

 

Regards Jan

Super User
Posts: 10,041

Re: Hash join with compress(upcase) function

Posted in reply to mario_pellegrini
Why do you have to use a single data step? Does FriedEgg's code make some sense ?


data dataset_A;
input key:$4.;
datalines;
AAA
BBB
CCC
DDD
EEe
;
run;
data dataset_B;
input key:$4.;
infile datalines dlm = ",";
datalines;
AAa
B BB
CcC
DDD
;
run;
data want;
 if _n_=1 then do;
  if 0 then set dataset_B;
  declare hash h();
  h.definekey('key');
  h.definedone();
  
  do until(last);
   set dataset_B end=last;
   key=compress(upcase(key));
   h.replace();
  end;
 
 end;
set dataset_B;
key=compress(upcase(key));
if h.check()=0;
run;


Senior User
Posts: 1

Re: Hash join with compress(upcase) function

But it is never used dataset A

Super Contributor
Posts: 441

Re: Hash join with compress(upcase) function

[ Edited ]
Posted in reply to mario_pellegrin

mario_pellegrin wrote:

But it is never used dataset A


That's undoubtedly a typo.

Super User
Posts: 5,434

Re: Hash join with compress(upcase) function

Posted in reply to mario_pellegrini
In my experience, things seldom "has" to be in a certain way. Most resolutions are compromises. If you are worried about disk rewrite in the extra data step, just define it as a view. Should pose very little overhead if you got memory enough.

If you worry about the performance of your database, you should concern more about how you store your data. Like having join keys that never needs a function call to work. That should have been taken care of during data load.
Data never sleeps
Super Contributor
Posts: 441

Re: Hash join with compress(upcase) function


LinusH wrote:
In my experience, things seldom "has" to be in a certain way. Most resolutions are compromises. If you are worried about disk rewrite in the extra data step, just define it as a view. Should pose very little overhead if you got memory enough.

If you worry about the performance of your database, you should concern more about how you store your data. Like having join keys that never needs a function call to work. That should have been taken care of during data load.

I totally agree @LinusH. But we can only guess. It is utlimately up to @mario_pellegrini to explain.

 

- Jan.

Contributor
Posts: 37

Re: Hash join with compress(upcase) function

Posted in reply to jklaverstijn

I think they are two different problems:
1) performance of your database
2) how you store your data
in this thread I'm looking at the first point

Trusted Advisor
Posts: 1,301

Re: Hash join with compress(upcase) function

Posted in reply to mario_pellegrini
data dataset_A;
input key:$4.;
datalines;
AAA
BBB
CCC
DDD
;
run;
data dataset_B;
input key:$4.;
infile datalines dlm = ",";
datalines;
AAa
B BB
CcC
DDD
;
run;

proc ds2;

data hashjoin_dataset(overwrite=yes);
dcl package hash h ([key],[key],0,'{select upcase(compress(key)) as key from dataset_b}');
method run();
set dataset_A;
key = upcase(compress(key));
if h.find()=0 then output;
end;
enddata;
run;

quit;
Contributor
Posts: 37

Re: Hash join with compress(upcase) function

Thank you for this excellent solution. A deepening wonder: if I wanted to add a field to join?

I would get this result:

data dataset_A;
input key:$4.;
datalines;
AAA
BBB
CCC
DDD
run;
data dataset_B;
input key:$4. val1;
infile datalines dlm = ",";
datalines;
AAa,10
B BB,20
CcC,30
run;
proc sql;
create table sqljoin_dataset as
select A.*, B.val1
from dataset_A A
join dataset_B B 
on compress(upcase(A.key)) = compress(upcase(B.key));
quit;
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 15 replies
  • 523 views
  • 2 likes
  • 6 in conversation