Solved: Re: Hash join with compress(upcase) function

mariopellegrini · Posted 04-08-2016 01:52 PM

I would like to do a join using sas hash object, and get the same result as 4 lines of proc sql follows:

data dataset_A;
input key:$4.;
datalines;
AAA
BBB
CCC
DDD
run;
data dataset_B;
input key:$4.;
infile datalines dlm = ",";
datalines;
AAa
B BB
CcC
DDD
run;
proc sql;
create table sqljoin_dataset as
select A.*
from dataset_A A
join dataset_B B 
on compress(upcase(A.key)) = compress(upcase(B.key));
quit;

As in the clause "ON" are present the functions "compress" and “upcase", the following hash procedure does not allow to obtain the same result (1 row rather than 4):

data  hashjoin_dataset ;
set dataset_A;key=compress(upcase(key));
if  _N_ = 1 then do;
	declare hash T(dataset: 'dataset_B');
		    T.definekey(compress(upcase('key')));
		    T.definedata (all:'yes');
		    T.definedone();
			 key=compress(upcase(key));
	end;
	key=compress(upcase(key));
if T.find() = 0; /* join */
drop rc;
rc = t.find(); 
run;

Does anyone know if it is possible, through a single data step, to obtain with hash object the same result of proc sql? Thank you

FriedEgg · Posted 04-14-2016 10:53 AM

data dataset_A;
input key:$4.;
datalines;
AAA
BBB
CCC
DDD
;
run;
data dataset_B;
input key:$4. val1;
infile datalines dlm = ",";
datalines;
AAa,10
B BB,20
CcC,30
;
run;

proc ds2;

data hashjoin_dataset(overwrite=yes);
dcl double val1;
dcl package hash h ([key],[key,val1],0,'{select upcase(compress(key)) as key,val1 from dataset_b}');
method run();
set dataset_A;
key = upcase(compress(key));
if h.find()=0 then output;
end;
enddata;
run;

quit;

View solution in original post

LinusH · Posted 04-08-2016 02:13 PM

May I ask what requirement drives this inquiry?

Just for learning and fun?

Since I can't see any other reason to switch from the simplicity of SQL unless you have severe performance problems.

Data never sleeps

mariopellegrini · Posted 04-08-2016 05:40 PM

The reason is of course the improvement in processing time on large datasets

FriedEgg · Posted 04-08-2016 06:01 PM

* modify key for hash below with view;

data v_dataset_B/view=v_dataset_B;
  set dataset_B;
  key=compress(upcase(key));
run;

data  hashjoin_dataset ;
if 0 then set v_dataset_B;
if _N_ = 1 then do; 
  declare hash T(dataset: 'v_dataset_B'); 
    T.definekey('key'); 
    T.definedata (all:'yes'); 
    T.definedone(); 
end; 
set dataset_A;
key=compress(upcase(key));
if T.find() = 0; /* join */
drop rc;
rc = t.find(); 
run;

jklaverstijn · Posted 04-08-2016 05:58 PM

Assuming dataset_B is the smaller I would preprocess it and replace key with conpress(uocase(key)). This way it gets the proper value in the has table without issue.

Then find() will find it as you do the same with key from dataset_A.

data dataset_b2;
   set dataset_b;
   key=compress(upcase(key));
run;
data  hashjoin_dataset ;
set dataset_A;key=compress(upcase(key));
if  _N_ = 1 then do;
	declare hash T(dataset: 'dataset_B');
		    T.definekey(compress(upcase('key')));
		    T.definedata (all:'yes');
		    T.definedone();
		/*	 key=compress(upcase(key)); */
	end;
	key=compress(upcase(key));
if T.find() = 0; /* join */
drop rc;
rc = t.find(); 
run;

The compress(upcase()) doesn't do anything when loading the hash table so you can take that out.

Hope this helps,

- Jan.

** edited for completenesss of code **

mariopellegrini · Posted 04-09-2016 03:13 AM

Thanks, but I have the need to use only a data step

jklaverstijn · Posted 04-09-2016 04:22 AM

@mario_pellegrini wrote:

Thanks, but I have the need to use only a data step

I am curious as to why that need exists. Can you explain? People are coding for you and this way you present a moving target.

Regards Jan

Ksharp · Posted 04-08-2016 11:57 PM

Why do you have to use a single data step? Does FriedEgg's code make some sense ?


data dataset_A;
input key:$4.;
datalines;
AAA
BBB
CCC
DDD
EEe
;
run;
data dataset_B;
input key:$4.;
infile datalines dlm = ",";
datalines;
AAa
B BB
CcC
DDD
;
run;
data want;
 if _n_=1 then do;
  if 0 then set dataset_B;
  declare hash h();
  h.definekey('key');
  h.definedone();
  
  do until(last);
   set dataset_B end=last;
   key=compress(upcase(key));
   h.replace();
  end;
 
 end;
set dataset_B;
key=compress(upcase(key));
if h.check()=0;
run;

mario_pellegrin · Posted 04-09-2016 03:03 AM

But it is never used dataset A

jklaverstijn · Posted 04-09-2016 04:18 AM

@mario_pellegrin wrote:

But it is never used dataset A

That's undoubtedly a typo.

LinusH · Posted 04-09-2016 05:54 AM

In my experience, things seldom "has" to be in a certain way. Most resolutions are compromises. If you are worried about disk rewrite in the extra data step, just define it as a view. Should pose very little overhead if you got memory enough.

If you worry about the performance of your database, you should concern more about how you store your data. Like having join keys that never needs a function call to work. That should have been taken care of during data load.

Data never sleeps

jklaverstijn · Posted 04-09-2016 06:43 AM

@LinusH wrote:
In my experience, things seldom "has" to be in a certain way. Most resolutions are compromises. If you are worried about disk rewrite in the extra data step, just define it as a view. Should pose very little overhead if you got memory enough.

If you worry about the performance of your database, you should concern more about how you store your data. Like having join keys that never needs a function call to work. That should have been taken care of during data load.

I totally agree @LinusH. But we can only guess. It is utlimately up to @mariopellegrini to explain.

- Jan.

mariopellegrini · Posted 04-14-2016 09:35 AM

I think they are two different problems:
1) performance of your database
2) how you store your data
in this thread I'm looking at the first point

FriedEgg · Posted 04-11-2016 04:13 PM

data dataset_A;
input key:$4.;
datalines;
AAA
BBB
CCC
DDD
;
run;
data dataset_B;
input key:$4.;
infile datalines dlm = ",";
datalines;
AAa
B BB
CcC
DDD
;
run;

proc ds2;

data hashjoin_dataset(overwrite=yes);
dcl package hash h ([key],[key],0,'{select upcase(compress(key)) as key from dataset_b}');
method run();
set dataset_A;
key = upcase(compress(key));
if h.find()=0 then output;
end;
enddata;
run;

quit;

mariopellegrini · Posted 04-14-2016 10:45 AM

Thank you for this excellent solution. A deepening wonder: if I wanted to add a field to join?

I would get this result:

data dataset_A;
input key:$4.;
datalines;
AAA
BBB
CCC
DDD
run;
data dataset_B;
input key:$4. val1;
infile datalines dlm = ",";
datalines;
AAa,10
B BB,20
CcC,30
run;
proc sql;
create table sqljoin_dataset as
select A.*, B.val1
from dataset_A A
join dataset_B B 
on compress(upcase(A.key)) = compress(upcase(B.key));
quit;

SAS Innovate 2025: Call for Content

Classroom Training Available!