I would like to do a join using sas hash object, and get the same result as 4 lines of proc sql follows:
data dataset_A;
input key:$4.;
datalines;
AAA
BBB
CCC
DDD
run;
data dataset_B;
input key:$4.;
infile datalines dlm = ",";
datalines;
AAa
B BB
CcC
DDD
run;
proc sql;
create table sqljoin_dataset as
select A.*
from dataset_A A
join dataset_B B
on compress(upcase(A.key)) = compress(upcase(B.key));
quit;
As in the clause "ON" are present the functions "compress" and “upcase", the following hash procedure does not allow to obtain the same result (1 row rather than 4):
data hashjoin_dataset ;
set dataset_A;key=compress(upcase(key));
if _N_ = 1 then do;
declare hash T(dataset: 'dataset_B');
T.definekey(compress(upcase('key')));
T.definedata (all:'yes');
T.definedone();
key=compress(upcase(key));
end;
key=compress(upcase(key));
if T.find() = 0; /* join */
drop rc;
rc = t.find();
run;
Does anyone know if it is possible, through a single data step, to obtain with hash object the same result of proc sql? Thank you
data dataset_A;
input key:$4.;
datalines;
AAA
BBB
CCC
DDD
;
run;
data dataset_B;
input key:$4. val1;
infile datalines dlm = ",";
datalines;
AAa,10
B BB,20
CcC,30
;
run;
proc ds2;
data hashjoin_dataset(overwrite=yes);
dcl double val1;
dcl package hash h ([key],[key,val1],0,'{select upcase(compress(key)) as key,val1 from dataset_b}');
method run();
set dataset_A;
key = upcase(compress(key));
if h.find()=0 then output;
end;
enddata;
run;
quit;
May I ask what requirement drives this inquiry?
Just for learning and fun?
Since I can't see any other reason to switch from the simplicity of SQL unless you have severe performance problems.
* modify key for hash below with view;
data v_dataset_B/view=v_dataset_B;
set dataset_B;
key=compress(upcase(key));
run;
data hashjoin_dataset ;
if 0 then set v_dataset_B;
if _N_ = 1 then do;
declare hash T(dataset: 'v_dataset_B');
T.definekey('key');
T.definedata (all:'yes');
T.definedone();
end;
set dataset_A;
key=compress(upcase(key));
if T.find() = 0; /* join */
drop rc;
rc = t.find();
run;
Assuming dataset_B is the smaller I would preprocess it and replace key with conpress(uocase(key)). This way it gets the proper value in the has table without issue.
Then find() will find it as you do the same with key from dataset_A.
data dataset_b2;
set dataset_b;
key=compress(upcase(key));
run;
data hashjoin_dataset ;
set dataset_A;key=compress(upcase(key));
if _N_ = 1 then do;
declare hash T(dataset: 'dataset_B');
T.definekey(compress(upcase('key')));
T.definedata (all:'yes');
T.definedone();
/* key=compress(upcase(key)); */
end;
key=compress(upcase(key));
if T.find() = 0; /* join */
drop rc;
rc = t.find();
run;
The compress(upcase()) doesn't do anything when loading the hash table so you can take that out.
Hope this helps,
- Jan.
** edited for completenesss of code **
Thanks, but I have the need to use only a data step
@mario_pellegrini wrote:
Thanks, but I have the need to use only a data step
I am curious as to why that need exists. Can you explain? People are coding for you and this way you present a moving target.
Regards Jan
Why do you have to use a single data step? Does FriedEgg's code make some sense ? data dataset_A; input key:$4.; datalines; AAA BBB CCC DDD EEe ; run; data dataset_B; input key:$4.; infile datalines dlm = ","; datalines; AAa B BB CcC DDD ; run; data want; if _n_=1 then do; if 0 then set dataset_B; declare hash h(); h.definekey('key'); h.definedone(); do until(last); set dataset_B end=last; key=compress(upcase(key)); h.replace(); end; end; set dataset_B; key=compress(upcase(key)); if h.check()=0; run;
But it is never used dataset A
@mario_pellegrin wrote:
But it is never used dataset A
That's undoubtedly a typo.
@LinusH wrote:
In my experience, things seldom "has" to be in a certain way. Most resolutions are compromises. If you are worried about disk rewrite in the extra data step, just define it as a view. Should pose very little overhead if you got memory enough.
If you worry about the performance of your database, you should concern more about how you store your data. Like having join keys that never needs a function call to work. That should have been taken care of during data load.
I totally agree @LinusH. But we can only guess. It is utlimately up to @mariopellegrini to explain.
- Jan.
I think they are two different problems:
1) performance of your database
2) how you store your data
in this thread I'm looking at the first point
data dataset_A;
input key:$4.;
datalines;
AAA
BBB
CCC
DDD
;
run;
data dataset_B;
input key:$4.;
infile datalines dlm = ",";
datalines;
AAa
B BB
CcC
DDD
;
run;
proc ds2;
data hashjoin_dataset(overwrite=yes);
dcl package hash h ([key],[key],0,'{select upcase(compress(key)) as key from dataset_b}');
method run();
set dataset_A;
key = upcase(compress(key));
if h.find()=0 then output;
end;
enddata;
run;
quit;
Thank you for this excellent solution. A deepening wonder: if I wanted to add a field to join?
I would get this result:
data dataset_A;
input key:$4.;
datalines;
AAA
BBB
CCC
DDD
run;
data dataset_B;
input key:$4. val1;
infile datalines dlm = ",";
datalines;
AAa,10
B BB,20
CcC,30
run;
proc sql;
create table sqljoin_dataset as
select A.*, B.val1
from dataset_A A
join dataset_B B
on compress(upcase(A.key)) = compress(upcase(B.key));
quit;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.