I am trying to do the following:
1) Get a list of keys from the base data
2) Get the info from the lookup table for these keys
3) Merge the hash table w/info onto the base table
To do this, I am trying to implement a hash in place (I think this is what it is called). The problem I am running into is that this doesn't work quickly if the lookup table is very large. In this example, imagine if "have_lookup_table" was 5M records long. I'm setting that entire table just to get the lookup info for 4 records, even if those four records are the first four records of the 5M record dataset. Is there a way to get the desired result without having to set the entire table?
data have_base_table; infile datalines delimiter=','; input Item:$8. Trans_Date:yymmdd10. Code :$4.; format Trans_Date yymmdd10.; datalines; Apple, 2020-07-29, 4036 Orange, 2020-07-29, 4031 Carrot, 2020-07-21, 4110 Banana, 2020-07-31, 5191 ; data have_lookup_table; infile datalines delimiter=','; input Item $ Price 8.; datalines; Apple, 3.00 Orange, 1.00 Carrot, 0.50 Banana, 0.75 ; data want; infile datalines delimiter=','; input Item:$8. Trans_Date:yymmdd10. Code :$4. Price:8.; format Trans_Date yymmdd10.; datalines; Apple, 2020-07-29, 4036, 3.00 Orange, 2020-07-29, 4031, 1.00 Carrot, 2020-07-21, 4110, 0.50 Banana, 2020-07-31, 5191, 0.75 ; run; data test; if 0 then set have_lookup_table; if _n_=1 then do; *Define empty hash table to be filled with only the necessary lookups, depending on what is in the base table; dcl hash h(); h.definekey("Item"); h.definedata("Item","Price"); h.definedone(); end; *Fill empty hash table with necessary keys from base table; do until (eof); set have_base_table end=eof; rc_ref=h.ref(); end;
h.output(dataset:"hash_keys");
*Fill hash table with corresponding key-matched data; do until (last2); set have_lookup_table end=last2; rc_replace=h.replace(); end; h.output(dataset:"hash_keys_and_data"); call missing(Price);
*Merge finished hash table onto base data; do until (last); set have_base_table end=last; rc_h=h.find(); output; end; run;
You could do this:
proc sql noprint;
select distinct cats("'",id,"'")
into :id_list separated by ',' from have_base_table;
quit;
%put &=id_list;
data want (drop=_:);
set have_base_table;
if _n_=1 then do;
if 0 then set have_lookup_table;
declare hash h (dataset:"have_lookup_table (where=(id in (&id_list.)))");
h.definekey('id');
h.definedata(all:'Y');
h.definedone();
end;
_rc=h.find();
output;
call missing(of _all_);
run;
You could do this:
proc sql noprint;
select distinct cats("'",id,"'")
into :id_list separated by ',' from have_base_table;
quit;
%put &=id_list;
data want (drop=_:);
set have_base_table;
if _n_=1 then do;
if 0 then set have_lookup_table;
declare hash h (dataset:"have_lookup_table (where=(id in (&id_list.)))");
h.definekey('id');
h.definedata(all:'Y');
h.definedone();
end;
_rc=h.find();
output;
call missing(of _all_);
run;
I was originally unable to use this solution because the variable list was too long to fit inside of a macro variable. However, I stumbled across this paper: https://www.lexjansen.com/sesug/2019/SESUG2019_Paper-117_Final_PDF.pdf
It contains a - *deep breath* - macro that creates a series of macro variables that are then stored inside a macro variable. The result is something like this:
*Assuming a bin size of 1 for illustration purposes only - could be much larger;
%let bin1 = Carrot;
%let bin2 = Apple;
%let bin3 = Banana;
%let bin4 = Orange;
%let master_bin = &bin1, &bin2, &bin3, &bin4;
Then, you would use the master variable in place of the id_list variable in the solution.
Thanks for the help!
Below @mkeintz solution with a few minor tweaks to the code.
proc sql noprint;
select distinct cats("'",id,"'")
into :id_list separated by ',' from have_base_table;
quit;
%put &=id_list;
data want (drop=_:);
if _n_=1 then
do;
if 0 then set have_lookup_table;
declare hash h (dataset:"have_lookup_table (where=(id in (&id_list.)))");
h.definekey('id');
h.definedata('Price');
h.definedone();
end;
call missing(of _all_);
set have_base_table;
_rc=h.find();
run;
And just for fun here another "hash" way for doing this.
data _null_;
if _n_=1 then
do;
if 0 then set have_base_table have_lookup_table;
dcl hash h(multidata:'y', ordered:'y');
h.definekey('Item');
h.definedata('Item','Trans_Date','Code','Price');
h.definedone();
do until(done);
set have_base_table end=done;
h.add();
end;
end;
set have_lookup_table(rename=(price=_price)) end=last;
do while(h.do_over() eq 0);
price=_price;
h.replacedup();
end;
if last then h.output(dataset:'want');
run;
proc print data=want;
run;
If your lookup table is large, it is probably a lot more efficient to create an index and use that to find the records:
proc sql;
create unique index item on have_lookup_table(item);
quit;
data want notfound;
set have_base_table;
set have_lookup_table key=item/unique;
if _iorc_=0 then output want;
else output notfound;
run;
- at least if you are using this kind of lookup often.
A couple of notes:
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.