I am trying to do the following:
1) Get a list of keys from the base data
2) Get the info from the lookup table for these keys
3) Merge the hash table w/info onto the base table
To do this, I am trying to implement a hash in place (I think this is what it is called). The problem I am running into is that this doesn't work quickly if the lookup table is very large. In this example, imagine if "have_lookup_table" was 5M records long. I'm setting that entire table just to get the lookup info for 4 records, even if those four records are the first four records of the 5M record dataset. Is there a way to get the desired result without having to set the entire table?
data have_base_table; infile datalines delimiter=','; input Item:$8. Trans_Date:yymmdd10. Code :$4.; format Trans_Date yymmdd10.; datalines; Apple, 2020-07-29, 4036 Orange, 2020-07-29, 4031 Carrot, 2020-07-21, 4110 Banana, 2020-07-31, 5191 ; data have_lookup_table; infile datalines delimiter=','; input Item $ Price 8.; datalines; Apple, 3.00 Orange, 1.00 Carrot, 0.50 Banana, 0.75 ; data want; infile datalines delimiter=','; input Item:$8. Trans_Date:yymmdd10. Code :$4. Price:8.; format Trans_Date yymmdd10.; datalines; Apple, 2020-07-29, 4036, 3.00 Orange, 2020-07-29, 4031, 1.00 Carrot, 2020-07-21, 4110, 0.50 Banana, 2020-07-31, 5191, 0.75 ; run; data test; if 0 then set have_lookup_table; if _n_=1 then do; *Define empty hash table to be filled with only the necessary lookups, depending on what is in the base table; dcl hash h(); h.definekey("Item"); h.definedata("Item","Price"); h.definedone(); end; *Fill empty hash table with necessary keys from base table; do until (eof); set have_base_table end=eof; rc_ref=h.ref(); end;
h.output(dataset:"hash_keys");
*Fill hash table with corresponding key-matched data; do until (last2); set have_lookup_table end=last2; rc_replace=h.replace(); end; h.output(dataset:"hash_keys_and_data"); call missing(Price);
*Merge finished hash table onto base data; do until (last); set have_base_table end=last; rc_h=h.find(); output; end; run;
You could do this:
proc sql noprint;
select distinct cats("'",id,"'")
into :id_list separated by ',' from have_base_table;
quit;
%put &=id_list;
data want (drop=_:);
set have_base_table;
if _n_=1 then do;
if 0 then set have_lookup_table;
declare hash h (dataset:"have_lookup_table (where=(id in (&id_list.)))");
h.definekey('id');
h.definedata(all:'Y');
h.definedone();
end;
_rc=h.find();
output;
call missing(of _all_);
run;
You could do this:
proc sql noprint;
select distinct cats("'",id,"'")
into :id_list separated by ',' from have_base_table;
quit;
%put &=id_list;
data want (drop=_:);
set have_base_table;
if _n_=1 then do;
if 0 then set have_lookup_table;
declare hash h (dataset:"have_lookup_table (where=(id in (&id_list.)))");
h.definekey('id');
h.definedata(all:'Y');
h.definedone();
end;
_rc=h.find();
output;
call missing(of _all_);
run;
I was originally unable to use this solution because the variable list was too long to fit inside of a macro variable. However, I stumbled across this paper: https://www.lexjansen.com/sesug/2019/SESUG2019_Paper-117_Final_PDF.pdf
It contains a - *deep breath* - macro that creates a series of macro variables that are then stored inside a macro variable. The result is something like this:
*Assuming a bin size of 1 for illustration purposes only - could be much larger;
%let bin1 = Carrot;
%let bin2 = Apple;
%let bin3 = Banana;
%let bin4 = Orange;
%let master_bin = &bin1, &bin2, &bin3, &bin4;
Then, you would use the master variable in place of the id_list variable in the solution.
Thanks for the help!
Below @mkeintz solution with a few minor tweaks to the code.
proc sql noprint;
select distinct cats("'",id,"'")
into :id_list separated by ',' from have_base_table;
quit;
%put &=id_list;
data want (drop=_:);
if _n_=1 then
do;
if 0 then set have_lookup_table;
declare hash h (dataset:"have_lookup_table (where=(id in (&id_list.)))");
h.definekey('id');
h.definedata('Price');
h.definedone();
end;
call missing(of _all_);
set have_base_table;
_rc=h.find();
run;
And just for fun here another "hash" way for doing this.
data _null_;
if _n_=1 then
do;
if 0 then set have_base_table have_lookup_table;
dcl hash h(multidata:'y', ordered:'y');
h.definekey('Item');
h.definedata('Item','Trans_Date','Code','Price');
h.definedone();
do until(done);
set have_base_table end=done;
h.add();
end;
end;
set have_lookup_table(rename=(price=_price)) end=last;
do while(h.do_over() eq 0);
price=_price;
h.replacedup();
end;
if last then h.output(dataset:'want');
run;
proc print data=want;
run;
If your lookup table is large, it is probably a lot more efficient to create an index and use that to find the records:
proc sql;
create unique index item on have_lookup_table(item);
quit;
data want notfound;
set have_base_table;
set have_lookup_table key=item/unique;
if _iorc_=0 then output want;
else output notfound;
run;
- at least if you are using this kind of lookup often.
A couple of notes:
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.