Solved: Re: Hash In Place

theponcer · Posted 07-31-2020 02:17 PM

I am trying to do the following:

1) Get a list of keys from the base data

2) Get the info from the lookup table for these keys

3) Merge the hash table w/info onto the base table

To do this, I am trying to implement a hash in place (I think this is what it is called). The problem I am running into is that this doesn't work quickly if the lookup table is very large. In this example, imagine if "have_lookup_table" was 5M records long. I'm setting that entire table just to get the lookup info for 4 records, even if those four records are the first four records of the 5M record dataset. Is there a way to get the desired result without having to set the entire table?


data have_base_table;
   infile datalines delimiter=','; 
   input Item:$8. Trans_Date:yymmdd10. Code :$4.;
   format Trans_Date yymmdd10.;
   datalines;                      
Apple, 2020-07-29, 4036
Orange, 2020-07-29, 4031
Carrot, 2020-07-21, 4110
Banana, 2020-07-31, 5191
;

data have_lookup_table;
   infile datalines delimiter=','; 
   input Item $ Price 8.;
   datalines;                      
Apple, 3.00
Orange, 1.00
Carrot, 0.50
Banana, 0.75
;

data want;
   infile datalines delimiter=','; 
   input Item:$8. Trans_Date:yymmdd10. Code :$4. Price:8.;
   format Trans_Date yymmdd10.;
   datalines;                      
Apple, 2020-07-29, 4036, 3.00
Orange, 2020-07-29, 4031, 1.00
Carrot, 2020-07-21, 4110, 0.50
Banana, 2020-07-31, 5191, 0.75
;
run;

data test;

if 0 then set have_lookup_table;
if _n_=1 then do;

*Define empty hash table to be filled with only the necessary lookups, depending on what is in the base table;
dcl hash h();
h.definekey("Item");
h.definedata("Item","Price");
h.definedone();

end;

*Fill empty hash table with necessary keys from base table;
do until (eof);
set have_base_table end=eof;
rc_ref=h.ref();
end;

h.output(dataset:"hash_keys");

*Fill hash table with corresponding key-matched data;
do until (last2);
set have_lookup_table end=last2;
rc_replace=h.replace();
end;

h.output(dataset:"hash_keys_and_data");

call missing(Price);

*Merge finished hash table onto base data;
do until (last);
set have_base_table end=last;
rc_h=h.find();
output;
end;

run;

mkeintz · Posted 07-31-2020 10:25 PM

You could do this:

proc sql noprint;
  select distinct cats("'",id,"'")
    into :id_list separated by ',' from have_base_table;
quit;
%put &=id_list;

data want (drop=_:);
 set have_base_table;
 if _n_=1 then do;
   if 0 then set have_lookup_table;
   declare hash h (dataset:"have_lookup_table (where=(id in (&id_list.)))");
     h.definekey('id');
	 h.definedata(all:'Y');
	 h.definedone();
  end;
  _rc=h.find();
  output;
  call missing(of _all_);
run;

The PROC SQL creates a macrovar ID_LIST the looks like 'Apple','Banana','Carrot','Orange'. Later this macro var is used an a "where=" data set name parameter to populate the hash table H with only the needed records from have_lookup_table. A much smaller hash object.
The explicit "output;" followed by a "call missing ..." statement is needed only if a lookup could fail (i.e. if an id is in have_base_table, but not in "have_lookup_table"). Such failures would leave the relevant variables with values from the most recent successful lookup. So after each OUTPUT, the call missing prevents possible contamination of a failed lookup in the next record.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 07-31-2020 10:25 PM

You could do this:

proc sql noprint;
  select distinct cats("'",id,"'")
    into :id_list separated by ',' from have_base_table;
quit;
%put &=id_list;

data want (drop=_:);
 set have_base_table;
 if _n_=1 then do;
   if 0 then set have_lookup_table;
   declare hash h (dataset:"have_lookup_table (where=(id in (&id_list.)))");
     h.definekey('id');
	 h.definedata(all:'Y');
	 h.definedone();
  end;
  _rc=h.find();
  output;
  call missing(of _all_);
run;

The PROC SQL creates a macrovar ID_LIST the looks like 'Apple','Banana','Carrot','Orange'. Later this macro var is used an a "where=" data set name parameter to populate the hash table H with only the needed records from have_lookup_table. A much smaller hash object.
The explicit "output;" followed by a "call missing ..." statement is needed only if a lookup could fail (i.e. if an id is in have_base_table, but not in "have_lookup_table"). Such failures would leave the relevant variables with values from the most recent successful lookup. So after each OUTPUT, the call missing prevents possible contamination of a failed lookup in the next record.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

theponcer · Posted 08-05-2020 01:45 PM

I was originally unable to use this solution because the variable list was too long to fit inside of a macro variable. However, I stumbled across this paper: https://www.lexjansen.com/sesug/2019/SESUG2019_Paper-117_Final_PDF.pdf

It contains a - *deep breath* - macro that creates a series of macro variables that are then stored inside a macro variable. The result is something like this:

*Assuming a bin size of 1 for illustration purposes only - could be much larger;
%let bin1 = Carrot;
%let bin2 = Apple;
%let bin3 = Banana;
%let bin4 = Orange;

%let master_bin = &bin1, &bin2, &bin3, &bin4;

Then, you would use the master variable in place of the id_list variable in the solution.

Thanks for the help!

Patrick · Posted 07-31-2020 11:12 PM

Below @mkeintz solution with a few minor tweaks to the code.

proc sql noprint;
  select distinct cats("'",id,"'")
    into :id_list separated by ',' from have_base_table;
quit;

%put &=id_list;

data want (drop=_:);
  if _n_=1 then
    do;
      if 0 then set have_lookup_table;
      declare hash h (dataset:"have_lookup_table (where=(id in (&id_list.)))");
      h.definekey('id');
      h.definedata('Price');
      h.definedone();
    end;
  call missing(of _all_);

  set have_base_table;
  _rc=h.find();
run;

Patrick · Posted 07-31-2020 11:51 PM

And just for fun here another "hash" way for doing this.

data _null_;

  if _n_=1 then
    do;
      if 0 then set have_base_table have_lookup_table;
      dcl hash h(multidata:'y', ordered:'y');
      h.definekey('Item');
      h.definedata('Item','Trans_Date','Code','Price');
      h.definedone();
      do until(done);
        set have_base_table end=done;
        h.add();
      end;
    end;
  
  set have_lookup_table(rename=(price=_price)) end=last;

  do while(h.do_over() eq 0);
    price=_price;
    h.replacedup();
  end;

  if last then h.output(dataset:'want');
run;

proc print data=want;
run;

s_lassen · Posted 08-01-2020 10:55 AM

If your lookup table is large, it is probably a lot more efficient to create an index and use that to find the records:

proc sql;
  create unique index item on have_lookup_table(item);
quit;

data want notfound;
  set have_base_table;
  set have_lookup_table key=item/unique;
  if _iorc_=0 then output want;
  else output notfound;
run;

- at least if you are using this kind of lookup often.

A couple of notes:

I created a unique index on the lookup table, meaning that duplicate values of the ITEM variable are not allowed
I used the UNIQUE option for the SET with KEY=, so that you can find values for an item also when there are repeats in the base table
I put in a bit of code to account for the situations where the item on the base table is not in the lookup table, using the _IORC_ automatic variable - these will be output to the table NOTFOUND in the example.

Classroom Training Available!