BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
theponcer
Quartz | Level 8

I am trying to do the following:

 

1) Get a list of keys from the base data

2) Get the info from the lookup table for these keys

3) Merge the hash table w/info onto the base table

 

To do this, I am trying to implement a hash in place (I think this is what it is called). The problem I am running into is that this doesn't work quickly if the lookup table is very large. In this example, imagine if "have_lookup_table" was 5M records long. I'm setting that entire table just to get the lookup info for 4 records, even if those four records are the first four records of the 5M record dataset. Is there a way to get the desired result without having to set the entire table?

 


data have_base_table; infile datalines delimiter=','; input Item:$8. Trans_Date:yymmdd10. Code :$4.; format Trans_Date yymmdd10.; datalines; Apple, 2020-07-29, 4036 Orange, 2020-07-29, 4031 Carrot, 2020-07-21, 4110 Banana, 2020-07-31, 5191 ; data have_lookup_table; infile datalines delimiter=','; input Item $ Price 8.; datalines; Apple, 3.00 Orange, 1.00 Carrot, 0.50 Banana, 0.75 ; data want; infile datalines delimiter=','; input Item:$8. Trans_Date:yymmdd10. Code :$4. Price:8.; format Trans_Date yymmdd10.; datalines; Apple, 2020-07-29, 4036, 3.00 Orange, 2020-07-29, 4031, 1.00 Carrot, 2020-07-21, 4110, 0.50 Banana, 2020-07-31, 5191, 0.75 ; run; data test; if 0 then set have_lookup_table; if _n_=1 then do; *Define empty hash table to be filled with only the necessary lookups, depending on what is in the base table; dcl hash h(); h.definekey("Item"); h.definedata("Item","Price"); h.definedone(); end; *Fill empty hash table with necessary keys from base table; do until (eof); set have_base_table end=eof; rc_ref=h.ref(); end;
h.output(dataset:"hash_keys");

*Fill hash table with corresponding key-matched data; do until (last2); set have_lookup_table end=last2; rc_replace=h.replace(); end; h.output(dataset:"hash_keys_and_data"); call missing(Price);
*Merge finished hash table onto base data; do until (last); set have_base_table end=last; rc_h=h.find(); output; end; run;
1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

You could do this:

 

proc sql noprint;
  select distinct cats("'",id,"'")
    into :id_list separated by ',' from have_base_table;
quit;
%put &=id_list;

data want (drop=_:);
 set have_base_table;
 if _n_=1 then do;
   if 0 then set have_lookup_table;
   declare hash h (dataset:"have_lookup_table (where=(id in (&id_list.)))");
     h.definekey('id');
	 h.definedata(all:'Y');
	 h.definedone();
  end;
  _rc=h.find();
  output;
  call missing(of _all_);
run;

 

  1. The PROC SQL creates a macrovar ID_LIST the looks like  'Apple','Banana','Carrot','Orange'.  Later this macro var is used an a "where=" data set name parameter to populate the hash table H with only the needed records from have_lookup_table.  A much smaller hash object.
  2. The explicit "output;" followed by a "call missing ..." statement is needed only if a lookup could fail (i.e. if an id is in have_base_table, but not in "have_lookup_table").  Such failures would leave the relevant variables with values from the most recent successful lookup.  So after each OUTPUT, the call missing prevents possible contamination of a failed lookup in the next record.

 

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

5 REPLIES 5
mkeintz
PROC Star

You could do this:

 

proc sql noprint;
  select distinct cats("'",id,"'")
    into :id_list separated by ',' from have_base_table;
quit;
%put &=id_list;

data want (drop=_:);
 set have_base_table;
 if _n_=1 then do;
   if 0 then set have_lookup_table;
   declare hash h (dataset:"have_lookup_table (where=(id in (&id_list.)))");
     h.definekey('id');
	 h.definedata(all:'Y');
	 h.definedone();
  end;
  _rc=h.find();
  output;
  call missing(of _all_);
run;

 

  1. The PROC SQL creates a macrovar ID_LIST the looks like  'Apple','Banana','Carrot','Orange'.  Later this macro var is used an a "where=" data set name parameter to populate the hash table H with only the needed records from have_lookup_table.  A much smaller hash object.
  2. The explicit "output;" followed by a "call missing ..." statement is needed only if a lookup could fail (i.e. if an id is in have_base_table, but not in "have_lookup_table").  Such failures would leave the relevant variables with values from the most recent successful lookup.  So after each OUTPUT, the call missing prevents possible contamination of a failed lookup in the next record.

 

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
theponcer
Quartz | Level 8

I was originally unable to use this solution because the variable list was too long to fit inside of a macro variable. However, I stumbled across this paper: https://www.lexjansen.com/sesug/2019/SESUG2019_Paper-117_Final_PDF.pdf

 

It contains a - *deep breath* - macro that creates a series of macro variables that are then stored inside a macro variable. The result is something like this:

 

*Assuming a bin size of 1 for illustration purposes only - could be much larger;
%let bin1 = Carrot;
%let bin2 = Apple;
%let bin3 = Banana;
%let bin4 = Orange;

%let master_bin = &bin1, &bin2, &bin3, &bin4;

Then, you would use the master variable in place of the id_list variable in the solution.

 

Thanks for the help!

Patrick
Opal | Level 21

Below @mkeintz solution with a few minor tweaks to the code.

proc sql noprint;
  select distinct cats("'",id,"'")
    into :id_list separated by ',' from have_base_table;
quit;

%put &=id_list;

data want (drop=_:);
  if _n_=1 then
    do;
      if 0 then set have_lookup_table;
      declare hash h (dataset:"have_lookup_table (where=(id in (&id_list.)))");
      h.definekey('id');
      h.definedata('Price');
      h.definedone();
    end;
  call missing(of _all_);

  set have_base_table;
  _rc=h.find();
run;
Patrick
Opal | Level 21

And just for fun here another "hash" way for doing this.

data _null_;

  if _n_=1 then
    do;
      if 0 then set have_base_table have_lookup_table;
      dcl hash h(multidata:'y', ordered:'y');
      h.definekey('Item');
      h.definedata('Item','Trans_Date','Code','Price');
      h.definedone();
      do until(done);
        set have_base_table end=done;
        h.add();
      end;
    end;
  
  set have_lookup_table(rename=(price=_price)) end=last;

  do while(h.do_over() eq 0);
    price=_price;
    h.replacedup();
  end;

  if last then h.output(dataset:'want');
run;

proc print data=want;
run;
s_lassen
Meteorite | Level 14

If your lookup table is large, it is probably a lot more efficient to create an index and use that to find the records:

proc sql;
  create unique index item on have_lookup_table(item);
quit;

data want notfound;
  set have_base_table;
  set have_lookup_table key=item/unique;
  if _iorc_=0 then output want;
  else output notfound;
run;

- at least if you are using this kind of lookup often.

 

A couple of notes:

  • I created a unique index on the lookup table, meaning that duplicate values of the ITEM variable are not allowed
  • I used the UNIQUE option for the SET with KEY=, so that you can find values for an item also when there are repeats in the base table
  • I put in a bit of code to account for the situations where the item on the base table is not in the lookup table, using the _IORC_ automatic variable - these will be output to the table NOTFOUND in the example.

sas-innovate-white.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 1477 views
  • 4 likes
  • 4 in conversation