BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
theponcer
Quartz | Level 8

I am trying to do the following:

 

1) Get a list of keys from the base data

2) Get the info from the lookup table for these keys

3) Merge the hash table w/info onto the base table

 

To do this, I am trying to implement a hash in place (I think this is what it is called). The problem I am running into is that this doesn't work quickly if the lookup table is very large. In this example, imagine if "have_lookup_table" was 5M records long. I'm setting that entire table just to get the lookup info for 4 records, even if those four records are the first four records of the 5M record dataset. Is there a way to get the desired result without having to set the entire table?

 


data have_base_table; infile datalines delimiter=','; input Item:$8. Trans_Date:yymmdd10. Code :$4.; format Trans_Date yymmdd10.; datalines; Apple, 2020-07-29, 4036 Orange, 2020-07-29, 4031 Carrot, 2020-07-21, 4110 Banana, 2020-07-31, 5191 ; data have_lookup_table; infile datalines delimiter=','; input Item $ Price 8.; datalines; Apple, 3.00 Orange, 1.00 Carrot, 0.50 Banana, 0.75 ; data want; infile datalines delimiter=','; input Item:$8. Trans_Date:yymmdd10. Code :$4. Price:8.; format Trans_Date yymmdd10.; datalines; Apple, 2020-07-29, 4036, 3.00 Orange, 2020-07-29, 4031, 1.00 Carrot, 2020-07-21, 4110, 0.50 Banana, 2020-07-31, 5191, 0.75 ; run; data test; if 0 then set have_lookup_table; if _n_=1 then do; *Define empty hash table to be filled with only the necessary lookups, depending on what is in the base table; dcl hash h(); h.definekey("Item"); h.definedata("Item","Price"); h.definedone(); end; *Fill empty hash table with necessary keys from base table; do until (eof); set have_base_table end=eof; rc_ref=h.ref(); end;
h.output(dataset:"hash_keys");

*Fill hash table with corresponding key-matched data; do until (last2); set have_lookup_table end=last2; rc_replace=h.replace(); end; h.output(dataset:"hash_keys_and_data"); call missing(Price);
*Merge finished hash table onto base data; do until (last); set have_base_table end=last; rc_h=h.find(); output; end; run;
1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

You could do this:

 

proc sql noprint;
  select distinct cats("'",id,"'")
    into :id_list separated by ',' from have_base_table;
quit;
%put &=id_list;

data want (drop=_:);
 set have_base_table;
 if _n_=1 then do;
   if 0 then set have_lookup_table;
   declare hash h (dataset:"have_lookup_table (where=(id in (&id_list.)))");
     h.definekey('id');
	 h.definedata(all:'Y');
	 h.definedone();
  end;
  _rc=h.find();
  output;
  call missing(of _all_);
run;

 

  1. The PROC SQL creates a macrovar ID_LIST the looks like  'Apple','Banana','Carrot','Orange'.  Later this macro var is used an a "where=" data set name parameter to populate the hash table H with only the needed records from have_lookup_table.  A much smaller hash object.
  2. The explicit "output;" followed by a "call missing ..." statement is needed only if a lookup could fail (i.e. if an id is in have_base_table, but not in "have_lookup_table").  Such failures would leave the relevant variables with values from the most recent successful lookup.  So after each OUTPUT, the call missing prevents possible contamination of a failed lookup in the next record.

 

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

5 REPLIES 5
mkeintz
PROC Star

You could do this:

 

proc sql noprint;
  select distinct cats("'",id,"'")
    into :id_list separated by ',' from have_base_table;
quit;
%put &=id_list;

data want (drop=_:);
 set have_base_table;
 if _n_=1 then do;
   if 0 then set have_lookup_table;
   declare hash h (dataset:"have_lookup_table (where=(id in (&id_list.)))");
     h.definekey('id');
	 h.definedata(all:'Y');
	 h.definedone();
  end;
  _rc=h.find();
  output;
  call missing(of _all_);
run;

 

  1. The PROC SQL creates a macrovar ID_LIST the looks like  'Apple','Banana','Carrot','Orange'.  Later this macro var is used an a "where=" data set name parameter to populate the hash table H with only the needed records from have_lookup_table.  A much smaller hash object.
  2. The explicit "output;" followed by a "call missing ..." statement is needed only if a lookup could fail (i.e. if an id is in have_base_table, but not in "have_lookup_table").  Such failures would leave the relevant variables with values from the most recent successful lookup.  So after each OUTPUT, the call missing prevents possible contamination of a failed lookup in the next record.

 

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
theponcer
Quartz | Level 8

I was originally unable to use this solution because the variable list was too long to fit inside of a macro variable. However, I stumbled across this paper: https://www.lexjansen.com/sesug/2019/SESUG2019_Paper-117_Final_PDF.pdf

 

It contains a - *deep breath* - macro that creates a series of macro variables that are then stored inside a macro variable. The result is something like this:

 

*Assuming a bin size of 1 for illustration purposes only - could be much larger;
%let bin1 = Carrot;
%let bin2 = Apple;
%let bin3 = Banana;
%let bin4 = Orange;

%let master_bin = &bin1, &bin2, &bin3, &bin4;

Then, you would use the master variable in place of the id_list variable in the solution.

 

Thanks for the help!

Patrick
Opal | Level 21

Below @mkeintz solution with a few minor tweaks to the code.

proc sql noprint;
  select distinct cats("'",id,"'")
    into :id_list separated by ',' from have_base_table;
quit;

%put &=id_list;

data want (drop=_:);
  if _n_=1 then
    do;
      if 0 then set have_lookup_table;
      declare hash h (dataset:"have_lookup_table (where=(id in (&id_list.)))");
      h.definekey('id');
      h.definedata('Price');
      h.definedone();
    end;
  call missing(of _all_);

  set have_base_table;
  _rc=h.find();
run;
Patrick
Opal | Level 21

And just for fun here another "hash" way for doing this.

data _null_;

  if _n_=1 then
    do;
      if 0 then set have_base_table have_lookup_table;
      dcl hash h(multidata:'y', ordered:'y');
      h.definekey('Item');
      h.definedata('Item','Trans_Date','Code','Price');
      h.definedone();
      do until(done);
        set have_base_table end=done;
        h.add();
      end;
    end;
  
  set have_lookup_table(rename=(price=_price)) end=last;

  do while(h.do_over() eq 0);
    price=_price;
    h.replacedup();
  end;

  if last then h.output(dataset:'want');
run;

proc print data=want;
run;
s_lassen
Meteorite | Level 14

If your lookup table is large, it is probably a lot more efficient to create an index and use that to find the records:

proc sql;
  create unique index item on have_lookup_table(item);
quit;

data want notfound;
  set have_base_table;
  set have_lookup_table key=item/unique;
  if _iorc_=0 then output want;
  else output notfound;
run;

- at least if you are using this kind of lookup often.

 

A couple of notes:

  • I created a unique index on the lookup table, meaning that duplicate values of the ITEM variable are not allowed
  • I used the UNIQUE option for the SET with KEY=, so that you can find values for an item also when there are repeats in the base table
  • I put in a bit of code to account for the situations where the item on the base table is not in the lookup table, using the _IORC_ automatic variable - these will be output to the table NOTFOUND in the example.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 836 views
  • 4 likes
  • 4 in conversation