topic using hash to select observations in New SAS User

using hash to select observations

changxuosu — Mon, 06 May 2019 20:10:28 GMT

Hi SAS communities,

I've heard the power of "in-memory" table look-up presented by hash tables.

Having no backgrounds in programming, I still didn't understand the advantage of hash tables after reading a bunch of sas papers online about hash table.

Now I come across a problem in SAS coding and just wondering whether this problem can be solved by hash tables.

I'm soliciting general ideas, so no detailed sas code and data sets are available for me to provide. I have no idea how to write hash table code on this problem. I'm just curious whether this problem can be solved by hash tables.

Here is the problem:

I have two datasets A and B. Both datasets have the same variables, for example, acct and revenue. The only difference is that dataset A has 1milion observations and dataset B has 1milllion plus 20 observations that are not in dataset A.

I want to find who these 20 additional observations are in dataset B that can not be found in dataset A.

I write a proc sql code like the following to approach this problem:

proc sql;

select *

from datasetB

where acct_key not in ( select distinct acct_key from datasetA);

quit;

The problem is that, with 1million and 1million+20 observations in each dataset, this piece of code takes insane amount of time for SAS to finish the job. My sas even crushes once and runs forever. Wondering how Hash table can help in this case?? How can the magical "in-memory table look up" function possessed by hash table help me to cut down the running time and do the job more efficiently?

Thanks in advance to all dear sas community members. I'm indebted to your thoughts/inputs/ideas/suggestions.

Re: using hash to select observations

PeterClemmensen — Tue, 07 May 2019 06:55:23 GMT

Sounds like a nice job for a hash object. As you mention, we can't see your data, so needless to say, this code is untested but see if this gives you what you want

data want;
   if 0 then set A;
   if _N_ = 1 then do;
      declare hash h(dataset:'A');
      h.defineKey(acct_key);
      h.defineDone();
   end;

   set B;

   if h.check() ne 0;
run;

Before explaining this 'magical' in-memory technique, let me know if this works for you 🙂 An absolute requirement for it to work is that the data in the hash object fits in memory.

Re: using hash to select observations

PeterClemmensen — Wed, 08 May 2019 13:28:28 GMT

@changxuosu, did the above solve your problem? 🙂

Re: using hash to select observations

changxuosu — Fri, 17 May 2019 15:08:09 GMT

@PeterClemmensen I'm sorry it took a while to get back to you. I was kept from unexpected deadlines. Thank you so much for your suggestion. I tried it and still trying to understand it.
I met the following error when tried it, so that you know.

ERROR: Undeclared key symbol for hash object at line 31 column 7.
ERROR: DATA STEP Component Object failure. Aborted during the EXECUTION phase.

Re: using hash to select observations

PeterClemmensen — Sat, 18 May 2019 08:58:05 GMT

Sorry, the key variable should be in quotation marks like this. How about that?

data want;
   if 0 then set A;
   if _N_ = 1 then do;
      declare hash h(dataset:'A');
      h.defineKey('acct_key');
      h.defineDone();
   end;

   set B;

   if h.check() ne 0;
run;

Re: using hash to select observations

changxuosu — Mon, 20 May 2019 16:05:56 GMT

thank you so much for your followup. It worked greatly 🙂

Re: using hash to select observations

PeterClemmensen — Mon, 20 May 2019 19:42:21 GMT

No problem. I'm glad you found your answer. Do you have any questions about the code?

Re: using hash to select observations

changxuosu — Fri, 24 May 2019 14:01:16 GMT

I don't have any questions for now. Have a nice day, warm hearted SAS expert 🙂