DATA Step, Macro, Functions and more

hash solution

Steelers_In_DC (Valued Guide)

I am trying to crash-course my way through a hash solution with little luck.  I have 5 different account number variables, and I am looking to join the first (acctnbr) to any of the following 4.  Then I will sort by acctnbr and opendate descending to find the oldest original account number associated with any real account.  I can use name or SSN to confirm accuracy.  The first dataset below has 8 million records and the second has 16 million, and these represent 3 months; later I will be using 12 months.  Is a hash a viable solution for this many records?  If not, I also have an SQL join below, but I'm not sure it produces the join I am looking for.  If you can take a look at that attempt at the SQL join as well, I would appreciate it.

 

 

data consumer_agg;
set consumer;
keep acctnbr origacct opendate;
run;

data fdr_agg;
set fdr;
keep cracct xref1 xref2 xrefacct opendate;
run;

data orig_opendate;
    length acctnbr $20 origacct $20 opendate 8;
    if _n_ = 1 then do;
        declare hash e(dataset: 'work.fdr_agg');
        e.definekey ('acctnbr');
        e.definedata('origacct','opendate','cracct','xref1','xref2','xrefacct');
        e.definedone();
    end;
    set consumer_agg;
    drop rc;
    rc=e.find();
run;

 

 

/*********************/

SQL solution:

 

data test1;
set consumer_agg(obs=100);
run;

data test2;
set fdr_agg(obs=100);
run;

proc sql;                  
  create table cartesian_join as    
    select distinct a.acctnbr,a.origacct,a.opendate as cons_date,b.*                    
    from test1 a,
          test2 b
order by opendate desc,cons_date desc;
quit;



All Replies
AhmedAl_Attar (Regular Contributor)

Re: hash solution

Posted in reply to Steelers_In_DC

Hi,

 

With data sets of 8 million and 16 million records, I would be very careful not to run out of memory within your SAS session, and on your machine!

 

The hash object lives in memory, which is finite. Depending on how wide the records in each data set are (keys plus data), you could quickly saturate your memory!

 

There are alternatives to Hash Object to handle merging/joining large data sets like yours, but they come with their own price tag (Memory/CPU/Disk/Code length and complexity).

Choosing the Right Technique to Merge Large Data Sets Efficiently

 

In the past, I have used a single data step with a mixture of several hash objects and SET KEY= statements to handle fact-table-to-multiple-dimension-table joins in a single data pass/read.
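Not from the thread itself, but a minimal sketch of the SET KEY= pattern mentioned above, with made-up names (fact, dim, acctnbr, dim_attr). SET KEY= does an indexed lookup on disk, so only index pages, not the whole lookup table, compete for memory:

```sas
/* Create a simple index on the lookup table (one-time cost). */
proc datasets lib=work nolist;
  modify dim;
  index create acctnbr;
quit;

data joined;
  set fact;                        /* one pass over the fact table     */
  set dim key=acctnbr / unique;    /* indexed lookup instead of a hash */
  if _iorc_ ne 0 then do;          /* no match for this acctnbr        */
    _error_ = 0;                   /* suppress the error flag          */
    call missing(dim_attr);        /* clear the stale lookup value     */
  end;
run;
```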

 

In short, sometimes, the best solution is a mixture of techniques rather than single technique.

 

Good luck,

Ahmed

 

 

mkeintz (Trusted Advisor)

Re: hash solution

Posted in reply to AhmedAl_Attar

Yeah, but 8 million records with 3 variables in the likely lookup table is not that much memory these days. Let's say each item adds 50 bytes to the hash object; that's less than half a gigabyte, very common in today's computers.  My five-year-old laptop has 4GB.

 

My question is what is @Steelers_In_DC requesting?

 

  1. Is it to look at each of the 4 account variables in fdr_agg (cracct xref1 xref2 xrefacct), match to acctnbr in consumer_agg, and keep the match that finds the oldest opendate from consumer_agg?
  2. If so, then what about OPENDATE, which appears in both consumer_agg and fdr_agg?  Does the OP want both?
  3. Does consumer_agg have one record per acctnum?

 


Steelers_In_DC (Valued Guide)

Re: hash solution

Unfortunately I do not know what the data looks like.  I might find some bad data, meaning the account number might not be unique.  After I get the dataset I want, I will try to find any duplicate account numbers per SSN.  I believe the code below will get me what I am looking for; I was wondering if there is a better way.  The final count will be much larger than what I stated before: I was using only 3 months, and in production I'll be using 12:

 

options mprint symbolgen;

%macro dates1;
%do date = 201601 %to 201605;
alp.consumer&date
%end;
%mend;

data consumer_agg;
set %dates1 indsname=source;
RptMonth = source;
keep acctnbr origacct RptMonth;
run;

%macro dates2;
%do date = 201601 %to 201605;
fdr.fdr&date
%end;
%mend;

data fdr_agg_prep;
set %dates2 indsname=source;
RptMonth = source;
keep cracct xref1 xref2 xrefacct RptMonth;
run;

data fdr_agg(rename=(cracct2=cracct xref12=xref1 xref22=xref2));
set fdr_agg_prep;
length cracct2 xref12 xref22 $16;
cracct2 = cracct;
xref12 = xref1;
xref22 = xref2;
drop cracct xref1 xref2;
run;

proc sql;                  
  create table cartesian_join as    
    select a.acctnbr,a.origacct,b.cracct,b.xref1,b.xref2,b.xrefacct                    
    from consumer_agg a,
          fdr_agg b;              
quit;

 

 

mkeintz (Trusted Advisor)

Re: hash solution

Posted in reply to Steelers_In_DC

If ACCTNUM has duplicates in consumer_agg, how will you know which to use?   Is fdr_agg really the lookup table, as you originally posted?   I guess my question is: which opendate are you using to determine the earliest account number, the one in fdr_agg or the one in consumer_agg?

mkeintz (Trusted Advisor)

Re: hash solution

Posted in reply to Steelers_In_DC

I have notions of what you want to do, but I'm thrown for a loop by your SQL example.  It generates a Cartesian crossing with no matching criteria.  Do you REALLY want to make 8M*16M = 128 trillion records?  Then what would you do?

 

I thought the whole idea is to examine which account numbers match between the two data sets, and then find the match with the oldest opendate.  That's why I asked to confirm which vars in consumer_agg are to be matched by which variables in fdr_agg, and which opendate var determines the oldest.

 

I'm just going to go on intuition.  I assume there is one record per acctnum in consumer_agg (if there isn't, either clean it up first, presumably by keeping only the oldest entry, or one can modify the hash technique below to accommodate duplicates).  I also assume that you want one output record per incoming fdr_agg record, keeping all the fdr_agg vars and adding data from the oldest matching entry from consumer_agg (with opendate renamed to consdate).  The matching process will match acctnbr in consumer_agg with each of cracct, xref1, xref2, xrefacct in fdr_agg in order to find the oldest.

 

This is untested

 

data want;
  if _n_=1 then do;
    if 0 then set consumer_agg  (rename=(opendate=consdate));
    declare hash e (hashexp:19,dataset:'consumer_agg (rename=(opendate=consdate))');
      e.definekey('acctnbr');
      e.definedata(all:'y');
      e.definedone();
    rc=e.add();   /* add a "null" record for easy reset to missing values using e.find(key:.)  */
  end;

  set fdr_agg;
  array acct {4} cracct xref1 xref2 xrefacct;
  min_date=.;
  do I=1 to 4;
    rc=e.find(key:acct{I});
    if rc=0 and (min_date=. or consdate<min_date) then do;
      min_date=consdate;
      min_I=I;
    end;
  end;
  if min_I=. then e.find(key:.);  /* set  consumer_agg vars to missing */
  else e.find(key:acct{min_I});
  drop min_: I ;
run;

 

 

Notes:

  1. The "hashexp:19" tells SAS to establish 2**19 (=524,288) lookup "buckets" (hashexp:20 is the max; the default is 8, I think, equivalent to 256 buckets).  In this case, each bucket will hold an average of 8M/524K = about 16 items in sorted order, for binary lookup after hash access to the bucket.  That's where all the memory usage goes.
Steelers_In_DC (Valued Guide)

Re: hash solution

I'm getting the following error:

 

ERROR: Type mismatch for key variable ACCTNBR at line 43 column 19.
ERROR: Keys and data must be specified in the same order and have the same types as given in DefineKey/DefineData at line 43 column
       19.

 

I'm not sure what to do with this.  A quick Google search for the error hasn't helped.

mkeintz (Trusted Advisor)

Re: hash solution

Posted in reply to Steelers_In_DC

@Steelers_In_DC

 

As I said in my other note, I think you want to compare each of the acct variables (cracct xref1 xref2 xrefacct) in fdr_agg to ACCTNBR in consumer_agg, and then select the consumer_agg record with the earliest opendate.  I presume your goal is to get ORIGACCT and its corresponding opendate using ACCTNBR as the lookup key.   If so, then consumer_agg is the lookup table to put in a hash object, not fdr_agg as you appear to think.

 

And in the end, are you trying to assign every record in FDR_AGG to its proper origacct, or are you merely interested in making a new lookup table from all the account numbers in fdr_agg to origacct?

 

 

 

Steelers_In_DC (Valued Guide)

Re: hash solution

Thank you so much for putting this together, but unfortunately the hash solution will not work, as I run out of memory:

 

ERROR: Hash object added 1441776 items when memory failure occurred.
FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.
ERROR: The SAS System stopped processing this step because of insufficient memory.
WARNING: The data set WORK.WANT may be incomplete.  When this step was stopped there were 0 observations and 764 variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.

 

Any other ideas?

Patrick (Respected Advisor)

Re: hash solution

Posted in reply to Steelers_In_DC

Besides the memory issue: would the code logic @mkeintz posted cover your use case?

 

If so: would you have sufficient memory to load only the key column (acctnbr) into the hash, plus a single numeric column in the data portion (the observation number of the source table row)?

 

If there is enough memory for that, then one way I've gotten around memory restrictions in the past is to load the key and the source-table row number into a hash, and then, on a match during lookup, read the other data directly from the source (set <source table> point=<row number looked up in hash>).
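A hedged, untested sketch of the technique Patrick describes, using the thread's dataset names (it assumes acctnbr and cracct are character variables of the same length). Only the key and a row number sit in memory; the wide record is fetched from disk with POINT= on a match:

```sas
data want;
  length rowno 8;
  if _n_=1 then do;
    declare hash e(hashexp:16);
    e.definekey('acctnbr');
    e.definedata('rowno');
    e.definedone();
    /* load only acctnbr plus its row number: a slim hash */
    do rowno=1 by 1 until(done);
      set consumer_agg(keep=acctnbr) end=done;
      rc=e.add();
    end;
  end;
  set fdr_agg;
  if e.find(key:cracct)=0 then
    set consumer_agg point=rowno;   /* fetch the wide row from disk */
  /* note: on a miss, the consumer_agg vars keep their prior values */
  drop rc rowno;
run;
```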

Steelers_In_DC (Valued Guide)

Re: hash solution

I don't feel comfortable enough with hash to attempt something like that.  I'm learning a lot about the dataset as I go through this step and would like something I can pick apart if I need to.

 

Here is what I want to accomplish, I'm wondering if there is a better way:

 

proc sql;
create table history as
select a.*,b.cracct,b.xref1,b.xref2,b.xrefacct,b.opendate as fdr_opendate
from consumer_agg a left join
     fdr_agg b on
a.acctnbr = b.cracct or
a.acctnbr = b.xref1 or
a.acctnbr = b.xref2 or
a.acctnbr = b.xrefacct or
a.origacct = b.cracct or
a.origacct = b.xref1 or
a.origacct = b.xref2 or
a.origacct = b.xrefacct;
quit;

Patrick (Respected Advisor)

Re: hash solution

Posted in reply to Steelers_In_DC

@Steelers_In_DC

I guess we're now at a point where you need to provide some representative sample data (a data step creating such data) and then also post the desired output. 

mkeintz (Trusted Advisor)

Re: hash solution

[ Edited ]
Posted in reply to Steelers_In_DC

Now you've presented SQL a bit different from what I imagined: (1) you're comparing to BOTH acctnbr and origacct in consumer_agg, and (2) you're not selecting the oldest match.  I guess you plan to do that in a later step.

 

One approach is to sort consumer_agg by date and then run my program against portions of it (say about 1.2 million records at a time).  If you start with the oldest portion, then on a hit you can consider that fdr_agg record fully processed.  Then run the remaining fdr_agg records against the second-oldest portion of consumer_agg, and so on.
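A rough macro skeleton of that chunking scheme (dataset name and sizes are placeholders, not tested): slice the date-sorted table with FIRSTOBS=/OBS= and load one slice into the hash per pass:

```sas
%macro chunked(ds=ca_sorted, chunk=1200000, total=8000000);
  %local i first last;
  %do i=1 %to %sysevalf(&total/&chunk, ceil);
    %let first=%eval((&i-1)*&chunk+1);
    %let last =%eval(&i*&chunk);
    /* inside the lookup data step, load only this slice:
       declare hash e(dataset:"&ds(firstobs=&first obs=&last)"); */
  %end;
%mend;
```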

 

Or there is another hash approach that could save lots of space.  It would record each date only once in a master hash object (a hash of hashes, HOH).  Then for each date, there would be a corresponding slave hash object containing only the account numbers associated with it.  That way you don't record a date value 8 million times in memory (or 16 million if you key on both acctnbr and origacct).  The tradeoffs: (1) overhead for each slave hash object, and (2) you'll have to loop through dates to find the oldest one matching any of your acct numbers in fdr_agg.
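An untested sketch of that hash-of-hashes layout (assuming consumer_agg carries opendate and acctnbr): the master hash holds one item per distinct date, and each item points at a slave hash of the account numbers opened on that date:

```sas
data _null_;
  declare hash dates(ordered:'a');     /* master: one item per date   */
  declare hash accts;                  /* reference to a slave hash   */
  dates.definekey('opendate');
  dates.definedata('opendate','accts');
  dates.definedone();

  do until(done);
    set consumer_agg end=done;
    if dates.find() ne 0 then do;      /* first time we see this date */
      accts = _new_ hash();
      accts.definekey('acctnbr');
      accts.definedone();
      rc = dates.add();
    end;
    rc = accts.add();                  /* file acctnbr under its date */
  end;
  stop;
run;
```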

 

Before discussing modifications of the hash code, first reduce consumer_agg to its core, i.e. get rid of all duplicate ACCTNUM and duplicate ORIGACCT records, keeping only the oldest date for each.  How big is the CA_FINAL dataset below?

 

proc sort data=consumer_agg out=ca_sort;
  by opendate;
run;

/* Keep only the oldest date for acctnum */
proc sort data=ca_sort (drop=origacct) out=ca_acct nodupkey;
  by acctnum;
run;

/* Keep only the oldest date for origacct */
proc sort data=ca_sort (drop=acctnum) out=ca_orig nodupkey;
  by origacct;
run;

/* Join the two above, with three vars: date, anum, and atype */
data ca_final;
  merge ca_acct (in=ina rename=(acctnum=anum)) ca_orig (in=ino rename=(origacct=anum));
  by anum;
  length atype $1;
  if ina and ino then atype='Z';  /* both origacct and acctnum */
  else if ina then atype='A';     /* ACCTNUM only */
  else if ino then atype='O';     /* ORIGACCT only */
run;

 

 

 

Accepted Solution (posted 01-06-2017 01:55 PM)

Patrick (Respected Advisor)

Re: hash solution

@mkeintz

I'm still asking myself why there is a memory issue in the first place. Two things @Steelers_In_DC should check and tell us:

1. Memsize option setting

2. Length of account number variable loaded into hash
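The two checks above can be run directly (a small helper, assuming the lookup table is consumer_agg):

```sas
/* 1. Current memory ceiling for the session */
proc options option=memsize value;
run;

/* 2. Lengths of the variables going into the hash */
proc contents data=consumer_agg;
run;
```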

 

Using the IBAN definition, with an account-number length of 34 characters, I can still load 8M rows into a hash in my environment, which then consumes 610MB of memory (see code below).

data have;
  length acctnum $34;
  do row_num=1 to 8*10**6;
    acctnum='AAA-'||put(row_num,z15.)||put(row_num,z15.);
    output;
  end;
run;

data _null_;
  set have end=last;
  if _n_=1 then
    do;
      dcl hash h1(dataset:'have',hashexp:10);
      h1.defineKey('acctnum');
      h1.defineData('row_num');
      h1.defineDone();
    end;
  if last then
    do;
      item_size=h1.item_size;
      hash_size=h1.item_size * h1.num_items;
      hash_size_MB=hash_size/(1024*1024);

      put item_size=;
      put hash_size=;
      put hash_size_MB= best32.;
    end;
run;

Result:

item_size=80
hash_size=640000000
hash_size_MB=610.3515625

mkeintz (Trusted Advisor)

Re: hash solution

@Patrick

 

That occurred to me too, and then it just disappeared from my thought process.

 

Might it be the hashexp param?  What happens to your memory report when you go from hashexp:10 to hashexp:20, i.e. from 1K to 1M hash buckets?  Perhaps the solution will be to keep hashexp relatively low.

 

 

 

 
