DATA Step, Macro, Functions and more

Hash Tables - problems with large datasets

Contributor
Posts: 50

Hash Tables - problems with large datasets

Does anyone have links to a good beginner's tutorial on hash tables?  I have been googling and reading a lot, but the papers I have found so far are either too narrowly specific or too vague for me to understand the overall structure.  I have a dataset with 530 million observations and 250+ columns of sensor data (~3 TB).  The powers that be want stats summaries on ALL of the columns (n, min, max, mean, stddev, skew, kurtosis, var) by equipment id by date.  Being new to SAS, I did a lot of research, and it appears that hash tables would be the best approach, but several aspects of the programming are not clear to me.

My initial approach (and please direct me if there is a better one) is to use a hash table to subset the data by id (or id/date) and then run PROC SUMMARY on the subset.  I tried running the hash subset and ran out of memory (Win7, 8 GB memory).

data hash_results;
     set myLargeDataset;
     if (_n_ eq 1) then do;
          declare hash a(dataset:'myLargeDataset');
          a.defineKey('equipmentsernum', 'Date');
          a.defineData(all:'y');
          a.defineDone();
     end;
     equipmentsernum = '296737';
     if (a.find() eq 0);
run;

This code works on a subset of myLargeDataset, but on the big set it quickly runs out of memory.  Some things I haven't figured out about hash tables:

1) Can I save the resulting hash table to re-use outside of the data step?

2) Can I write a macro to loop through the hash? My thought was to use the hash table to subset myLargeDataset into a smaller table of just one serial number or id, then call PROC SUMMARY to get stats for that unit, then loop through to the next serial number, etc.

Any hash tutorials or pointers would be greatly appreciated.


Regards,


Fred

Contributor
Posts: 50

Re: Hash Tables - problems with large datasets

OK, before someone slaps me for doing something stupid: I realized that defineData(all:'y') with a dataset that large was crazy.  So I have removed it and am currently running the following against the 3 TB dataset just to see if I can create the hash table at all.  But my questions still hold: is there a better approach? Can I save the hash table? Can I loop through the hash table to subset the data? Links to hash table tutorials?

Thanks,


F:

data hash_results;
     set myLargeDataset;
     if (_n_ eq 1) then do;
          declare hash a(dataset:'myLargeDataset', multidata:'y');
          a.defineKey('equipmentsernum');
          a.defineData('equipmentsernum', 'Date');
          a.defineDone();
     end;
run;

Super Contributor
Posts: 334

Re: Hash Tables - problems with large datasets

I am pretty sure hashes only exist for the duration of the DATA step that creates them, and saving one out would mean creating a SAS dataset from it, which I don't think is your intent (but maybe it is).
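
For what it's worth, if you did want to dump a hash out, the hash object has an OUTPUT method for exactly that.  A rough sketch, with names borrowed from the code above:

data _null_;
     if 0 then set myLargeDataset;  /* defines the host variables in the PDV */
     declare hash a(dataset:'myLargeDataset');
     a.defineKey('equipmentsernum');
     a.defineData('equipmentsernum', 'Date');
     a.defineDone();
     rc = a.output(dataset:'work.saved_hash');  /* writes the hash contents to a dataset */
     stop;
run;

But that just gets you back to an ordinary dataset, which would have to be reloaded into a hash in any later step.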

Is the data static? Or is it continuously updated? If it's being updated, then maybe a filtered view through PROC SQL might be better.

The approach you take may also be dependent on whether this is a one time task or if it will be repeated on some interval. Brute force method might be fine for a one time thing, but for repeated tasks a more efficient process is probably desired.

EJ

Contributor
Posts: 50

Re: Hash Tables - problems with large datasets

EJ - the data is static, and we just need to generate summaries to create smaller datasets that we can work with.  As for saving the hash table, my only thought was that if I am going to loop through by id, subset, and then run PROC SUMMARY, I would either need to save the hash table or regenerate it for each loop.  Regenerating it for each loop seems inefficient.  But again, that assumes my approach is valid :).

FG

Super Contributor
Posts: 334

Re: Hash Tables - problems with large datasets

I think I'm catching on ... so you are using the hash somewhat like an index in order to subset the large dataset.  I think at this point my hash knowledge has been exhausted.

I think if I were trying to do this, I might start with a 20% random sample of the large dataset, stratified by equip id and date (or whatever the summary groupings are).  That should give you a dataset small enough to do the summaries on without PROC SUMMARY taking a day to run.  Of course, you would have to pass through the data again to draw the sample.

I might be leading you down the wrong path, so I will wait to see if others respond.

EJ

Super User
Posts: 5,498

Re: Hash Tables - problems with large datasets

You can't save a hash table.  And there are likely to be other approaches.  First, a few preliminary questions ...

How many unique values for equipsernum?  (order of magnitude would do)

How many for date?

Do you need to show every date for every equipsernum, or only the dates that actually exist in the data?

I suspect you will end up with a SQL step to extract a table with the equipsernum values:

proc sql noprint;
     create table sernums as
          select distinct equipsernum from MyLargeDataset;
quit;

That would make it easy to loop through using CALL EXECUTE ... generating a separate PROC SUMMARY with CLASS DATE for each EQUIPSERNUM.  If that turns out to be viable, I can sketch out more of the code.  Of course, "viable" doesn't mean "fast".  So let's start with the questions above.
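
A rough sketch of that loop (the sensor variable names are made up, and it assumes EQUIPSERNUM is character):

data _null_;
     set sernums;
     call execute('proc summary data=MyLargeDataset nway;');
     call execute('   where equipsernum = "' !! strip(equipsernum) !! '";');
     call execute('   class date;');
     call execute('   var sensor1 sensor2;  /* list your sensor variables here */');
     call execute('   output out=stats_' !! strip(equipsernum) !! ' n= min= max= mean= / autoname;');
     call execute('run;');
run;

Each observation in SERNUMS queues up one PROC SUMMARY, and they all run after the DATA step finishes.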

Good luck.

Contributor
Posts: 50

Re: Hash Tables - problems with large datasets

Posted in reply to Astounding

Thanks, Astounding (that sounds strange :),

We did think about doing PROC SQL, and we did try a test subset using PROC SQL for one specific equipsernum; that query alone took about 30 hours (a bit over a day).  There are almost 800 unique ids in equipsernum, which means it would take over 2 years to subset the entire dataset.  I was hoping for something a bit speedier, LOL.  The sensor data is stored at 5-minute intervals, which means there are 288 observations per day.

Thanks,


FG

Super User
Posts: 5,498

Re: Hash Tables - problems with large datasets

With only 800 equipsernums, you should be able to summarize directly with one pass through the data ...

proc summary data=MyLargeDataset nway;
     class equipsernum date;
     var ...;
     output out=summary.stats (drop=_type_ _freq_) ...;
run;

There is a way to specify the AUTONAME option that escapes me, but you should be able to use it so you don't have to spell out the full list of statistics for each variable.
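
If it helps, AUTONAME goes after a slash at the end of the OUTPUT statement; something like this (sensor variable names made up):

proc summary data=MyLargeDataset nway;
     class equipsernum date;
     var sensor1 sensor2;
     output out=summary_stats (drop=_type_ _freq_)
          n= min= max= mean= std= var= skew= kurt= / autoname;
run;

That produces names like sensor1_N and sensor1_Min automatically, so you don't have to spell out each output variable.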

You can run out of memory if you have too many equipsernum/date combinations, but we didn't get into how many date values are in the data.  Memory usage is unrelated to the number of observations; it depends only on the number of equipsernum/date combinations.

Contributor
Posts: 50

Re: Hash Tables - problems with large datasets

Posted in reply to Astounding

Astounding:

I did mention the date values above (readings every 5 minutes, so 288 observations per day).  So the summary would have to collapse 288 observations for each day for each equipmentsernum.

We did try running the following :

proc summary data=MyLargeDataset nway;
     class equipmentsernum Date flag;
     var _numeric_;
     output out=SummaryDataset (drop=_type_ _freq_)
          sum= max= min= median= mean= std= kurt= skew= n= / autoname;
run;

We ran out of memory after 30 hrs.  By the way, flag is either 1 or 0 (1 = full speed, 0 = partial speed).  Would it be better to do a BY equipmentsernum Date flag instead of NWAY?

FG

Super User
Posts: 5,498

Re: Hash Tables - problems with large datasets

OK, I guess I was imagining there might be more than 1 day in the data.

Yes, you can definitely switch to a BY statement if the data are sorted.  And that would solve the memory problems.  So it all depends on what the sorted order of the data is.  If it's in order by all three variables, you can just use a BY statement instead of a CLASS statement.  But if it's only in order by one variable (say, by DATE), you can use a combination:

by date;

class equipmentsernum flag;

Sorting this amount of data doesn't seem realistic, however.  You would have to rely on it already being in sorted order.
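
In that case the step might look like this (a sketch; sensor variable names made up, and it assumes the data really are stored in DATE order):

proc summary data=MyLargeDataset nway;
     by date;
     class equipmentsernum flag;
     var sensor1 sensor2;
     output out=summary_stats (drop=_type_ _freq_) n= min= max= mean= / autoname;
run;

With BY DATE, memory only has to hold the class combinations for one date at a time.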

Super User
Posts: 7,039

Re: Hash Tables - problems with large datasets

So is your "DATE" variable really a datetime variable?  If so, you might be able to use it as a class variable by applying a format:

proc summary data=MyLargeDataset nway;
     class equipmentsernum Date flag;
     format date dtdate.;

Contributor
Posts: 50

Re: Hash Tables - problems with large datasets

Tom,

In this case, the DATE variable is just that, MMDDYYYY.  It was created as a subset of a date/time stamp for this reason.  And running the PROC SUMMARY above, we ran out of memory.  We do have more memory on order, but have to wait on purchase order approvals, then go through sourcing, purchasing, etc.  So I was hoping to find a quicker way to subset the data.  If we can subset the data quickly into tables by equipmentsernum, then SAS can handle that file size without too much trouble.

Thanks,

FG

Super User
Posts: 19,770

Re: Hash Tables - problems with large datasets

Can you run a PROC CONTENTS on those three fields in the CLASS statement and post the results here?

You can write a macro to subset the data by equipmentsernum and loop through it if you wanted to, but you'd need to be able to run a PROC FREQ on the dataset first.


Can you run the following without issue?

proc freq data=have noprint;
     tables equipmentsernum / out=equiplist;
run;



Contributor
Posts: 50

Re: Hash Tables - problems with large datasets

Reeza,

I am running proc freq as you suggested now.  I imagine it will take a while to get through the data (if it gets through all of it).  Will report back when something happens.

FG

Super User
Posts: 5,498

Re: Hash Tables - problems with large datasets

A simple fix if you have neither sorted data, nor enough memory, is to subset the data.  For example, run your PROC SUMMARY as is, but run it twice:

where flag=1;

where flag=0;

You'll need half the memory.  If that is still consuming too much memory, run it 10 times:

where equipmentsernum =: '1';

where equipmentsernum =: '2';

etc.

Sure, it will take a while.  But any solution will take a while.
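
If you go that route, a small macro can generate the ten runs (a sketch; sensor variable names made up, and it assumes EQUIPMENTSERNUM is character):

%macro split_summary;
     %do d = 0 %to 9;
          proc summary data=MyLargeDataset nway;
               where equipmentsernum =: "&d";
               class equipmentsernum date flag;
               var sensor1 sensor2;
               output out=stats_&d (drop=_type_ _freq_) n= min= max= mean= / autoname;
          run;
     %end;
%mend split_summary;
%split_summary

Then append the ten stats_ datasets at the end.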

Discussion stats
  • 39 replies
  • 1554 views
  • 6 likes
  • 13 in conversation