BookmarkSubscribeRSS Feed
FredGIII
Quartz | Level 8

Does anyone have links to a good beginning tutorial on Hash tables?  I have been googling and reading a lot, but I find that the papers, so far, are fairly specific and vague enough that I find it difficult to understand the overall structure.  I have a dataset with 530 million observations and 250+ columns of sensor data (~ 3 TB).  The powers that be want stats summaries on ALL of the columns (n, min, max, mean, stddev, skew, kurtosis, var) by equipment id by date.  Being new to SAS, I did a lot of research and it appears that hash tables would be the best approach but there are several aspects to the programming that are not clear to me.

My initial approach (and please direct me if there is a better approach) is to use the hash tables to subset the data by id (or id/date) and then proc summary on the subset.  I tried running the hash subset and ran out of memory (Win7 8GB memory).

data hash_results;

     set myLargeDataset;

     if (_n_ eq 1) then

          do;

               declare hash a(dataset:'myLargeDataset');

               a.defineKey('equipmentsernum', 'Date');

               a.defineData(all:'y');

               a.defineDone();

          end;

     equipmentsernum = '296737';

     if(a.find() eq 0);

run;

This code works on a subset of myLargeDataset, but on the big set, it quickly runs out of memory.  Some things I haven't figured out with hash tables are:

1) Can I save the resulting hash table to re-use outside of the data step?

2) Can I write a macro to loop through the hash? My thought was to use the hash table to subset myLargeDataset into a smaller table of just one serial number or id, the call proc summary to get stats for that unit, then loop through the next serial number...etc.

Any hash tutorials or pointers would be greatly appreciated.


Regards,


Fred

39 REPLIES 39
FredGIII
Quartz | Level 8

Ok before someone slaps me for doing something stupid, I realized that defineData(all:'y') with that large of a dataset was crazy.  So I have removed that and am currently running the following against the 3TB dataset just to see if I can create the hash table.  But my questions still hold - is there a better approach? Can I save the hashtable? Looping through hash table to subset the data?  Links to hash table tutorials?

Thanks,


F:

data hash_results;

     set myLargeDataset;

     if (_n_ eq 1) then

          do;

               declare hash a(dataset:'myLargeDataset', multidata:'y');

               a.defineKey('equipmentsernum');

               a.defineData('equipmentsernum', 'Date');

               a.defineDone();

          end;

run;

esjackso
Quartz | Level 8

I am pretty sure hashes are only available during the sas session and to save them out would mean to create a sas dataset from them which I dont think is your intent (but maybe it is).

Is the data static? Or is it continuously updated? If its being updated then maybe a filter view through proc SQL might be better.

The approach you take may also be dependent on whether this is a one time task or if it will be repeated on some interval. Brute force method might be fine for a one time thing, but for repeated tasks a more efficient process is probably desired.

EJ

FredGIII
Quartz | Level 8

EJ - the data is static and we just need to generate summaries to create smaller datasets that we can work with.  As for saving the hash table, my only thought was that if I am going to loop through by id and subset then proc summary, that I would either need to save the hash table or regenerate it for each loop. It seems that regenerating it for each loop would be inefficient.  But again, that is assuming my approach is valid Smiley Happy.

FG

esjackso
Quartz | Level 8

I think im catching on ... so you are using hash somewhat like an index in order to subset the large dataset. I think at this point my hash knowledge has been exhausted.

I think if I was trying to do this I might start with a 20% random sampling of the large dataset stratified by equip id and date (or whatever the summary groupings are). That should give you a small enough dataset to do the summaries on but not take a day for proc summary to run. Of course you would have to run through the data again to set the sample.

I might be leading down the wrong path so I will wait to see if others respond.

EJ

Astounding
PROC Star

You can't save a hash table.  And are likely to be other approaches.  First, a few preliminary questions ...

How many unique values for equipsernum?  (order of magnitude would do)

How many for date?

Do you need to show every date for every equipsernum, or only the dates that actually exist in the data?

I suspect you will end up with a SQL step to extract a table with the equipsernum values:

proc sql noprint;

   create table sernums as select distinct equipsernum from MyLargeDataset;

quit;

That would make it easy to loop through using CALL EXECUTE ... generate a separate PROC SUMMARY, CLASS DATE for each EQUIPSERNUM.  So if that turns out to be viable I can sketch out more of the code.  Of course "viable" doesn't mean "fast".  So let's start with the questions above.

Good luck.

FredGIII
Quartz | Level 8

Thanks Astounding (that sounds strange Smiley Happy),

We did think about doing PROC SQL, and we did try a test doing a subset using proc sql for one specific equipsernum, that query alone took about 30 hrs (a bit over 1 day).  There are almost 800 unique ids in equipsernum, which means it would take over 2 years to subset the entire dataset..  I was hoping for something a bit speedier LOL.The sensor data is stored at 5 min intervals which means there are 288 observations per day.

Thanks,


FG

Astounding
PROC Star

With only 800 equipsernums, you should be able to summarize directly with one pass through the data ...

proc summary data=MyLargeDataset nway;

   class equipsernum date;

   var ...;

   output out=summary.stats (drop=_type_ _freq_)   ...;

run;

There is a way to specify the AUTONAME option that escapes me, but you should be able to use it so you don't have to spell out the full list of statistics for each variable.

You can run out of memory if you have too many equipsernum date combinations, but we didn't get into how many date values are in the data.  Memory usage would be unrelated to the number of observations ... only related to the number of equipsernum date combinations.

FredGIII
Quartz | Level 8

Astounding:

I did mention the date values above (every 5 min, so 288 observations or date values per day).  So the summary function would have to summarize the 288 obs for each day for each equipmentsernum.

We did try running the following :

proc summary  data=MyLargeDataset nway;

     class equipmentsernum Date flag;

     Var _numeric_;

     output out = SummaryDataset (drop = _type_ _freq_)
          Sum =

          max =

          min =

          median =

          mean =

          std =

          Kurt =

          Skew =

          n =
          /
autoname

          ;

run;

We ran out of memory after 30 hrs.  By the way, the flag is either 1 or 0 (1= full speed, 0= partial speed).  Would it be better to do a By equipmentsernum, Date, flag instead of NWAY? 

FG

Astounding
PROC Star

OK, I guess I was imagining there might be more than 1 day in the data.

Yes, you can definitely switch to a BY statement if the data are sorted.  And that would solve the memory problems.  So it all depends on what the sorted order to the data is.  If it's in order by all three variables, you can just use a BY statement instead of a CLASS statement.  But if it's only in order by one variable (say by DATE), you can use a combination:

by date;

class equipmentsernum flag;

Sorting this amount of data doesn't seem realistic, however.  You would have to rely on it already being in sorted order.

Tom
Super User Tom
Super User

So is your "DATE" variable really a datetime variable?  If so you might be able to use it as a class variable by using a format.

proc summary  data=MyLargeDataset nway;

     class equipmentsernum Date flag;

    format date dtdate.;

FredGIII
Quartz | Level 8

Tom,

In this case, the DATE variable is just that, MMDDYYYY.  It was created as a subset of a date/time stamp for this reason. And running the proc summary above - we ran out of memory.  We do have more memory on order, but have to wait on purchase order approvals, go to sourcing, purchasing..etc  So I was hoping to find a quicker way to subset the data.  If can subset the data quickly into tables by equipmentsernum then SAS can handle that file size without too much problem.

Thanks,

FG

Reeza
Super User

Can you run a proc contents on those three fields in the class statement and place them here?

You can write a macro to subset the data by equipmentsernum  and loop through it if you wanted to, but you'd have to be able to run a proc freq on the dataset first.


Can you run the following without issue?

proc freq data=have noprint;

table
equipmentsernum  /out=equiplist;

run;



FredGIII
Quartz | Level 8

Reeza,

I am running proc freq as you suggested now.  I imagine it will take a while to get through the data (if it gets through all of it).  Will report back when something happens.

FG

Astounding
PROC Star

A simple fix if you have neither sorted data, nor enough memory, is to subset the data.  For example, run your PROC SUMMARY as is, but run it twice:

where flag=1;

where flag=0;

You'll need half the memory.  If that is still consuming too much memory, run it 10 times:

where equipmentsernum =: '1';

where equipmentsernum =: '2';

etc.

Sure, it will take a while.  But any solution will take a while.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 39 replies
  • 4684 views
  • 6 likes
  • 13 in conversation