<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Hash Tables - problems with large datasets in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77506#M16785</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;SPDE needs physically separate volumes for the "bins" to improve I/O. Running it on a single disk array will not improve anything, as the array already spreads the workload. Using SPDE with a quadcore and 4 arrays (preferably on 4 separate PCIe buses) will improve things. But that needs a real server and not a Windows toybox.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Tue, 16 Jun 2015 06:02:01 GMT</pubDate>
    <dc:creator>Kurt_Bremser</dc:creator>
    <dc:date>2015-06-16T06:02:01Z</dc:date>
    <item>
      <title>Hash Tables - problems with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77468#M16747</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Does anyone have links to a good beginner's tutorial on hash tables? I have been googling and reading a lot, but the papers I have found so far are narrow in scope yet vague on fundamentals, so I find it difficult to understand the overall structure. I have a dataset with 530 million observations and 250+ columns of sensor data (~3 TB). The powers that be want stats summaries on ALL of the columns (n, min, max, mean, stddev, skew, kurtosis, var) by equipment id by date. Being new to SAS, I did a lot of research, and it appears that hash tables would be the best approach, but several aspects of the programming are not clear to me.&lt;/P&gt;&lt;P&gt;My initial approach (and please direct me if there is a better approach) is to use the hash table to subset the data by id (or id/date) and then run proc summary on the subset. I tried running the hash subset and ran out of memory (Win7, 8 GB memory).&lt;/P&gt;&lt;PRE&gt;data hash_results;
    set myLargeDataset;
    if (_n_ eq 1) then
        do;
            declare hash a(dataset:'myLargeDataset');
            a.defineKey('equipmentsernum', 'Date');
            a.defineData(all:'y');
            a.defineDone();
        end;

    equipmentsernum = '296737';
    if (a.find() eq 0);
run;&lt;/PRE&gt;&lt;P&gt;This code works on a subset of myLargeDataset, but on the big set it quickly runs out of memory. Some things I haven't figured out with hash tables are:&lt;/P&gt;&lt;P&gt;1) Can I save the resulting hash table to re-use outside of the data step?&lt;/P&gt;&lt;P&gt;2) Can I write a macro to loop through the hash? My thought was to use the hash table to subset myLargeDataset into a smaller table of just one serial number or id, then call proc summary to get stats for that unit, then loop through to the next serial number, etc.&lt;/P&gt;&lt;P&gt;Any hash tutorials or pointers would be greatly appreciated.&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Fred&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 16 May 2013 17:31:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77468#M16747</guid>
      <dc:creator>FredGIII</dc:creator>
      <dc:date>2013-05-16T17:31:24Z</dc:date>
    </item>
    <item>
      <title>Re: Hash Tables - problems with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77469#M16748</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;OK, before someone slaps me for doing something stupid: I realized that defineData(all:'y') with that large a dataset was crazy. So I have removed it and am currently running the following against the 3 TB dataset, just to see if I can create the hash table at all. But my questions still hold: is there a better approach? Can I save the hash table? Can I loop through the hash table to subset the data? Links to hash table tutorials?&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;F:&lt;/P&gt;&lt;PRE&gt;data hash_results;
    set myLargeDataset;
    if (_n_ eq 1) then
        do;
            declare hash a(dataset:'myLargeDataset', multidata:'y');
            a.defineKey('equipmentsernum');
            a.defineData('equipmentsernum', 'Date');
            a.defineDone();
        end;
run;&lt;/PRE&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 16 May 2013 18:08:49 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77469#M16748</guid>
      <dc:creator>FredGIII</dc:creator>
      <dc:date>2013-05-16T18:08:49Z</dc:date>
    </item>
    <item>
      <title>Re: Hash Tables - problems with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77470#M16749</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;I am still learning hash objects as well, so I will be interested in other comments, but my understanding is that they have to fit in memory or they cannot be used. That is where the efficiency is gained: results bypass being written to disk.&lt;/P&gt;&lt;P&gt;I have to be honest and say I haven't worked with files of the TB nature, and without knowing a lot about the file structure, there might be ways to reduce the on-disk size to something more manageable, such as reviewing variable lengths for wasted space (i.e., character fields that are longer than needed, or numeric fields where full precision is not needed, so a length of less than 8 could be used). But even tricks like that may not reduce it enough, and they would mean running through all the data to make the adjustments, in which case you might as well run proc summary over the dataset and save the results to a dataset that you can report from.&lt;/P&gt;&lt;P&gt;Alternatively, you could add an index to the file, but that will also take time and additional disk resources to store.&lt;/P&gt;&lt;P&gt;Very interesting question ... interested to hear other responses.&lt;/P&gt;&lt;P&gt;EJ&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 16 May 2013 18:19:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77470#M16749</guid>
      <dc:creator>esjackso</dc:creator>
      <dc:date>2013-05-16T18:19:19Z</dc:date>
    </item>
    <item>
      <title>Re: Hash Tables - problems with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77471#M16750</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;I am pretty sure hashes are only available during the SAS session; saving one out would mean creating a SAS dataset from it, which I don't think is your intent (but maybe it is).&lt;/P&gt;&lt;P&gt;Is the data static, or is it continuously updated? If it is being updated, then maybe a filtered view through proc SQL might be better.&lt;/P&gt;&lt;P&gt;The approach you take may also depend on whether this is a one-time task or one that will be repeated on some interval. A brute-force method might be fine for a one-time thing, but for repeated tasks a more efficient process is probably desired.&lt;/P&gt;&lt;P&gt;EJ&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 16 May 2013 18:26:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77471#M16750</guid>
      <dc:creator>esjackso</dc:creator>
      <dc:date>2013-05-16T18:26:23Z</dc:date>
    </item>
    <item>
      <title>Re: Hash Tables - problems with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77472#M16751</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;EJ - the data is static, and we just need to generate summaries to create smaller datasets that we can work with. As for saving the hash table, my only thought was that if I am going to loop through by id, subset, and then proc summary, I would either need to save the hash table or regenerate it for each loop. Regenerating it for each loop seems inefficient. But again, that is assuming my approach is valid &lt;img id="smileyhappy" class="emoticon emoticon-smileyhappy" src="https://communities.sas.com/i/smilies/16x16_smiley-happy.png" alt="Smiley Happy" title="Smiley Happy" /&gt;.&lt;/P&gt;&lt;P&gt;FG&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 16 May 2013 18:32:52 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77472#M16751</guid>
      <dc:creator>FredGIII</dc:creator>
      <dc:date>2013-05-16T18:32:52Z</dc:date>
    </item>
    <item>
      <title>Re: Hash Tables - problems with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77473#M16752</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;I think I'm catching on ... so you are using the hash somewhat like an index in order to subset the large dataset. I think at this point my hash knowledge has been exhausted.&lt;/P&gt;&lt;P&gt;If I were trying to do this, I might start with a 20% random sample of the large dataset, stratified by equipment id and date (or whatever the summary groupings are). That should give you a dataset small enough to do the summaries on without taking a day for proc summary to run. Of course, you would have to run through the data again to draw the sample.&lt;/P&gt;&lt;P&gt;I might be leading you down the wrong path, so I will wait to see if others respond.&lt;/P&gt;&lt;P&gt;EJ&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 16 May 2013 18:50:44 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77473#M16752</guid>
      <dc:creator>esjackso</dc:creator>
      <dc:date>2013-05-16T18:50:44Z</dc:date>
    </item>
    <item>
      <title>Re: Hash Tables - problems with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77474#M16753</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;You can't save a hash table. And there are likely to be other approaches. First, a few preliminary questions ...&lt;/P&gt;&lt;P&gt;How many unique values for equipsernum? (order of magnitude would do)&lt;/P&gt;&lt;P&gt;How many for date?&lt;/P&gt;&lt;P&gt;Do you need to show every date for every equipsernum, or only the dates that actually exist in the data?&lt;/P&gt;&lt;P&gt;I suspect you will end up with a SQL step to extract a table of the equipsernum values:&lt;/P&gt;&lt;PRE&gt;proc sql noprint;
   create table sernums as select distinct equipsernum from MyLargeDataset;
quit;&lt;/PRE&gt;&lt;P&gt;That would make it easy to loop through using CALL EXECUTE ... generating a separate PROC SUMMARY with CLASS DATE for each EQUIPSERNUM. If that turns out to be viable, I can sketch out more of the code. Of course, "viable" doesn't mean "fast". So let's start with the questions above.&lt;/P&gt;&lt;P&gt;Good luck.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 16 May 2013 19:17:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77474#M16753</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2013-05-16T19:17:34Z</dc:date>
    </item>
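The CALL EXECUTE idea above is a code-generation pattern: pull the distinct key values once, then emit one summarization step per key. A minimal, hypothetical sketch in Python (the serial numbers and the generated PROC SUMMARY text are illustrative, not from the thread):

```python
# Code-generation sketch: build one summary step per distinct key value,
# mirroring the CALL EXECUTE loop described above.
sernums = ["296737", "296738"]  # stand-in for the distinct-equipsernum table

def gen_summary_step(sernum):
    # Build the text of one per-key summarization step.
    return (
        "proc summary data=MyLargeDataset(where=(equipsernum='" + sernum + "')) nway;\n"
        "   class date;\n"
        "   var _numeric_;\n"
        "   output out=stats_" + sernum + " / autoname;\n"
        "run;"
    )

program = "\n".join(gen_summary_step(s) for s in sernums)
```

Each generated step filters to a single equipsernum, so no step needs the whole table in memory; the cost is one scan of the data per key, which is why this is "viable" rather than "fast".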
    <item>
      <title>Re: Hash Tables - problems with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77475#M16754</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Thanks, Astounding (that sounds strange &lt;img id="smileyhappy" class="emoticon emoticon-smileyhappy" src="https://communities.sas.com/i/smilies/16x16_smiley-happy.png" alt="Smiley Happy" title="Smiley Happy" /&gt;),&lt;/P&gt;&lt;P&gt;We did think about PROC SQL, and we tried a test subset using proc sql for one specific equipsernum; that query alone took about 30 hrs (a bit over 1 day). There are almost 800 unique ids in equipsernum, which means it would take over 2 years to subset the entire dataset. I was hoping for something a bit speedier LOL. The sensor data is stored at 5-minute intervals, which means there are 288 observations per day.&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;FG&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 16 May 2013 20:37:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77475#M16754</guid>
      <dc:creator>FredGIII</dc:creator>
      <dc:date>2013-05-16T20:37:17Z</dc:date>
    </item>
    <item>
      <title>Re: Hash Tables - problems with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77476#M16755</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;With only 800 equipsernums, you should be able to summarize directly with one pass through the data ...&lt;/P&gt;&lt;PRE&gt;proc summary data=MyLargeDataset nway;
   class equipsernum date;
   var ...;
   output out=summary.stats (drop=_type_ _freq_) ...;
run;&lt;/PRE&gt;&lt;P&gt;There is a way to specify the AUTONAME option that escapes me at the moment, but you should be able to use it so you don't have to spell out the full list of statistics for each variable.&lt;/P&gt;&lt;P&gt;You can run out of memory if there are too many equipsernum/date combinations, but we didn't get into how many date values are in the data. Memory usage would be unrelated to the number of observations, only to the number of equipsernum/date combinations.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 16 May 2013 20:48:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77476#M16755</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2013-05-16T20:48:19Z</dc:date>
    </item>
    <item>
      <title>Re: Hash Tables - problems with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77477#M16756</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Astounding:&lt;/P&gt;&lt;P&gt;I did mention the date values above (every 5 min, so 288 observations or date values per day). So the summary would have to cover the 288 obs for each day for each equipmentsernum.&lt;/P&gt;&lt;P&gt;We did try running the following:&lt;/P&gt;&lt;PRE&gt;proc summary data=MyLargeDataset nway;
    class equipmentsernum Date flag;
    var _numeric_;
    output out=SummaryDataset (drop=_type_ _freq_)
        sum=
        max=
        min=
        median=
        mean=
        std=
        kurt=
        skew=
        n=
        / autoname
        ;
run;&lt;/PRE&gt;&lt;P&gt;We ran out of memory after 30 hrs. By the way, the flag is either 1 or 0 (1 = full speed, 0 = partial speed). Would it be better to do a BY equipmentsernum Date flag instead of NWAY?&lt;/P&gt;&lt;P&gt;FG&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 16 May 2013 21:07:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77477#M16756</guid>
      <dc:creator>FredGIII</dc:creator>
      <dc:date>2013-05-16T21:07:39Z</dc:date>
    </item>
    <item>
      <title>Re: Hash Tables - problems with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77478#M16757</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;OK, I guess I was imagining there might be more than 1 day in the data.&lt;/P&gt;&lt;P&gt;Yes, you can definitely switch to a BY statement if the data are sorted, and that would solve the memory problems. So it all depends on what the sorted order of the data is. If it's in order by all three variables, you can just use a BY statement instead of a CLASS statement. But if it's only in order by one variable (say by DATE), you can use a combination:&lt;/P&gt;&lt;PRE&gt;by date;
class equipmentsernum flag;&lt;/PRE&gt;&lt;P&gt;Sorting this amount of data doesn't seem realistic, however. You would have to rely on it already being in sorted order.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 16 May 2013 21:26:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77478#M16757</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2013-05-16T21:26:23Z</dc:date>
    </item>
    <item>
      <title>Re: Hash Tables - problems with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77479#M16758</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;A hash table should be used as output in this case, not as input, since the data will never fit in memory.&lt;/P&gt;&lt;P&gt;So you need to read the table sequentially and store the summary values in a hash table as you go.&lt;/P&gt;&lt;P&gt;1- First pass: derive and store in the hash table things like sum, min, max, n, nmiss, etc.&lt;/P&gt;&lt;P&gt;2- Then process and update the hash table to derive things like mean and percentages.&lt;/P&gt;&lt;P&gt;3- A second pass, similar to the first, is used to derive std and var.&lt;/P&gt;&lt;P&gt;4- Output the hash table.&lt;/P&gt;&lt;P&gt;Your hash table will have as many rows as there are classification groups, and as many columns as the number of _NUM_ variables times the number of stats.&lt;/P&gt;&lt;P&gt;In my experience, this runs slower than proc summary if you need 2-pass stats like std, but with a smaller memory footprint. I have never worked on such a large dataset, though.&lt;/P&gt;&lt;P&gt;My 2 cents.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 16 May 2013 21:51:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77479#M16758</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2013-05-16T21:51:19Z</dc:date>
    </item>
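The accumulate-as-you-read scheme described in the post above can be sketched outside SAS. Below is a minimal Python analogue, not the SAS hash object API: a dict plays the hash table's role, and Welford's online update is swapped in so that std falls out of a single pass instead of requiring the second pass the post mentions. Field names such as `equipmentsernum` and `temp` are illustrative, not from the original data.

```python
from math import sqrt

def stream_stats(rows, key_fields, value_field):
    """One sequential pass over rows, accumulating per-group stats in a
    dict (the role the SAS hash object plays): n, min, max, mean, and
    variance via Welford's online update."""
    acc = {}  # key tuple -> [n, mean, M2, min, max]
    for row in rows:
        key = tuple(row[k] for k in key_fields)
        x = row[value_field]
        if key not in acc:
            acc[key] = [0, 0.0, 0.0, x, x]
        a = acc[key]
        a[0] += 1
        delta = x - a[1]
        a[1] += delta / a[0]
        a[2] += delta * (x - a[1])  # running sum of squared deviations
        a[3] = min(a[3], x)
        a[4] = max(a[4], x)
    return {k: {"n": n, "mean": mean, "min": lo, "max": hi,
                "std": sqrt(m2 / (n - 1)) if n > 1 else 0.0}
            for k, (n, mean, m2, lo, hi) in acc.items()}

# Hypothetical sample rows standing in for the sensor data.
rows = [
    {"equipmentsernum": "A", "date": "2013-05-16", "temp": 10.0},
    {"equipmentsernum": "A", "date": "2013-05-16", "temp": 14.0},
    {"equipmentsernum": "B", "date": "2013-05-16", "temp": 7.0},
]
out = stream_stats(rows, ["equipmentsernum", "date"], "temp")
```

As in the post, memory scales with the number of classification groups, not with the 530 million input rows.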
    <item>
      <title>Re: Hash Tables - problems with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77480#M16759</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;So is your "DATE" variable really a datetime variable?&amp;nbsp; If so you might be able to use it as a class variable by using a format.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; background-color: #ffffff;"&gt;&lt;STRONG style="font-style: inherit; font-family: 'Courier New'; color: navy;"&gt;proc&lt;/STRONG&gt; &lt;STRONG style="font-style: inherit; font-family: 'Courier New'; color: navy;"&gt;summary&lt;/STRONG&gt;&amp;nbsp; &lt;SPAN style="font-style: inherit; font-family: 'Courier New'; color: blue;"&gt;data&lt;/SPAN&gt;&lt;SPAN style="font-style: inherit; font-family: 'Courier New'; color: black;"&gt;=MyLargeDataset &lt;SPAN style="font-style: inherit; color: blue;"&gt;nway&lt;/SPAN&gt;&lt;SPAN style="font-style: inherit;"&gt;;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; background-color: #ffffff;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="font-style: inherit; font-family: 'Courier New'; color: blue;"&gt;class&lt;/SPAN&gt;&lt;SPAN style="font-style: inherit; font-family: 'Courier New'; color: black;"&gt; equipmentsernum Date flag;&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="font-family: 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; background-color: #ffffff;"&gt;&lt;SPAN style="font-style: inherit; font-family: 'Courier New'; color: black;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; format date dtdate.;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 16 May 2013 21:52:04 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77480#M16759</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2013-05-16T21:52:04Z</dc:date>
    </item>
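The point of applying the dtdate. format in the post above is that grouping on a datetime only needs its date part. A tiny Python analogue of that bucketing (the timestamps are made up for illustration):

```python
from datetime import datetime

# Collapse datetime stamps to calendar dates for grouping, the role
# the SAS dtdate. format plays when DATE is used as a CLASS variable.
stamps = [
    datetime(2013, 5, 16, 9, 30),
    datetime(2013, 5, 16, 17, 45),
    datetime(2013, 5, 17, 8, 0),
]
by_date = {}
for ts in stamps:
    by_date.setdefault(ts.date(), []).append(ts)
```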
    <item>
      <title>Re: Hash Tables - problems with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77481#M16760</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Tom,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;In this case, the DATE variable is just that, MMDDYYYY.&amp;nbsp; It was created as a subset of a date/time stamp for this reason. And running the proc summary above, we ran out of memory.&amp;nbsp; We do have more memory on order, but have to wait on purchase order approvals, sourcing, purchasing, etc.&amp;nbsp; So I was hoping to find a quicker way to subset the data.&amp;nbsp; If I can subset the data quickly into tables by equipmentsernum, then SAS can handle that file size without too much trouble.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;FG&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 16 May 2013 22:15:27 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77481#M16760</guid>
      <dc:creator>FredGIII</dc:creator>
      <dc:date>2013-05-16T22:15:27Z</dc:date>
    </item>
    <item>
      <title>Re: Hash Tables - problems with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77482#M16761</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Can you run a proc contents on those three fields in the class statement and place them here?&lt;/P&gt;&lt;P&gt;You can write a macro to subset the data by &lt;SPAN style="font-style: inherit; font-family: 'Courier New'; color: black;"&gt;equipmentsernum&amp;nbsp; and loop through it if you wanted to, but you'd have to be able to run a proc freq on the dataset first. &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-style: inherit; font-family: 'Courier New'; color: black;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-style: inherit; font-family: 'Courier New'; color: black;"&gt;Can you run the following without issue?&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-style: inherit; font-family: 'Courier New'; color: black;"&gt;proc freq data=have noprint;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-style: inherit; font-family: 'Courier New'; color: black;"&gt;table &lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN style="font-style: inherit; font-family: 'Courier New'; color: black;"&gt;equipmentsernum&amp;nbsp; /out=equiplist;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-style: inherit; font-family: 'Courier New'; color: black;"&gt;run;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-style: inherit; font-family: 'Courier New'; color: black;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN style="font-style: inherit; font-family: 'Courier New'; color: black;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 16 May 2013 22:21:37 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77482#M16761</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2013-05-16T22:21:37Z</dc:date>
    </item>
    <item>
      <title>Re: Hash Tables - problems with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77483#M16762</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;A simple fix if you have neither sorted data, nor enough memory, is to subset the data.&amp;nbsp; For example, run your PROC SUMMARY as is, but run it twice:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;where flag=1;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;where flag=0;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;You'll need half the memory.&amp;nbsp; If that is still consuming too much memory, run it 10 times:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;where equipmentsernum =: '1';&lt;/P&gt;&lt;P&gt;where equipmentsernum =: '2';&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;etc.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Sure, it will take a while.&amp;nbsp; But any solution will take a while.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 16 May 2013 23:49:30 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77483#M16762</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2013-05-16T23:49:30Z</dc:date>
    </item>
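The run-it-twice WHERE trick from the post above can be mimicked in miniature. In this hedged Python sketch, `summarize` is a toy stand-in for PROC SUMMARY and the row values are invented: each subset is summarized on its own, so only that subset's class groups occupy memory at once, and because the flag is part of the group key the partial results combine without collisions.

```python
def summarize(rows, key_fields, value_field):
    """Toy stand-in for PROC SUMMARY: per-group n / sum / min / max."""
    out = {}
    for row in rows:
        key = tuple(row[k] for k in key_fields)
        x = row[value_field]
        if key not in out:
            out[key] = {"n": 0, "sum": 0.0, "min": x, "max": x}
        g = out[key]
        g["n"] += 1
        g["sum"] += x
        g["min"] = min(g["min"], x)
        g["max"] = max(g["max"], x)
    return out

# Hypothetical sample rows.
rows = [
    {"equipmentsernum": "101", "flag": 1, "temp": 5.0},
    {"equipmentsernum": "101", "flag": 0, "temp": 9.0},
    {"equipmentsernum": "205", "flag": 1, "temp": 3.0},
]

# One summarization per WHERE-style subset (where flag=1; where flag=0),
# so only one subset's groups are held in memory at a time; the group
# keys are disjoint across subsets, so update() cannot overwrite anything.
combined = {}
for flag_value in (1, 0):
    subset = [r for r in rows if r["flag"] == flag_value]
    combined.update(summarize(subset, ["equipmentsernum", "flag"], "temp"))
```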
    <item>
      <title>Re: Hash Tables - problems with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77484#M16763</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Ok, a little late for the party, but FWIW this is what I would do if you input data is at least sorted by equip id:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: green; background: white;"&gt;/*To get your 250 variable names for Hash use*/&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&lt;SPAN style="color: navy; background: white; font-size: 10.0pt; font-family: 'Courier New';"&gt;&lt;STRONG&gt;proc&lt;/STRONG&gt;&lt;/SPAN&gt; &lt;SPAN style="color: navy; background: white; font-size: 10.0pt; font-family: 'Courier New';"&gt;&lt;STRONG&gt;sql&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt;;&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&amp;nbsp; &lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: blue; background: white;"&gt;select&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt; quote(cats(name)) &lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: blue; background: white;"&gt;into&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt; :qname separated &lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: blue; background: white;"&gt;by&lt;/SPAN&gt; &lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: purple; background: white;"&gt;','&lt;/SPAN&gt; &lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: blue; background: white;"&gt;from&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt; dictionary.columns &lt;/SPAN&gt;&lt;SPAN style="font-size: 
10.0pt; font-family: 'Courier New'; color: blue; background: white;"&gt;where&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt; LIBNAME=&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: purple; background: white;"&gt;'YOURLIBNAME'&lt;/SPAN&gt; &lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: blue; background: white;"&gt;AND&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt; MEMNAME=&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: purple; background: white;"&gt;'MYLARGEDATASET'&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt;;quit;&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: green; background: white;"&gt;/*dynamic output dataset by equipmentsernum*/&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&lt;SPAN style="color: navy; background: white; font-size: 10.0pt; font-family: 'Courier New';"&gt;&lt;STRONG&gt;data&lt;/STRONG&gt;&lt;/SPAN&gt; &lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: blue; background: white;"&gt;_null_&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt;;&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&amp;nbsp; &lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: blue; background: white;"&gt;declare&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt; hash h(multidata:&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: purple; background: 
white;"&gt;'y'&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt;);&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt;&amp;nbsp; h.definekey(&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: purple; background: white;"&gt;'equipmentsernum'&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt;);&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt;&amp;nbsp; h.definedata(&amp;amp;q&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: teal; background: white;"&gt;name.&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt;);&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt;&amp;nbsp; h.definedone();&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&amp;nbsp; &lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: blue; background: white;"&gt;do&lt;/SPAN&gt; &lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: blue; background: white;"&gt;until&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt; (last.equipmentsernum);&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: blue; background: white;"&gt;set&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt; MYLARGEDATASET;&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="margin-bottom: 
0.0001pt;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: blue; background: white;"&gt;by&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt; equipmentsernum;&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; rc=h.add();&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&amp;nbsp; &lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: blue; background: white;"&gt;end&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt;;&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt;&amp;nbsp; rc=h.output(dataset:&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: purple; background: white;"&gt;'out'&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt;||&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: purple; background: white;"&gt;'_'&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt;||equipmentsernum);&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&amp;nbsp; &lt;SPAN style="color: navy; background: white; font-size: 10.0pt; font-family: 'Courier New';"&gt;&lt;STRONG&gt;run&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt;;&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt;&lt;BR 
/&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt;After subset, you probably want to use Macro to loop through Proc mean, you can still use &amp;amp;name for downstream Macro processing.&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt;HTH&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P style="margin-bottom: 0.0001pt;"&gt;&lt;SPAN style="font-size: 10.0pt; font-family: 'Courier New'; color: black; background: white;"&gt;Haikuo&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 17 May 2013 02:34:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77484#M16763</guid>
      <dc:creator>Haikuo</dc:creator>
      <dc:date>2013-05-17T02:34:09Z</dc:date>
    </item>
    <item>
      <title>Re: Hash Tables - problems with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77485#M16764</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;If your dataset is sorted beforehand, it would be possible to use a hash table, clearing it after you get each group's MEAN value.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Ksharp&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Message was edited by: xia keshan&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 17 May 2013 03:24:25 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77485#M16764</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2013-05-17T03:24:25Z</dc:date>
    </item>
    <item>
      <title>Re: Hash Tables - problems with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77486#M16765</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Reeza,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am running proc freq as you suggested now.&amp;nbsp; I imagine it will take a while to get through the data (if it gets through all of it).&amp;nbsp; Will report back when something happens.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;FG&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 17 May 2013 14:03:55 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77486#M16765</guid>
      <dc:creator>FredGIII</dc:creator>
      <dc:date>2013-05-17T14:03:55Z</dc:date>
    </item>
    <item>
      <title>Re: Hash Tables - problems with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77487#M16766</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Did you try the summary with a small subset of variables in the VAR statement instead of _numeric_? I would be tempted to see if doing batches of 10 or so variables would run without exhausting memory, and possibly within a reasonable time frame.&amp;nbsp; Then merge the resulting summary datasets.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 17 May 2013 16:07:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-Tables-problems-with-large-datasets/m-p/77487#M16766</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2013-05-17T16:07:19Z</dc:date>
    </item>
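The batching idea in the post above, summarize a handful of variables per pass and then merge the per-batch results, can be sketched as follows. This is a toy Python model, not SAS; the variable names `v1`..`v3` and the stats kept are hypothetical, and each batch corresponds to one PROC SUMMARY run over the full data with a short VAR list.

```python
def batch_summaries(rows, key_fields, var_names, batch_size=2):
    """Summarize a few variables at a time (one pass over the data per
    batch, mimicking repeated PROC SUMMARY runs with short VAR lists),
    then merge the per-batch results into one table keyed by group."""
    merged = {}
    for start in range(0, len(var_names), batch_size):
        batch = var_names[start:start + batch_size]
        for row in rows:
            key = tuple(row[k] for k in key_fields)
            stats = merged.setdefault(key, {})
            for v in batch:
                x = row[v]
                s = stats.setdefault(v, {"n": 0, "min": x, "max": x})
                s["n"] += 1
                s["min"] = min(s["min"], x)
                s["max"] = max(s["max"], x)
    return merged

# Hypothetical sample rows: one group, three numeric variables,
# processed in batches of two variables.
rows = [
    {"id": "A", "v1": 1.0, "v2": 5.0, "v3": 2.0},
    {"id": "A", "v1": 3.0, "v2": 4.0, "v3": 8.0},
]
res = batch_summaries(rows, ["id"], ["v1", "v2", "v3"], batch_size=2)
```

Peak memory per pass is proportional to groups times the batch size, not groups times all 250+ columns, at the cost of reading the data once per batch.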
  </channel>
</rss>

