ballardw
Super User

Did you try the summary with a small subset of variables on the VAR statement instead of _NUMERIC_? I would be tempted to see if batches of 10 or so variables would run without exhausting memory, and possibly within a reasonable time frame. Then merge the resulting summary datasets.

FredGIII
Quartz | Level 8

ballardw:

If I understand you correctly, you are suggesting that I run PROC SUMMARY on a subset of the 290 variables in the table - for example, just 5 sensors as shown below - but break the 290 variables up into groups of 10 or 20?

proc summary data=MyLargeDataset nway;
     class equipmentsernum Date flag;
     var sensor1 sensor2 sensor3 sensor4 sensor5;
     output out=SummaryDataset (drop=_type_ _freq_)
          sum= max= min= median= mean= std= kurt= skew= n= / autoname;
run;

ballardw
Super User

Exactly. Run it first with a small number of sensors to see whether memory use, and possibly run time, decreases. I would also add data=MyLargeDataset (keep=equipmentsernum Date flag sensor1-sensor5) to restrict memory use further.

Create SummaryDataSet1, SummaryDataSet2, ..., SummaryDataSetn with different sets of sensors and merge them.

Just a thought.

You may also want to look at the MEMSIZE and SUMSIZE options. If the proc needs more memory than MEMSIZE allows, there can be a lot of disk access. If you have actual memory the OS isn't using, increasing MEMSIZE may help.
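A minimal sketch of one batch of this idea, assuming sensor variables named sensor1-sensor290 as in the examples above. Note that MEMSIZE can only be set at SAS invocation (command line or config file, e.g. sas -memsize 6G), while SUMSIZE can be set in-session:

options sumsize=4G;

proc summary data=MyLargeDataset (keep=equipmentsernum Date flag sensor1-sensor10) nway;
     class equipmentsernum Date flag;
     var sensor1-sensor10;
     output out=Summary_01 (drop=_type_ _freq_)
          sum= max= min= median= mean= std= kurt= skew= n= / autoname;
run;

/* repeat for sensor11-sensor20, etc., then combine the batches;
   NWAY output is already sorted by the class variables */
data AllStats;
     merge Summary_01 Summary_02 /* ... */;
     by equipmentsernum Date flag;
run;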

FredGIII
Quartz | Level 8

Before I try summarizing, I wrote a short program to try to split the dataset into individual datasets, one per equipmentsernum. Here is the program - please feel free to critique and offer suggestions on better programming:

%macro SplitFile;

     /* get list of unique serial numbers */
     proc sql noprint;
          select equipmentsernum into :sernum separated by " "
          from unique_serial_number;
     quit;

     /* second list of serial numbers prefixed with sn_ for valid SAS dataset names */
     proc sql noprint;
          select "sn"||"_"||equipmentsernum into :tag separated by " "
          from unique_serial_number;
     quit;

     %let i = 1;
     %do %while(%scan(&sernum, &i) ne );
          data %scan(&tag, &i);
               set MyLargeDataset;
               where equipmentsernum = "%scan(&sernum, &i)";
          run;
          %let i = %eval(&i + 1);
     %end;

%mend SplitFile;

The program works, but I let it run all weekend and in 58 hours it had only just finished one serial number. Would it be beneficial to use hash tables in conjunction with this macro to speed up the file split? Or do you still think summarizing the data with a subset of variables would be faster? Other ideas?

Thanks


FG

esjackso
Quartz | Level 8

I am still learning hash objects as well, so I will be interested in other comments, but my understanding is that they have to fit in memory or they cannot be used. That is basically where the efficiency gains come from: they bypass writing intermediate results to disk.

I have to be honest and say I haven't worked with files of the TB scale, and without knowing much about the file structure, there might be ways to reduce the on-disk size to something more manageable, such as checking the variable lengths for wasted space (i.e., character fields that are longer than needed, or numeric fields where full precision is not required so a length of less than 8 could be used). But even tricks like that may not shrink it enough, and they would mean running through all the data to adjust the lengths - in which case you might as well run PROC SUMMARY through the dataset and save the results to a dataset you can report from.
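For what it's worth, a hedged sketch of that length-reduction idea (all lengths here are illustrative; note that a numeric length of 4 stores integers exactly only up to 2,097,152, so it is unsafe for full-precision sensor readings):

data slim;
     /* shrink an over-wide character key and the numeric sensors */
     length equipmentsernum $ 12 sensor1-sensor290 4;
     set MyLargeDataset;
run;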

Alternatively, you could add an index to the file, but that will also take time, plus additional disk resources to store.
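For reference, a simple index on the split key could be added with PROC DATASETS (the library name here is an assumption):

proc datasets library=mylib nolist;
     modify MyLargeDataset;
     index create equipmentsernum;
quit;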

Very interesting question ... interested to hear other responses.

EJ

ChrisNZ
Tourmaline | Level 20

A hash table should be used as output in this case, not as input, since the data will never fit in memory.

So you need to read the table sequentially and store the summary values in a hash table as you go.

1- First pass: derive and store in the hash table things like sum, min, max, n, nmiss, etc.

2- Then process and update the hash table to derive things like means and percentages.

3- The second pass is similar to the steps above and is used to derive std and var.

4- Output the hash table.

Your hash table will have as many rows as there are classification groups, and as many columns as the number of _NUMERIC_ variables times the number of statistics.

In my experience, this runs slower than proc summary if you need two-pass statistics like std, but with a smaller memory footprint. I have never worked on such a large dataset, though.
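A minimal sketch of the first pass, for a single sensor (dataset and variable names are from the thread; the real job would repeat the data items, or use arrays, for all 290 sensors):

data _null_;
     if _n_ = 1 then do;
          declare hash h(ordered:'a');
          h.definekey('equipmentsernum','Date','flag');
          h.definedata('equipmentsernum','Date','flag','n1','sum1','min1','max1');
          h.definedone();
          call missing(n1, sum1, min1, max1);
     end;
     set MyLargeDataset (keep=equipmentsernum Date flag sensor1) end=eof;
     if h.find() ne 0 then do;   /* first row of this class group */
          n1 = 0; sum1 = 0; min1 = .; max1 = .;
     end;
     if not missing(sensor1) then do;
          n1   = n1 + 1;
          sum1 = sum1 + sensor1;
          min1 = min(min1, sensor1);   /* MIN/MAX ignore missing arguments */
          max1 = max(max1, sensor1);
     end;
     rc = h.replace();                       /* store the updated group */
     if eof then rc = h.output(dataset:'FirstPass');  /* one row per group */
run;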

My 2 cents.

Haikuo
Onyx | Level 15

Ok, a little late to the party, but FWIW this is what I would do, provided your input data is at least sorted by equipment ID:

/* To get your 290 variable names for hash use */

proc sql noprint;
     select quote(cats(name)) into :qname separated by ','
     from dictionary.columns
     where libname = 'YOURLIBNAME' and memname = 'MYLARGEDATASET';
quit;

/* dynamic output dataset by equipmentsernum */

data _null_;
     /* a fresh hash instance is declared for each BY group */
     declare hash h(multidata:'y');
     h.definekey('equipmentsernum');
     h.definedata(&qname.);
     h.definedone();

     /* load one whole BY group into the hash */
     do until (last.equipmentsernum);
          set MYLARGEDATASET;
          by equipmentsernum;
          rc = h.add();
     end;

     /* write the group out; cats() avoids trailing blanks in the name */
     rc = h.output(dataset: cats('out_', equipmentsernum));
     rc = h.delete();   /* free this group's memory before the next one */
run;


After the subset, you probably want to use a macro to loop through Proc Means over the per-serial datasets; you can still use &qname. for downstream macro processing.



HTH

Haikuo

FredGIII
Quartz | Level 8

Haikuo;

I tried your approach and got an "insufficient memory to execute data step program" error. I liked your approach and learned a lot from your code, so I was bummed when I got the message. I suspect I may have to go back to the data source and see if they can supply the data in segmented form. We are installing a server-based SAS system, and I could probably handle the file once that is done, but timing-wise that may be a bit late. So I am still trying to figure out a way to split it in a reasonable time frame.

Thanks,

FG

Ksharp
Super User

If your dataset is already sorted, it would be possible to use a hash table, clearing it after you get each group's MEAN value.

Ksharp


GadyKotler
Calcite | Level 5

In order to solve a problem of this size in a reasonable time, you'll need to use proc means (as you have been doing); hashing with only 8 GB of RAM will run out of memory immediately.

Try to increase the memory available to the SAS session (the MEMSIZE and SUMSIZE invocation options). Try the SPDE engine with disk-array storage.

Another approach is to split the 3 TB data set into many small data sets by the values of the class variables equipment and date, and then use proc means to get the statistics for each of the small data sets.

You'll need some SAS macro knowledge for this "trick", plus a table of all possible combinations of equipment and date in your 3 TB data set. Then in one data step you can split the file into smaller parts, execute proc means for each part, and at the end of the job append the results back into one data set.
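A hedged sketch of that one-pass split, shown by equipmentsernum only (the unique_serial_number table comes from earlier in the thread; everything else is illustrative). The idea is to generate both the member list and an IF/ELSE dispatch from the lookup table, then read the big file once:

proc sql noprint;
     select cats('sn_', equipmentsernum)
          into :dslist separated by ' '
     from unique_serial_number;
     select cats('if equipmentsernum="', equipmentsernum, '" then output sn_', equipmentsernum, ';')
          into :dispatch separated by ' else '
     from unique_serial_number;
quit;

data &dslist;
     set MyLargeDataset;
     &dispatch;
run;

Caveat: with thousands of serial numbers the generated code can exceed the 64K macro-variable limit, in which case writing the generated DATA step to a file and %INCLUDEing it is the sturdier route.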

Gady Kotler

www.bixforsas.com

FredGIII
Quartz | Level 8

HI Gady,

It appears to me that the default for MEMSIZE is 0, which allocates the maximum memory available, so it is not clear to me what benefit changing MEMSIZE would have. I am not familiar with SPDE, but I have to assume it is a separate product, and I do not believe we have a license for it - so not an option.

As for subsetting the data into smaller datasets, I did try that, but it took 3 days to subset one serial number. With the number of unique serial numbers, at that rate it would take over 6 years to complete the file splits... talk about job security LOL. Fortunately, it appears I can get a drive sent to me that contains separate files by serial number. I'll have to import those as separate datasets and then use a macro to loop through the individual datasets, running proc means on each one separately. At this point, I don't see another realistic approach, but if you have other ideas, I would love to try them.
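A hypothetical sketch of that loop, assuming the imported per-serial datasets are named sn_1 through sn_&n (the names, the count, and the sensor name prefix are all assumptions):

%macro SummarizeAll(n);
     %do i = 1 %to &n;
          proc summary data=sn_&i nway;
               class Date flag;
               id equipmentsernum;   /* carry the serial number through */
               var sensor:;          /* assumes a common name prefix */
               output out=stats_&i (drop=_type_ _freq_)
                    n= mean= std= / autoname;
          run;
          proc append base=AllStats data=stats_&i force;
          run;
     %end;
%mend SummarizeAll;

%SummarizeAll(200)   /* illustrative count of serial numbers */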

Thanks,

FG

GadyKotler
Calcite | Level 5

Hi ,

SPDE is included with Base SAS and does not require a separate license.

I don't see why scanning your 3 TB once should take 3 days. I have scanned a 1.3-billion-row by 70-variable SAS data set, stored as an SPDE data set on a Linux box with disk-array storage, in about 25 minutes.
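For reference, a minimal SPDE setup looks like this (paths and partition size are illustrative): metadata goes in one directory, and the data partitions are spread across others so they can be read and written with multiple threads:

libname speedy spde 'D:\spde\meta'
     datapath=('E:\spde\data1' 'F:\spde\data2')
     partsize=10g;

data speedy.MyLargeDataset;
     set MyLargeDataset;   /* one-time copy into the SPDE library */
run;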

Gady

FredGIII
Quartz | Level 8

Gady,

Thanks for the info. We'll look into SPDE and figure out how to set it up; that sounds promising. If you have links to specific documents explaining SPDE, that would be helpful. Otherwise, it will be Google searching I go. :)

AhmedAl_Attar
Rhodochrosite | Level 12

Fred,

You can find information about SPDE here http://support.sas.com/rnd/scalability/spde/index.html

Check out the Setup link http://support.sas.com/rnd/scalability/spde/setup.html

From reading your original message, I'm guessing you are running on a Windows 7 (64-bit?) desktop with a dual/quad-core CPU, 8 GB of memory, and a single hard drive. Please correct me if my conclusion is wrong.

SPDE is best suited when you have Multiple CPUs & Hard Drives, because it's all about threading and I/O.

Back to your performance issues. I don't recall which OS you are running your SAS jobs on. Having MEMSIZE set to 0 is not always a good option! You are better off specifying a reasonable number (x MB/GB) depending on your available memory, so you can avoid swapping that would negatively affect your OS and server.

Using CLASS in your Proc Summary/Means will try to load all the unique combinations of the specified variables into memory (a bad idea here).

Using BY in your Proc Summary/Means will instead utilize the sorted order/indexes you have in place, as in the sketch below.
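A sketch of that BY variant, assuming the data are sorted (or indexed) by these variables; unlike CLASS, only one group's statistics are held in memory at a time:

proc summary data=MyLargeDataset (keep=equipmentsernum Date flag sensor1-sensor10);
     by equipmentsernum Date flag;
     var sensor1-sensor10;
     output out=Summary_by (drop=_type_ _freq_)
          n= mean= std= / autoname;
run;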

If all else fails, I'm guessing your only option is a divide-and-conquer approach: split the data into small chunks, summarize, and append. Once all chunks have been summarized and appended, re-summarize for your grand-total figures.

Hope this helps,

Ahmed

FredGIII
Quartz | Level 8

Ahmed,

Thanks for the reply. Your assumptions are correct - I am running SAS on Win7 64-bit, quad core, 8 GB RAM, and a 24 TB NAS drive. Based on your comments, it sounds like SPDE might not benefit me. I am in the process of indexing the dataset, and I am hoping that will help speed up the subsetting process. In the meantime, I am having a hard drive shipped that contains the individual files for each turbine. We are also in the process of installing a server-based SAS system, but that is probably a month from completion. The good news: it's a learning experience. :)

FG

