<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Topic: Hash component object experiments in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Hash-component-object-experiments/m-p/76126#M16435</link>
    <description>Topic thread from the SAS Programming board: sorting a large dataset with the hash component object fails with insufficient memory; replies cover hash memory limits and parallel-sort alternatives.</description>
    <pubDate>Mon, 19 Oct 2009 10:48:30 GMT</pubDate>
    <dc:creator>DanielSantos</dc:creator>
    <dc:date>2009-10-19T10:48:30Z</dc:date>
    <item>
      <title>Hash component object experiments</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-component-object-experiments/m-p/76124#M16433</link>
      <description>Hello!&lt;BR /&gt;
&lt;BR /&gt;
I am trying to sort a large dataset using a hash object, but I am getting a fatal error:&lt;BR /&gt;
&lt;BR /&gt;
170   data dssxx; set dss0906 (obs=20000000); run;&lt;BR /&gt;
&lt;BR /&gt;
NOTE: There were 20000000 observations read from the data set WORK.DSS0906.&lt;BR /&gt;
NOTE: The data set WORK.DSSXX has 20000000 observations and 8 variables.&lt;BR /&gt;
NOTE: Compressing data set WORK.DSSXX decreased size by 27.93 percent.&lt;BR /&gt;
      Compressed is 228803 pages; un-compressed would require 317461 pages.&lt;BR /&gt;
NOTE: DATA statement used (Total process time):&lt;BR /&gt;
      real time           3:04.45&lt;BR /&gt;
      cpu time            1:19.01&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
171   data dss;&lt;BR /&gt;
172    if 0 then set dssxx;&lt;BR /&gt;
173   dcl hash hh (dataset: 'work.dssxx', hashexp: 0, ordered: 'd');&lt;BR /&gt;
174   dcl hiter hi ('hh');&lt;BR /&gt;
175   hh.definekey ('ss_kod', 'site', 'sbal_kod', 'data' );&lt;BR /&gt;
176   hh.definedata ('site', 'sbal_kod', 'ss_kod', 'data', 'ss_ostd', 'ss_ostc');&lt;BR /&gt;
177   hh.definedone();&lt;BR /&gt;
178   do rc=hi.first() by 0 while(rc=0);&lt;BR /&gt;
179   ost=ss_ostd-ss_ostc;&lt;BR /&gt;
180   output;&lt;BR /&gt;
181   rc=hi.next();&lt;BR /&gt;
182   end;&lt;BR /&gt;
183   drop rc ss_ostd ss_ostc ss_obd ss_obc; rename data=date;&lt;BR /&gt;
184   stop;&lt;BR /&gt;
185   run;&lt;BR /&gt;
&lt;BR /&gt;
FATAL: Insufficient memory to execute data step program. Aborted during the EXECUTION phase.&lt;BR /&gt;
NOTE: The SAS System stopped processing this step because of insufficient memory.&lt;BR /&gt;
WARNING: The data set WORK.DSS may be incomplete.  When this step was stopped there were 0 observations and 5 variables.&lt;BR /&gt;
WARNING: Data set WORK.DSS was not replaced because this step was stopped.&lt;BR /&gt;
NOTE: DATA statement used (Total process time):&lt;BR /&gt;
      real time           41.31 seconds&lt;BR /&gt;
      cpu time            40.59 seconds&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
A smaller subset sorts fine with the same code:&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
186   data dssxx; set dss0906 ; where site ne 'MSK'; run;&lt;BR /&gt;
&lt;BR /&gt;
NOTE: There were 4118570 observations read from the data set WORK.DSS0906.&lt;BR /&gt;
      WHERE site not = 'MSK';&lt;BR /&gt;
NOTE: The data set WORK.DSSXX has 4118570 observations and 8 variables.&lt;BR /&gt;
NOTE: Compressing data set WORK.DSSXX decreased size by 29.04 percent.&lt;BR /&gt;
      Compressed is 46390 pages; un-compressed would require 65375 pages.&lt;BR /&gt;
NOTE: DATA statement used (Total process time):&lt;BR /&gt;
      real time           6:48.00&lt;BR /&gt;
      cpu time            1:08.98&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
187   data dss;&lt;BR /&gt;
188    if 0 then set dssxx;&lt;BR /&gt;
189   dcl hash hh (dataset: 'work.dssxx', hashexp: 0, ordered: 'd');&lt;BR /&gt;
190   dcl hiter hi ('hh');&lt;BR /&gt;
191   hh.definekey ('ss_kod', 'site', 'sbal_kod', 'data' );&lt;BR /&gt;
192   hh.definedata ('site', 'sbal_kod', 'ss_kod', 'data', 'ss_ostd', 'ss_ostc');&lt;BR /&gt;
193   hh.definedone();&lt;BR /&gt;
194   do rc=hi.first() by 0 while(rc=0);&lt;BR /&gt;
195   ost=ss_ostd-ss_ostc;&lt;BR /&gt;
196   output;&lt;BR /&gt;
197   rc=hi.next();&lt;BR /&gt;
198   end;&lt;BR /&gt;
199   drop rc ss_ostd ss_ostc ss_obd ss_obc; rename data=date;&lt;BR /&gt;
200   stop;&lt;BR /&gt;
201   run;&lt;BR /&gt;
&lt;BR /&gt;
NOTE: There were 4118570 observations read from the data set WORK.DSSXX.&lt;BR /&gt;
NOTE: The data set WORK.DSS has 4118570 observations and 5 variables.&lt;BR /&gt;
NOTE: Compressing data set WORK.DSS increased size by 7.72 percent.&lt;BR /&gt;
      Compressed is 43927 pages; un-compressed would require 40779 pages.&lt;BR /&gt;
NOTE: DATA statement used (Total process time):&lt;BR /&gt;
      real time           27.81 seconds&lt;BR /&gt;
      cpu time            18.78 seconds&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
Does this mean that hashing is not good for large datasets? Or is something wrong with my code?&lt;BR /&gt;
The machine runs Windows XP with 2.5 GB of RAM and has enough free disk space.&lt;BR /&gt;
&lt;BR /&gt;
Thanks for any thoughts.</description>
      <pubDate>Mon, 19 Oct 2009 05:17:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-component-object-experiments/m-p/76124#M16433</guid>
      <dc:creator>Oleg_L</dc:creator>
      <dc:date>2009-10-19T05:17:29Z</dc:date>
    </item>
    <item>
      <title>Re: Hash component object experiments</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-component-object-experiments/m-p/76125#M16434</link>
      <description>The hash object is, I believe, stored in RAM, so a large disk does not help.&lt;BR /&gt;
We have the same experience as you: look-up tables that are too large will exhaust memory, so we have to use SQL joins etc. for the largest look-ups.&lt;BR /&gt;
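&lt;BR /&gt;
A minimal sketch of that kind of join (the table and column names here are invented for illustration):&lt;BR /&gt;
&lt;BR /&gt;
proc sql;&lt;BR /&gt;
   /* left join keeps every row of the large table, */&lt;BR /&gt;
   /* picking up the look-up value where a key matches */&lt;BR /&gt;
   create table want as&lt;BR /&gt;
   select a.*, b.lookup_value&lt;BR /&gt;
   from big_table a left join lookup_table b&lt;BR /&gt;
   on a.key = b.key;&lt;BR /&gt;
quit;&lt;BR /&gt;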
&lt;BR /&gt;
See this note; there might be a workaround if you would like to stick with hashing:&lt;BR /&gt;
&lt;A href="http://support.sas.com/kb/16/920.html" target="_blank"&gt;http://support.sas.com/kb/16/920.html&lt;/A&gt;&lt;BR /&gt;
&lt;BR /&gt;
/Linus</description>
      <pubDate>Mon, 19 Oct 2009 10:04:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-component-object-experiments/m-p/76125#M16434</guid>
      <dc:creator>LinusH</dc:creator>
      <dc:date>2009-10-19T10:04:53Z</dc:date>
    </item>
    <item>
      <title>Re: Hash component object experiments</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-component-object-experiments/m-p/76126#M16435</link>
      <description>Yep. Besides that, the maximum number of hash buckets you get is 2^16 = 65536 (hashexp: 16), so 20,000,000 items would perform pretty poorly. I would say that, depending on the row length you are trying to store, 2 to 3 million is the maximum reasonable number of items to load into the hash. Ideally, 1.5 million.&lt;BR /&gt;
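&lt;BR /&gt;
For reference, the step posted above declared the hash with hashexp: 0; the largest table mentioned above would be requested like this:&lt;BR /&gt;
&lt;BR /&gt;
/* 2**16 = 65536 buckets, the ceiling noted above */&lt;BR /&gt;
dcl hash hh (dataset: 'work.dssxx', hashexp: 16, ordered: 'd');&lt;BR /&gt;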
&lt;BR /&gt;
Check the following exquisite paper by Paul Dorfman about hashing:&lt;BR /&gt;
&lt;A href="http://support.sas.com/resources/papers/proceedings09/153-2009.pdf" target="_blank"&gt;http://support.sas.com/resources/papers/proceedings09/153-2009.pdf&lt;/A&gt;&lt;BR /&gt;
&lt;BR /&gt;
If you are looking for a high-performance sort, check the parallel features of the SAS engine. A well-known solution is to break your large dataset up into smaller ones, sort them using parallel processing, and then reunite the results by interleaving them in a DATA step.&lt;BR /&gt;
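&lt;BR /&gt;
A rough sketch of that idea, reusing the variable names from the step above (ideally the two sorts would run in separate, parallel SAS sessions):&lt;BR /&gt;
&lt;BR /&gt;
/* 1. split the large dataset into smaller chunks */&lt;BR /&gt;
data part1 part2;&lt;BR /&gt;
   set work.dssxx;&lt;BR /&gt;
   if site = 'MSK' then output part1;&lt;BR /&gt;
   else output part2;&lt;BR /&gt;
run;&lt;BR /&gt;
&lt;BR /&gt;
/* 2. sort each chunk by the same keys */&lt;BR /&gt;
proc sort data=part1; by ss_kod site sbal_kod data; run;&lt;BR /&gt;
proc sort data=part2; by ss_kod site sbal_kod data; run;&lt;BR /&gt;
&lt;BR /&gt;
/* 3. interleave the sorted chunks into one sorted dataset */&lt;BR /&gt;
data sorted_all;&lt;BR /&gt;
   set part1 part2;&lt;BR /&gt;
   by ss_kod site sbal_kod data;&lt;BR /&gt;
run;&lt;BR /&gt;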
&lt;BR /&gt;
Cheers from Portugal.&lt;BR /&gt;
&lt;BR /&gt;
Daniel Santos @ &lt;A href="http://www.cgd.pt" target="_blank"&gt;www.cgd.pt&lt;/A&gt;</description>
      <pubDate>Mon, 19 Oct 2009 10:48:30 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-component-object-experiments/m-p/76126#M16435</guid>
      <dc:creator>DanielSantos</dc:creator>
      <dc:date>2009-10-19T10:48:30Z</dc:date>
    </item>
    <item>
      <title>Re: Hash component object experiments</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-component-object-experiments/m-p/76127#M16436</link>
      <description>Thanks a lot for the information. It's clear now.</description>
      <pubDate>Mon, 19 Oct 2009 11:13:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-component-object-experiments/m-p/76127#M16436</guid>
      <dc:creator>Oleg_L</dc:creator>
      <dc:date>2009-10-19T11:13:09Z</dc:date>
    </item>
  </channel>
</rss>