Ashok3395
Fluorite | Level 6

Hi,

I'm trying to calculate percentiles (0, .25, .5, .75, 1) for 30 million records.

I have tried PROC MEANS and PROC UNIVARIATE, but I'm getting the errors below.

 

 

PROC MEANS:

WARNING: A shortage of memory has caused the quantile computations to terminate prematurely for QMETHOD=OS. Consider using QMETHOD=P2.
NOTE: The affected statistics will be missing from the corresponding classification levels.
WARNING: A shortage of memory has caused the quantile computations to terminate prematurely for QMETHOD=OS. Consider using QMETHOD=P2.

 

 

 

 

PROC UNIVARIATE:

ERROR: The SAS System stopped processing this step because of insufficient memory.
WARNING: The data set SASPL2.ALL_PERCENTILE may be incomplete. When this step was stopped there were 0 observations and 31 variables.
WARNING: Data set SASPL2.ALL_PERCENTILE was not replaced because this step was stopped.

 

Please let me know if there is another way to calculate these percentiles.

 

Regards,

Ashok Arunachalam

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hi @Ashok3395,

 

Another powerful alternative (for computing exact quartiles) is PROC HPBIN. For 30 million random numbers V1 (standard normal distribution) the memory usage of the step below was less than 1% of what PROC SUMMARY used. On top of that it was more than twice as fast on my machine.

ods select none;
ods output quantile=qtl;
proc hpbin data=test computequantile;
  input v1;
run;
ods select all;

Note, however, that PROC HPBIN uses the percentile definition corresponding to PCTLDEF=3 in PROC MEANS/PROC SUMMARY and PROC UNIVARIATE, whereas the default is PCTLDEF=5.


@Ashok3395 wrote:

WARNING: The data set SASPL2.ALL_PERCENTILE may be incomplete.  When this step was stopped there were 0 observations and 31 variables.


I suspect that you requested quartiles for several variables and possibly also used CLASS variables.

 

In this case you could cut down memory usage by restricting the computations to one analysis variable at a time and avoiding CLASS variables (use BY or WHERE statements instead if possible).

 

Further, note that multi-threading (in PROC MEANS/PROC SUMMARY) requires more memory, see NOTHREADS option.


5 REPLIES
Astounding
PROC Star

It's good that you showed the log messages, but it would also help to show your code.  Without the code, here is some general advice.

 

Have you considered the suggestion in the log, switching to the P2 method instead of OS? Basically, when the number of observations is uneven, there are different ways to compute percentiles, and some methods require more memory than others. With 30M observations, it is unlikely that switching to a different method of computing percentiles would meaningfully change the results.

 

Also note that the percentiles you are looking for match up with statistics that are easily specified:

 

min, Q1, median, Q3, max

 

Using those statistic names may reduce the resource requirements.
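As a sketch of this suggestion (the data set and variable names HAVE and X are placeholders, not from the original post), the five requested percentiles map directly onto named statistics in the PROC MEANS OUTPUT statement:

```sas
/* Sketch: MIN/Q1/MEDIAN/Q3/MAX cover the 0, .25, .5, .75, 1 quantiles.
   HAVE and X are placeholder names. */
proc means data=have noprint;
  var x;
  output out=pctls min=p0 q1=p25 median=p50 q3=p75 max=p100;
run;
```

Requesting only these named statistics avoids asking the procedure for a general percentile list.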

mkeintz
PROC Star

I think there are multiple ways to address this.

 

Brute force - Increase memory.

See what your default memory allocation is:

proc options option=memsize;
run;

My system reports 2 gigabytes: MEMSIZE=2147483648


Then, depending on your SAS environment, start SAS with a larger memory request. For instance, on my Windows system, I type the following in the start box:

SAS -memsize 4G

which grabs 4 gigabytes of memory. Or if I type

SAS -memsize 0

it takes the zero as a signal to take as much memory as is available (about 57 GB on my machine).

 

 

Give up some precision to reduce memory requirements.
Say your data (variable X in data set HAVE) is recorded to the 3rd decimal place, but all you really need is the first decimal place. Then you could do:
   data need / view=need;
     set have;
     x=round(x,0.1);
   run;
   proc univariate data=need .....;

This would significantly reduce the number of bins for tabulation, and presumably the needed memory.

 

 

Keep precision, but build final frequencies from partial frequencies.

Here's an example using the variable close from data set sashelp.stocks (which has 699 observations):

 

proc freq data=sashelp.stocks (firstobs=1  obs=200) noprint;
  tables close / out=need1;
run;
proc freq data=sashelp.stocks (firstobs=201  obs=400) noprint;
  tables close / out=need2;
run;
proc freq data=sashelp.stocks (firstobs=401 ) noprint;
  tables close / out=need3;
run;

data pctiles;
  do p=0,.25,.50,.75,1;
    output;
  end;
run;

data stats (keep=p close);
  if 0 then set sashelp.stocks nobs=nrecs;

  set pctiles;
  retain cum_recs .;

  do while (cum_recs< p*nrecs);
    set need: ;
    by close;
    cum_recs+count;
  end;
run;

 

Data set NEED1 has frequencies for CLOSE covering the first 200 observations of sashelp.stocks.  NEED1 has the original variable CLOSE, and the new variables COUNT and PERCENT representing the frequency (and percent, which should be ignored) of that value of close.  Also, importantly, NEED1 is sorted by CLOSE.  NEED2 has obs 201-400, and NEED3 has obs 401 and up.

 

The dataset PCTILES specifies the desired percentiles. 

 

The DATA STATS step interleaves, in sorted order, the values and counts from NEED1 through NEED3, tracks the cumulative counts, and finds those percentiles.  Note the statement
   set NEED: ;

has a colon at the end of the data set name.  This tells SAS to read all data sets whose name begins with the characters NEED.

 

Also, the DATA STATS step has "if 0 then set sashelp.stocks nobs=nrecs;". This doesn't actually read data from sashelp.stocks (IF 0 is never true), but it does tell the SAS compiler to get the number of observations in sashelp.stocks and put that number in variable NRECS.

 

And this program assumes there are no missing values for variable CLOSE.  If there are, you can modify the DATA STATS step to accommodate.

 

The overall presumption here is that you will need less memory for the tabulations of parts of the data set.  For instance, you might try dividing your 30 million into 3 sets of 10 million, or 6 sets of 5 million.

PGStats
Opal | Level 21

Your best bet is to use PROC MEANS and ask for QMETHOD=P2. The P2 method requires a fixed amount of memory regardless of the number of observations. It is approximate, but is usually quite accurate for quartiles.
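A minimal sketch of that call (HAVE and X are placeholder names, not from the original post):

```sas
/* Sketch: QMETHOD=P2 estimates quantiles in fixed memory.
   HAVE and X are placeholder names. */
proc means data=have qmethod=p2 min p25 median p75 max;
  var x;
run;
```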

PG

DWilson
Pyrite | Level 9

If you don't have ties and aren't generating the quantiles for many variables (so that you aren't spending lots of time/memory sorting), you could sort your data and identify the records where your quantiles occur simply by taking the records at the 0th, 25th, 50th, 75th, and 100th percentile positions in the sorted order.

 

e.g. if N=30,000,000 and there are no ties, then you can use the following to pick out the quantiles (SORTED is a placeholder name for the data set after sorting by the analysis variable):

data quantiles;
  set sorted;  /* sorted by the analysis variable */
  if _N_=1 or _N_=30000000 or _N_=ceil(.25*30000000)
     or _N_=ceil(.50*30000000) or _N_=ceil(.75*30000000);
run;

 

If you have ties, the issue is much more complicated and I'd go with the HPBIN approach or one of the other recommendations.

 

Also, creating quantiles by levels of other variables complicates things: you'd need to determine the number of observations in each class, which adds time to the process. You could do that with some BY processing, which shouldn't be too time intensive: a sort followed by two data steps, where the first uses BY processing to figure out the number of observations in each class and the second selects the observations. Some bookkeeping would be needed to know the _N_ values delineating the beginning and end of each set of class records.
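A rough sketch of that two-step idea, under the assumption that HAVE is already sorted by CLASSVAR and the analysis variable (all names here are placeholders):

```sas
/* Step 1: count the observations in each class */
data counts(keep=classvar n);
  set have;
  by classvar;
  if first.classvar then n=0;
  n+1;                         /* sum statement: N is retained across rows */
  if last.classvar then output;
run;

/* Step 2: within each class, keep the rows at the quantile positions */
data quantiles;
  merge have counts;
  by classvar;
  if first.classvar then i=0;
  i+1;                         /* row number within the class */
  if i=1 or i=n
     or i=ceil(.25*n) or i=ceil(.50*n) or i=ceil(.75*n);
run;
```

The MERGE by CLASSVAR carries each class's count N onto every row of that class, so the subsetting IF can compute the within-class quantile positions directly.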

 

 


Discussion stats
  • 5 replies
  • 5477 views
  • 18 likes
  • 6 in conversation