Ashok3395
Fluorite | Level 6

Hi,

I'm trying to calculate percentiles (0, .25, .5, .75, 1) for 30 million records.

I have tried PROC MEANS and PROC UNIVARIATE, but I'm getting the errors below.

 

 

PROC MEANS:

WARNING: A shortage of memory has caused the quantile computations to terminate prematurely for QMETHOD=OS. Consider using QMETHOD=P2.
NOTE: The affected statistics will be missing from the corresponding classification levels.
WARNING: A shortage of memory has caused the quantile computations to terminate prematurely for QMETHOD=OS. Consider using QMETHOD=P2.

 

 

 

 

PROC UNIVARIATE:

ERROR: The SAS System stopped processing this step because of insufficient memory.
WARNING: The data set SASPL2.ALL_PERCENTILE may be incomplete. When this step was stopped there were 0 observations and 31 variables.
WARNING: Data set SASPL2.ALL_PERCENTILE was not replaced because this step was stopped.

 

Please let me know if there is another way to calculate these percentiles.

 

Regards,

Ashok Arunachalam

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hi @Ashok3395,

 

Another powerful alternative (for computing exact quartiles) is PROC HPBIN. For 30 million random numbers V1 (standard normal distribution) the memory usage of the step below was less than 1% of what PROC SUMMARY used. On top of that it was more than twice as fast on my machine.

ods select none;
ods output quantile=qtl;
proc hpbin data=test computequantile;
  input v1;
run;
ods select all;

Note, however, that PROC HPBIN uses the percentile definition corresponding to PCTLDEF=3 in PROC MEANS/PROC SUMMARY and PROC UNIVARIATE, whereas the default is PCTLDEF=5.


@Ashok3395 wrote:

WARNING: The data set SASPL2.ALL_PERCENTILE may be incomplete.  When this step was stopped there were 0 observations and 31 variables.


I suspect that you requested quartiles for several variables and possibly also used CLASS variables.

 

In this case you could cut down memory usage by restricting the computations to one analysis variable at a time and avoiding CLASS variables (use BY or WHERE statements instead if possible).

 

Further, note that multi-threading (in PROC MEANS/PROC SUMMARY) requires more memory, see NOTHREADS option.


5 REPLIES
Astounding
PROC Star

It's good that you showed the log messages, but it would also help to show your code.  Without the code, here is some general advice.

 

Have you considered the suggestion in the log, switching to the P2 method instead of OS? Basically, when the number of observations is uneven, there are different ways to compute percentiles, and some methods require more memory than others. With 30M observations, it is unlikely that switching to a different method of computing percentiles would meaningfully change the results.

 

Also note that the percentiles you are looking for match up with statistics that are easily specified:

 

min, Q1, median, Q3, max

 

Using those statistic names may reduce the resource requirements.
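As a sketch of this suggestion (the data set and variable names HAVE and X are placeholders, not from the original post), the five requested percentiles map directly onto named statistics in the PROC MEANS OUTPUT statement:

```sas
/* Sketch: MIN/Q1/MEDIAN/Q3/MAX cover the 0, .25, .5, .75, 1 quantiles.
   HAVE and X are placeholder names. */
proc means data=have noprint;
  var x;
  output out=pctls min=p0 q1=p25 median=p50 q3=p75 max=p100;
run;
```

Requesting only these named statistics avoids asking the procedure for a general percentile list.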

mkeintz
PROC Star

I think there are multiple ways to address this.

 

Brute force - Increase memory.

See what your default memory allocation is:

proc options option=memsize;
run;

My system reports 2 gigabytes: MEMSIZE=2147483648


Then, depending on your SAS environment, start SAS with a larger memory request. For instance, on my Windows system, I type the following in the start box:

SAS -memsize 4G

which grabs 4 gigabytes of memory. Or if I type

SAS -memsize 0

it takes the zero as a signal to take as much memory as is available (about 57 GB on my machine).

 

 

Give up some precision to reduce memory requirements.
Say your data (variable X in data set HAVE) is recorded to the 3rd decimal place, but all you really need is the first decimal place. Then you could do:
   data need / view=need;
     set have;
     x=round(x,0.1);
   run;
   proc univariate data=need .....;

This would significantly reduce the number of bins for tabulation, and presumably the needed memory.

 

 

Keep precision, but build final frequencies from partial frequencies.

Here's an example using the variable close from data set sashelp.stocks (which has 699 observations):

 

proc freq data=sashelp.stocks (firstobs=1  obs=200) noprint;
  tables close / out=need1;
run;
proc freq data=sashelp.stocks (firstobs=201  obs=400) noprint;
  tables close / out=need2;
run;
proc freq data=sashelp.stocks (firstobs=401 ) noprint;
  tables close / out=need3;
run;

data pctiles;
  do p=0,.25,.50,.75,1;
    output;
  end;
run;

data stats (keep=p close);
  if 0 then set sashelp.stocks nobs=nrecs;

  set pctiles;
  retain cum_recs .;

  do while (cum_recs< p*nrecs);
    set need: ;
    by close;
    cum_recs+count;
  end;
run;

 

Data set NEED1 has frequencies for CLOSE covering the first 200 observations of sashelp.stocks.  NEED1 has the original variable CLOSE, and the new variables COUNT and PERCENT representing the frequency (and percent, which should be ignored) of that value of close.  Also, importantly, NEED1 is sorted by CLOSE.  NEED2 has obs 201-400, and NEED3 has obs 401 and up.

 

The dataset PCTILES specifies the desired percentiles. 

 

The DATA STATS step interleaves, in sorted order, the values and counts from NEED1 through NEED3, tracks the cumulative counts, and finds those percentiles.  Note the statement
   set NEED: ;

has a colon at the end of the data set name.  This tells SAS to read all data sets whose name begins with the characters NEED.

 

Also, the DATA STATS step has "if 0 then set sashelp.stocks nobs=nrecs;". This doesn't actually read data from sashelp.stocks (IF 0 is never true), but it does tell the SAS compiler to get the number of observations in sashelp.stocks and put that number in variable NRECS.

 

And this program assumes there are no missing values for variable CLOSE.  If there are, you can modify the DATA STATS step to accommodate.

 

The overall presumption here is that you will need less memory for the tabulations of parts of the data set.  For instance, you might try dividing your 30 million into 3 sets of 10 million, or 6 sets of 5 million.

PGStats
Opal | Level 21

Your best bet is to use PROC MEANS and ask for QMETHOD=P2. The P2 method requires a fixed amount of memory regardless of the number of observations. It is approximate, but is usually quite accurate for quartiles.
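A minimal sketch of that call (HAVE and X are placeholder names, not from the original post):

```sas
/* Sketch: QMETHOD=P2 estimates quantiles in fixed memory.
   HAVE and X are placeholder names. */
proc means data=have qmethod=p2 min p25 median p75 max;
  var x;
run;
```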

PG

DWilson
Pyrite | Level 9

If you don't have ties and aren't generating the quantiles for many variables (so that you aren't spending lots of time/memory sorting), you could sort your data and identify the records where your quantiles occur simply by taking the records at the 0th, 25th, 50th, 75th, and 100th percentile positions in the sorted order.

 

e.g. if N=30,000,000 and there are no ties, then you can use the following to pick out the quantiles (SORTED is a placeholder name for the data set after sorting by the analysis variable):

data quantiles;
  set sorted;  /* sorted by the analysis variable */
  if _N_=1 or _N_=30000000 or _N_=ceil(.25*30000000)
     or _N_=ceil(.50*30000000) or _N_=ceil(.75*30000000);
run;

 

If you have ties, the issue is much more complicated and I'd go with the HPBIN approach or one of the other recommendations.

 

Also, creating quantiles by levels of other variables complicates things: you'd need to determine the number of observations in each class, which adds time to the process. You could do that with some BY processing, which shouldn't be too time intensive: a sort followed by two data steps, where the first uses BY processing to figure out the number of observations in each class and the second selects the observations. Some bookkeeping would be needed to know the _N_ values delineating the beginning and end of each set of class records.
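A rough sketch of that two-step idea, under the assumption that HAVE is already sorted by CLASSVAR and the analysis variable (all names here are placeholders):

```sas
/* Step 1: count the observations in each class */
data counts(keep=classvar n);
  set have;
  by classvar;
  if first.classvar then n=0;
  n+1;                         /* sum statement: N is retained across rows */
  if last.classvar then output;
run;

/* Step 2: within each class, keep the rows at the quantile positions */
data quantiles;
  merge have counts;
  by classvar;
  if first.classvar then i=0;
  i+1;                         /* row number within the class */
  if i=1 or i=n
     or i=ceil(.25*n) or i=ceil(.50*n) or i=ceil(.75*n);
run;
```

The MERGE by CLASSVAR carries each class's count N onto every row of that class, so the subsetting IF can compute the within-class quantile positions directly.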

 

 


Discussion stats
  • 5 replies
  • 5477 views
  • 18 likes
  • 6 in conversation