<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Calculating percentile for Huge data. in SAS Procedures</title>
    <link>https://communities.sas.com/t5/SAS-Procedures/Calculating-percentile-for-Huge-data/m-p/524239#M73491</link>
    <description></description>
    <pubDate>Thu, 03 Jan 2019 11:11:19 GMT</pubDate>
    <dc:creator>FreelanceReinh</dc:creator>
    <dc:date>2019-01-03T11:11:19Z</dc:date>
    <item>
      <title>Calculating percentile for Huge data.</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Calculating-percentile-for-Huge-data/m-p/524085#M73483</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I'm trying to calculate the percentiles &lt;STRONG&gt;(0, .25, .5, .75, 1)&lt;/STRONG&gt; for a data set with 30 million observations.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I have tried PROC MEANS and PROC UNIVARIATE, but I'm getting the errors below.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;PROC MEANS:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Error&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;WARNING: A shortage of memory has caused the quantile computations to terminate prematurely for QMETHOD=OS. Consider using QMETHOD=P2.&lt;/P&gt;
&lt;P&gt;NOTE: The affected statistics will be missing from the corresponding classification levels.&lt;/P&gt;
&lt;P&gt;WARNING: A shortage of memory has caused the quantile computations to terminate prematurely for QMETHOD=OS. Consider using QMETHOD=P2.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;PROC UNIVARIATE:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;ERROR: The SAS System stopped processing this step because of insufficient memory.&lt;/P&gt;
&lt;P&gt;WARNING: The data set SASPL2.ALL_PERCENTILE may be incomplete.&amp;nbsp; When this step was stopped there were 0 observations and 31 variables.&lt;/P&gt;
&lt;P&gt;WARNING: Data set SASPL2.ALL_PERCENTILE was not replaced because this step was stopped.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Kindly help if there is any other way to calculate the percentiles.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Regards,&lt;/P&gt;
&lt;P&gt;Ashok Arunachalam&lt;/P&gt;</description>
      <pubDate>Wed, 02 Jan 2019 13:42:18 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Calculating-percentile-for-Huge-data/m-p/524085#M73483</guid>
      <dc:creator>Ashok3395</dc:creator>
      <dc:date>2019-01-02T13:42:18Z</dc:date>
    </item>
    <item>
      <title>Re: Calculating percentile for Huge data.</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Calculating-percentile-for-Huge-data/m-p/524105#M73484</link>
      <description>&lt;P&gt;It's good that you showed the log messages, but it would also help to show your code.&amp;nbsp; Without the code, here is some general advice.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Have you considered the suggestion in the log, switching to the P2 method instead of OS?&amp;nbsp; Basically, when the number of observations is uneven, there are different ways to compute percentiles, and some methods require more memory than others.&amp;nbsp; With 30M observations, it is unlikely that switching to a different method of computing percentiles would meaningfully change the results.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Also note that the percentiles you are looking for match up with statistics that are easily specified:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;min, Q1, median, Q3, max&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
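&lt;P&gt;As a minimal sketch of how the two suggestions could be combined (the original code was not posted, so the data set name HAVE, the variable X and the output names are placeholders of mine), the statistic keywords can be requested together with QMETHOD=P2:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* HAVE, X and the output names are hypothetical; substitute your own */
proc means data=have noprint qmethod=p2;
  var x;
  /* min, Q1, median, Q3, max = percentiles 0, 25, 50, 75, 100 */
  output out=qtl min=x_min q1=x_q1 median=x_median q3=x_q3 max=x_max;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;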
&lt;P&gt;Using those statistic names may reduce the resource requirements.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Jan 2019 15:01:55 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Calculating-percentile-for-Huge-data/m-p/524105#M73484</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2019-01-02T15:01:55Z</dc:date>
    </item>
    <item>
      <title>Re: Calculating percentile for Huge data.</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Calculating-percentile-for-Huge-data/m-p/524119#M73485</link>
      <description>&lt;P&gt;I think there are multiple ways to address this.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Brute force -&amp;nbsp;Increase memory.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;See what your default memory allocation is:&lt;BR /&gt;&lt;FONT face="terminal,monaco"&gt;&amp;nbsp;&amp;nbsp; proc options option=memsize;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="terminal,monaco"&gt;&amp;nbsp;&amp;nbsp; run;&lt;/FONT&gt;&lt;BR /&gt;My system reports 2 gigabytes:&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;FONT face="terminal,monaco"&gt;MEMSIZE=2147483648&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;Then, depending on your SAS environment, start SAS with a larger memory request.&amp;nbsp; For instance, on my Windows system, I type the following in the start box:&lt;BR /&gt;&lt;FONT face="terminal,monaco"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; SAS&amp;nbsp; -memsize 4G&lt;/FONT&gt;&lt;BR /&gt;which grabs 4 gigabytes of memory.&amp;nbsp; Or if I type&lt;BR /&gt;&lt;FONT face="Courier New"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; SAS&amp;nbsp; -memsize 0&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;it takes the zero as a signal to take as much memory as available (about 57GB of memory on my machine).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Give up some precision to reduce memory requirements.&lt;/STRONG&gt;&lt;BR /&gt;Say your data (variable X in data set HAVE) is&amp;nbsp;recorded to the 3rd decimal place, but all you really need is the first decimal place.&amp;nbsp; Then you could do the following:&lt;BR /&gt;&lt;FONT face="terminal,monaco"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;data need / view=need;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="terminal,monaco"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;set have;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="terminal,monaco"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;x=round(x,0.1);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="terminal,monaco"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;run;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="terminal,monaco"&gt;&amp;nbsp;&amp;nbsp; proc univariate data=need .....;&lt;/FONT&gt;&lt;BR /&gt;&lt;BR /&gt;This would significantly reduce the number of distinct values to tabulate, and presumably the memory needed.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Keep precision, but build&amp;nbsp;final frequencies from partial frequencies.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Here's an example using the variable close from data set sashelp.stocks (which has 699 observations):&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc freq data=sashelp.stocks (firstobs=1  obs=200) noprint;
  tables close / out=need1;
run;
proc freq data=sashelp.stocks (firstobs=201  obs=400) noprint;
  tables close / out=need2;
run;
proc freq data=sashelp.stocks (firstobs=401 ) noprint;
  tables close / out=need3;
run;

data pctiles;
  do p=0,.25,.50,.75,1;
    output;
  end;
run;

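data stats (keep=p close);
  /* never executed; only fetches the observation count into nrecs */
  if 0 then set sashelp.stocks nobs=nrecs;

  /* one iteration of the DATA step per requested percentile */
  set pctiles;
  /* start as missing: missing compares less than any number, */
  /* so the loop below reads at least one record even for p=0 */
  retain cum_recs .;

  /* read values in ascending order until the cumulative count reaches p*nrecs */
  do while (cum_recs&amp;lt; p*nrecs);
    set need: ;
    by close;
    cum_recs+count;  /* sum statement treats the initial missing as 0 */
  end;
  /* the implicit OUTPUT writes p and the value of close just reached */
run;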
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Data set NEED1 has frequencies for CLOSE covering the first 200 observations of sashelp.stocks.&amp;nbsp; NEED1 has the original variable CLOSE, and the new variables COUNT and PERCENT representing the frequency (and percent, which should be ignored) of that value of close.&amp;nbsp; Also, importantly,&amp;nbsp;NEED1 is sorted by CLOSE.&amp;nbsp; NEED2 has obs 201-400, and NEED3 has obs 401 and up.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The dataset PCTILES specifies the desired percentiles.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The DATA STATS step interleaves, in sorted order,&amp;nbsp;the values and counts from NEED1 through NEED3, tracks the cumulative counts, and finds those percentiles.&amp;nbsp; Note the statement&lt;BR /&gt;&lt;FONT face="terminal,monaco"&gt;&amp;nbsp;&amp;nbsp; set NEED: ;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;has a colon at the end of the data set name.&amp;nbsp; This tells SAS to read all data sets whose name begins with the characters NEED.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Also, the&amp;nbsp;DATA STATS step has "&lt;FONT face="terminal,monaco"&gt;if 0 then set sashelp.stocks nobs=nrecs;&lt;/FONT&gt;".&amp;nbsp; This doesn't actually read data from sashelp.stocks (if 0 is never true), but it does tell the SAS compiler to get the number of observations in sashelp.stocks and put that number in the variable nrecs.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;And this program assumes there are no missing values for variable CLOSE.&amp;nbsp; If there are, you can modify the DATA STATS step to accommodate.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The overall presumption here is that you will need less memory for the tabulations of parts of the data set.&amp;nbsp; For instance, you might try dividing your 30 million into 3 sets of 10 million, or 6 sets of 5 million.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Jan 2019 15:45:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Calculating-percentile-for-Huge-data/m-p/524119#M73485</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2019-01-02T15:45:29Z</dc:date>
    </item>
    <item>
      <title>Re: Calculating percentile for Huge data.</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Calculating-percentile-for-Huge-data/m-p/524186#M73487</link>
      <description>&lt;P&gt;Your best bet is to use proc means and ask for qmethod=p2. P2 method requires a &lt;U&gt;fixed&lt;/U&gt; amount of memory. It is approximate but is usually quite accurate for quartiles.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Jan 2019 22:20:52 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Calculating-percentile-for-Huge-data/m-p/524186#M73487</guid>
      <dc:creator>PGStats</dc:creator>
      <dc:date>2019-01-02T22:20:52Z</dc:date>
    </item>
    <item>
      <title>Re: Calculating percentile for Huge data.</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Calculating-percentile-for-Huge-data/m-p/524239#M73491</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/235185"&gt;@Ashok3395&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Another powerful alternative (for computing exact quartiles) is &lt;A href="https://documentation.sas.com/?docsetId=prochp&amp;amp;docsetTarget=prochp_hpbin_details01.htm&amp;amp;docsetVersion=9.4&amp;amp;locale=en" target="_blank"&gt;PROC HPBIN&lt;/A&gt;. For 30 million random numbers V1 (standard normal distribution) the memory usage of the step below was less than 1% of what PROC SUMMARY used. On top of that it was more than twice as fast on my machine.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;ods select none;
ods output quantile=qtl;
proc hpbin data=test computequantile;
input v1;
run;
ods select all;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Note, however, that PROC HPBIN uses the&amp;nbsp;percentile definition corresponding to &lt;A href="https://documentation.sas.com/?docsetId=procstat&amp;amp;docsetTarget=procstat_univariate_details14.htm&amp;amp;docsetVersion=9.4&amp;amp;locale=en" target="_blank"&gt;PCTLDEF=&lt;/A&gt;3 in PROC MEANS/PROC SUMMARY and PROC UNIVARIATE, whereas the default is PCTLDEF=5.&lt;/P&gt;
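&lt;P&gt;To compare the two procedures on equal terms, one option is to request definition 3 explicitly in PROC MEANS. The following is only a sketch, reusing TEST and V1 from above. The OBS= limit and the output names are arbitrary choices for illustration, meant to keep the check cheap; PROC HPBIN would have to be run on the same subset for the results to be comparable.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* cross-check on a subset only; QTL_DEF3 and the OBS= value are arbitrary */
proc means data=test(obs=100000) noprint pctldef=3;
  var v1;
  output out=qtl_def3 q1=v1_q1 median=v1_median q3=v1_q3;
run;&lt;/CODE&gt;&lt;/PRE&gt;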
&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/235185"&gt;@Ashok3395&lt;/a&gt;&amp;nbsp;wrote:
&lt;P&gt;WARNING: The data set SASPL2.ALL_PERCENTILE may be incomplete.&amp;nbsp; When this step was stopped there were 0 observations and &lt;STRONG&gt;31&lt;/STRONG&gt;&amp;nbsp;variables.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;I suspect that you requested quartiles for several variables and possibly also used CLASS variables.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In this case you could&amp;nbsp;cut down&amp;nbsp;memory usage by restricting the computations to one analysis variable at a time and avoiding CLASS variables (use BY or WHERE statements instead if possible).&lt;/P&gt;
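&lt;P&gt;For example, a step along these lines handles a single analysis variable for a single group. This is a sketch, not your actual setup: GRP and its value 'A' are made-up names, while TEST and V1 are as above. A BY statement would work similarly but requires the data to be sorted or indexed by the BY variable.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* one analysis variable and one group per step; GRP is hypothetical */
proc summary data=test;
  where grp = 'A';
  var v1;
  output out=qtl_a min=v1_min q1=v1_q1 median=v1_median q3=v1_q3 max=v1_max;
run;&lt;/CODE&gt;&lt;/PRE&gt;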
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Further, note that multi-threading (in PROC MEANS/PROC SUMMARY) requires more memory, see &lt;A href="https://documentation.sas.com/?docsetId=proc&amp;amp;docsetTarget=n1qnc9bddfvhzqn105kqitnf29cp.htm&amp;amp;docsetVersion=9.4&amp;amp;locale=en#p0sr73r5sqdysln1pj92fpax4ab2" target="_blank"&gt;NOTHREADS option&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Jan 2019 11:11:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Calculating-percentile-for-Huge-data/m-p/524239#M73491</guid>
      <dc:creator>FreelanceReinh</dc:creator>
      <dc:date>2019-01-03T11:11:19Z</dc:date>
    </item>
    <item>
      <title>Re: Calculating percentile for Huge data.</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Calculating-percentile-for-Huge-data/m-p/532141#M73761</link>
      <description>&lt;P&gt;If you don't have ties and aren't generating the quantiles for many variables (so that you aren't spending lots of time/memory sorting), you could sort your data and identify the records where your quantiles occur simply by taking the observations at the positions of the 0th, 25th, 50th, 75th, and 100th percentiles in the sorted order.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;E.g., if N=30,000,000 and there are no ties, then you can use the following subsetting IF statement in a DATA step that reads the sorted data to get the quantiles:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;if _N_=1&amp;nbsp; or _N_=30000000 or _N_=ceil(.25*30000000) or _N_=ceil(.50*30000000) or _N_=ceil(.75*30000000);&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you have ties, the issue is much more complicated and I'd go with the HPBIN approach or one of the other recommendations.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Also, creating quantiles by levels of other variables complicates matters: you'd need to determine the number of observations in each class, which adds time to this process. You could do that with BY processing, which shouldn't be too time intensive: a sort followed by two DATA steps, the first using BY processing to figure out the number of observations in each class and the second selecting the observations. I guess some bookkeeping would be needed to know the _N_ values delineating the beginning and end of each set of class records.&lt;/P&gt;</description>
      <pubDate>Fri, 01 Feb 2019 18:19:47 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Calculating-percentile-for-Huge-data/m-p/532141#M73761</guid>
      <dc:creator>DWilson</dc:creator>
      <dc:date>2019-02-01T18:19:47Z</dc:date>
    </item>
  </channel>
</rss>

