BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
molla
Fluorite | Level 6

Hi

While doing percentile calculation using proc univariate,  I could not get the expected output,pls help me for the same.

 

PROC UNIVARIATE DATA=DATA_SET EXCLNPWGCT;
VAR VAR1;WEIGHT VAR1;
OUTPUT OUT=OUTDATASET PCTLPRE=P_PCTLPTS=12.5 25 27.5 50 62.5 75 87.5 100;
RUN;

 

the following are the values in the var1 column of the data_set table after excluding the zero and negative values:

19967.95
19271.69
16525.2
6885.5
3442.75

 


percentile       
12.5
25
37.5
50
62.5
75
87.5
100

 

expected o/p
5164.125
6885.5
11705.35
16525.2
17898.44412
19271.68824
19619.81912
19967.95

 

output coming
6885.5
16525.2
16525.2
19271.68824
19271.68824
19967.95
19967.95
19967.95

 

Appreciate your help.Thanks

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

So did you create a dataset named Q with a variable named X for the IML code to read int?

 

data q;
  input x;
cards;
19967.95
19271.69
16525.2
6885.5
3442.75
;



proc iml;
/* Define function that returns the TYPE=7 sample quantiles. For more info, see   https://blogs.sas.com/content/iml/2017/05/24/definitions-sample-quantiles.html*/
start GetRQuantiles(y, probs);
   x = colvec(y);
   call sort(x);
   N = nrow(x);       /* assume all values are nonmissing */
   
   p = colvec(probs);
   m = 1-p;
   j = floor(N*p + m);
   g = N*p + m - j;

   q = j(nrow(p), 1, x[N]);    /* if p=1, x[N]=return max(x) */
   idx = loc(p < 1);
   if ncol(idx) >0 then do;
      j = j[idx]; g = g[idx];
      q[idx] = (1-g)#x[j] + g#x[j+1];
   end;   
   return q;
finish;

use q; read all var "x"; close;       /* read sample into x */
p = {12.5, 25, 37.5, 50, 62.5, 75, 87.5, 100} / 100;  /* define probabilities */
q = GetRQuantiles(x, p);  /* sample quantiles */
print p q;

quit;

Output
image.png

 

View solution in original post

27 REPLIES 27
ballardw
Super User

The option PCTLDEF= sets which of five different methods are used to calculate percentiles. Read the documentation for the different meanings. Or just try PCTLDEF= 1 (through 4, 5 is the default method and you don't like that) and see if any of the results match expectations.

The presence of a WEIGHT variable significantly affects the results. I have to say that seeing the same variable on a VAR and WEIGHT statement makes me queasy.

 

 

molla
Fluorite | Level 6

to remove the zero and negative values I used weight option ,by trying PCTLDEF option also am not getting the required output

PaigeMiller
Diamond | Level 26

@molla wrote:
to remove the zero and negative values I used weight option ,by trying PCTLDEF option also am not getting the required output

Be specific: which PCTLDEF options did you try?

--
Paige Miller
molla
Fluorite | Level 6

I tried PCTLDEF = 1

molla
Fluorite | Level 6

I have tried with other options as well but not getting the exact output

ballardw
Super User

Here is an example of what sort of thing might happen when you have "weights" based on the variable you are requesting percentiles about. This uses the same input data set with and without weights.

Please look at the result and see if one of them comes closer to what you expect.

 

data example;
  do x= 1 to 25;
  output;
  end;
run;

title "with weight";
proc univariate data=example EXCLNPWGT;
 VAR x;
 WEIGHT x;
 OUTPUT OUT=OUTDATASET1 PCTLPRE=P_ PCTLPTS=12.5 25 27.5 50 62.5 75 87.5 100;
RUN;

title "without weight";
proc univariate data=example EXCLNPWGT;
 VAR x;
 OUTPUT OUT=OUTDATASET2 PCTLPRE=P_ PCTLPTS=12.5 25 27.5 50 62.5 75 87.5 100;
RUN;
title;

This is admittedly moderately "nice" data but the purpose is to illustrate that weights and percentiles are a somewhat complex issue and using the same value for the weight is problematic at best.

PaigeMiller
Diamond | Level 26

Try PCTLDEF = 2 and PCTLDEF=3 and PCTLDEF=4

--
Paige Miller
molla
Fluorite | Level 6
tried bt not getting the accurate results
Tom
Super User Tom
Super User

How do you expect it to assign distinct values for the 8 different precentiles when your input only has 5 different values?

Rick_SAS
SAS Super FREQ

The percentiles you list are Definition #7 in the Hyndman and Fan (1997) lists of nine definitions for sample quantiles. This is the default in R. SAS does not natively support Definition #7, but you can use a formula to obtain it. The article "Sample quantiles: A comparison of 9 definitions" provides the details and SAS code to compute all nine definitions.

 

The article contains the function definitions. Here's how you get your results:

proc iml;
/* get module from program at https://blogs.sas.com/content/iml/files/2017/05/QntlDef.txt */
load module=(SampQuantile);
use Q; read all var "x"; close;       
prob = {12.5,25,37.5,50,62.5,75,87.5,100} / 100;
def = 7;
quantile = SampQuantile(x, prob, def);
print def prob quantile;

Region Capture.png

 

Be sure the read the end of my article to understand that these definitions are all estimates for the underlying quantiles of the data-generating distribution. None is "more correct" than any other. As the samples size gets larger, the definitions all give similar estimates.

molla
Fluorite | Level 6
Hi,
Its some what complex,not able to understand can we do the same in proc freq or proc mean?
If yes pls let me know how to do it
Rick_SAS
SAS Super FREQ

OK, here's a simpler suggestion: use the default estimates from PROC UNIVARIATE or another SAS procedure.

SAS procedures provide five ways to estimate quantiles. There are other estimates, as discussed in my article. But all of these are valid estimates and, in practice, there is rarely a reason to prefer one estimate over the other. 

molla
Fluorite | Level 6
By using the five ways of sas to estimate quantiles am not getting the accurate output,not able to understand how to implement the way which u have explained

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 27 replies
  • 5131 views
  • 9 likes
  • 6 in conversation