BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
karthikmanimca
Calcite | Level 5

Dear Team,

I am new to statistics field. In my current assignment, I need to replicate the manual excel work with SAS excel output. I am facing below challenge, on getting the excel percentile value in SAS output.

 

My dataset is having a column named Sample with below 4 values. I am trying to get the 25th and 75th percentile value for the Sample column. Presently, team is deriving the 25th percentile using Excel function PERCENTILE(A2:A5,.25), and PERCENTILE(A2:A5,.75). 

 

                  karthikmanimca_0-1686899256955.png

 

Tried simple PROC MEANS and PROC UNIVERATE in SAS. Please find the code and outputs below,


Code and Outputs:

 

PROC PRINT DATA=DEMO;
    TITLE 'Output of Demo Dataset';
RUN;
       karthikmanimca_1-1686899982280.png
 
PROC MEANS DATA=WORK.DEMO P25 P75 NOPRINT;
    VAR SAMPLE;
    OUTPUT OUT = C_MEANS 
    P25    = C_P25
    P75    = C_P75;
RUN;
     karthikmanimca_2-1686900012491.png

 

PROC UNIVARIATE DATA=WORK.DEMO NOPRINT;
    VAR SAMPLE;
    OUTPUT OUT =C_UNIV
    PCTLPTS = 25, 75
    PCTLPRE = P_;
RUN;
    karthikmanimca_3-1686900044391.png

Sincere Thanks in advance.

Karthik Mani.

 

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

Here you can get all 9 definition percentiles by Rick's code.

 

/*
There 9 different definitions  percentile.
Check Rick's blog:
https://blogs.sas.com/content/iml/2017/05/24/definitions-sample-quantiles.html

And Rick used IML code to calculate all 9 definitions  percentile.
P.S. SAS  only compute 1-5 definitions  percentile.

The following code is written by Rick.You can see EXCEL used the 7th percentile.
*/

data Q;
input x ;
datalines;
0.1
0.48
0.98
1.68
;
proc iml;
/* implement sample quantiles in Hyndman and Fan (1996) */
start SampQuantile(y, probs={0.25, 0.5, 0.75}, Type=2);
   x = colvec(y);
   p = colvec(probs);
   if element(Type, {1 2 3 4 6}) then do;
      SASType = {3 5 2 1 . 4};  /* convert from Type to QNTLDEF */
      call qntl(q, x, p, SASType[Type]);
      return q;
   end;
   if any(x=.) then 
      x = x[loc(x ^= .)]; /* remove missing values */
   call sort(x);
   N = nrow(x);
   if Type=5 then 
      return quantileM(x, p, 0.5, 1/(2*N));
   else if Type=7 then
      return quantileM(x, p, 1-p, 0);
   else if Type=8 then
      return quantileM(x, p, (p+1)/3, 2/(3*N+1));
   else if Type=9 then
      return quantileM(x, p, (2*p+3)/8, 5/(8*N+2));
finish;

start quantileM(x, p, m, c);
   N = nrow(x);       /* assume nonmissing */
   j = floor(N*p + m);
   g = n*p + m - j;
   if all(p>=c & p<1-c) then     /* usual case */
      return (1-g)#x[j] + g#x[j+1];

   /* otherwise, handle extreme values of p */
   /* When p < c, use x[1]. When p = 1 - c, use x[N] */
   q = j(nrow(p),1,.);
   idx1 = loc(p <  c);           /* set these quantiles to x[1] */
   idxN = loc(p >= 1-c);         /* set these quantiles to x[N] */
   idx  = loc(p > c & p < 1-c);
   if ncol(idx1)>0 then   q[idx1] = x[1];
   if ncol(idxN)>0 then   q[idxN] = x[N];
   if ncol(idx) >0 then do;
      j = j[idx]; g = g[idx];
      q[idx] = (1-g)#x[j] + g#x[j+1];
   end;
   return q;
finish;
store module=(SampQuantile quantileM);
quit;

/* Driver SAS/IML program to compare all 9 methods  */
proc iml;
load module=(SampQuantile);
use Q; read all var "x"; close;       
prob = T(1:999)/1000;                 * fine mesh of prob values;
create Pctls var {"Type" "quantile" "prob" "x"};
do t = 1 to 9;
   quantile = SampQuantile(x, prob, t);
   Type = j(nrow(prob), 1, t);
   append;
end;
close;
quit;
proc print data=Pctls noobs;
   where prob = 0.25;                 /* compare 0.25 quantile for different definitions */
run;

Ksharp_0-1686915965171.png

 

View solution in original post

8 REPLIES 8
ballardw
Super User

If you read the documentation for Proc Means you will see that there are actual several different settings that related to Percentile calculations. Look at the option QNTLDEF on this page: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/proc/n1qnc9bddfvhzqn105kqitnf29cp.htm

and https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/proc/p0v0y1on1hbxukn0zqgsp5ky8hc0.htm#n096sxk...

for Methods Computing Quantile Statistics. Some of these options involve how ties are broken as

 

Showing pictures of spreadsheets is misleading. One, it is not actual values but a picture so unless we retype everything we can't actually use anything shown. Second we don't see the actual formula , as in actual cells selected or any other options. Third is we do not know what precision of values may actually exist for use. Consider that Excel typically defaults to two decimal places when assigning the cell as numeric. A value that displays as 0.48 might actually be 0.481 or 0.479 (or quite a range actually). So the values shown are just approximations of just what they may be. I don't know, and really don't care, which default algorithm Excel uses but it is pretty obviously different for determining quantiles.

 

The typical "beginning" definition of a quantile is the order or between two values (and how those values) depending on which quantile is requested.  The "median" of 4 values typically wold fall between second and third values. So one set of rules will average those two. Another set might weight based on the values, another might take one of the end points.

The P25 of 4 values would fall beween first and second, so the default SAS uses is the average , 0.29.

 

SAS also has a similar issue based on assigned display format. So a format that only displays two decimal places can round values. So pictures of SAS output can also be misleading.

 

Provide example data in the form of data step code pasted into a text box opened with the </> icon above the message window to preserve text layout, the message windows will remove white space and other things depending on code, and to differentiate code from discussion.

data have;
   input x;
datalines;
0.98
1.68
0.48
0.10
;

I tried running all 5 definitions of QNTLDEF (or PCTLDEF, same thing different name) in SAS. I do not get the result you show for Excel using any of them.

The closest I get is using this and it isn't too far from the Excel P75. But P25 is still right out.

proc means data=have qntldef=4 maxdec=3 p25 p75;
  var x;
run;

Which yields:

Analysis Variable : x
   
25th Pctl 75th Pctl
0.195 1.505

If you were creating output sets from Means to use print for display because you didn't like the number of decimals involved you might look into the MAXDEC option in proc means. That sets the maximum number of decimals to display.

karthikmanimca
Calcite | Level 5

Dear @ballardw,

 

Thanks for looking into my issue and for the possible solutions. Noted all your solutions, comments, inputs, and suggestions. However, my team is expecting the Percentile value from the SAS exactly same as Excel. 

 

Thanks and Regards,

Karthik Mani.

PaigeMiller
Diamond | Level 26

@karthikmanimca wrote:

 

 

Thanks for looking into my issue and for the possible solutions. Noted all your solutions, comments, inputs, and suggestions. However, my team is expecting the Percentile value from the SAS exactly same as Excel. 


Did you try the five different PROC UNIVARIATE options to see if they match what Excel produces?

 

Unfortunately, sometimes expecting Excel and SAS to match identically is going to lead to disappointment.

--
Paige Miller
karthikmanimca
Calcite | Level 5

Dear @PaigeMiller,

Thanks for the response. I am very new to Statistics and still learning. Will explore the PROC UNIVERATE procedure and its option.

Regards,

Karthik Mani.

Ksharp
Super User
/************
另外一种方法(SAS和R和Excel有相同的分位数):
https://blogs.sas.com/content/iml/2021/07/26/compare-quantiles-sas-r-python.html
************/

proc iml;
/* By default, R (and Julia, and some Python packages) uses
   Hyndman and Fan's Type=7 definition. Compute Type=7 sample quantiles.
*/
start GetQuantile7(y, probs);
   x = colvec(y);
   N = nrow(x);
   if N=1 then return (y);  /* handle the degenerate case, N=1 */
 
   /* remove missing values, if any */
   idx = loc(x^=.);
   if ncol(idx)=0 then  
      return (.);           /* all values are missing */
   else if ncol(idx)<N then do;
      x = x[idx,]; N = nrow(x);  /* remove missing */
   end;
 
   /* Main computation: Compute Type=7 sample quantile. 
      Estimate is a linear interpolation between x[j] and x[j+1]. */
   call sort(x);
   p = colvec(probs);
   m = 1-p;
   j = floor(N*p + m);      /* indices into sorted data values */
   g = N*p + m - j;         /* 0 <= g <= 1 for interpolation */
 
   q = j(nrow(p), 1, x[N]); /* if p=1, estimate by x[N]=max(x) */
   idx = loc(p < 1);
   if ncol(idx) >0 then do;
      j = j[idx]; g = g[idx];
      q[idx] = (1-g)#x[j] + g#x[j+1]; /* linear interpolation */
   end;   
   return q;
finish;
 
/* Compare the SAS and R default definitions.
   The differences between definitions are most apparent 
   for small samples that have large gaps between adjacent data values. */
x = {0.1 0.48 0.98 1.68 }`;
prob = {0.25, 0.75};
 
call qntl(SASDefaultQntl, x, prob);
RDefaultQntl = GetQuantile7(x, prob);
print prob SASDefaultQntl RDefaultQntl;
quit;

Ksharp_0-1686915726470.png

 

karthikmanimca
Calcite | Level 5

Dear @Ksharp,

Thank you so much for taking your time and wonderful code. I feel. we have the solution and will work for my scenario. Will keep you and every updated soon.

Regards,

Karthik Mani.

Ksharp
Super User

Here you can get all 9 definition percentiles by Rick's code.

 

/*
There 9 different definitions  percentile.
Check Rick's blog:
https://blogs.sas.com/content/iml/2017/05/24/definitions-sample-quantiles.html

And Rick used IML code to calculate all 9 definitions  percentile.
P.S. SAS  only compute 1-5 definitions  percentile.

The following code is written by Rick.You can see EXCEL used the 7th percentile.
*/

data Q;
input x ;
datalines;
0.1
0.48
0.98
1.68
;
proc iml;
/* implement sample quantiles in Hyndman and Fan (1996) */
start SampQuantile(y, probs={0.25, 0.5, 0.75}, Type=2);
   x = colvec(y);
   p = colvec(probs);
   if element(Type, {1 2 3 4 6}) then do;
      SASType = {3 5 2 1 . 4};  /* convert from Type to QNTLDEF */
      call qntl(q, x, p, SASType[Type]);
      return q;
   end;
   if any(x=.) then 
      x = x[loc(x ^= .)]; /* remove missing values */
   call sort(x);
   N = nrow(x);
   if Type=5 then 
      return quantileM(x, p, 0.5, 1/(2*N));
   else if Type=7 then
      return quantileM(x, p, 1-p, 0);
   else if Type=8 then
      return quantileM(x, p, (p+1)/3, 2/(3*N+1));
   else if Type=9 then
      return quantileM(x, p, (2*p+3)/8, 5/(8*N+2));
finish;

start quantileM(x, p, m, c);
   N = nrow(x);       /* assume nonmissing */
   j = floor(N*p + m);
   g = n*p + m - j;
   if all(p>=c & p<1-c) then     /* usual case */
      return (1-g)#x[j] + g#x[j+1];

   /* otherwise, handle extreme values of p */
   /* When p < c, use x[1]. When p = 1 - c, use x[N] */
   q = j(nrow(p),1,.);
   idx1 = loc(p <  c);           /* set these quantiles to x[1] */
   idxN = loc(p >= 1-c);         /* set these quantiles to x[N] */
   idx  = loc(p > c & p < 1-c);
   if ncol(idx1)>0 then   q[idx1] = x[1];
   if ncol(idxN)>0 then   q[idxN] = x[N];
   if ncol(idx) >0 then do;
      j = j[idx]; g = g[idx];
      q[idx] = (1-g)#x[j] + g#x[j+1];
   end;
   return q;
finish;
store module=(SampQuantile quantileM);
quit;

/* Driver SAS/IML program to compare all 9 methods  */
proc iml;
load module=(SampQuantile);
use Q; read all var "x"; close;       
prob = T(1:999)/1000;                 * fine mesh of prob values;
create Pctls var {"Type" "quantile" "prob" "x"};
do t = 1 to 9;
   quantile = SampQuantile(x, prob, t);
   Type = j(nrow(prob), 1, t);
   append;
end;
close;
quit;
proc print data=Pctls noobs;
   where prob = 0.25;                 /* compare 0.25 quantile for different definitions */
run;

Ksharp_0-1686915965171.png

 

karthikmanimca
Calcite | Level 5

Dear Ksharp,

Thanks for the solution and it worked for me. Kindly keep up the great work and support for the SAS community. Looking forward to learn more from your expertise.

Yours sincerely,

Karthik Mani.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 8 replies
  • 1145 views
  • 11 likes
  • 4 in conversation