swimmingfish4
Fluorite | Level 6

I followed the exact methodology from the SAS 9.4 Procedures Guide: Statistical Procedures, Third Edition to calculate confidence intervals for percentiles. However, I found a few discrepancies when I compared my calculated results with what SAS returned.

 

For example, I want to calculate the CI for the 90th percentile of the results. The rank of the percentile is calculated as $\lfloor np \rfloor + 1$, which gives 2312.

I start with the pair of order statistics (2311, 2313), then (2310, 2314), (2309, 2315), and so on. The optimal pair is (2282, 2342), with coverage probability 0.9515646. However, SAS returned the pair (2281, 2341), and I don't know why this result is produced instead of the symmetric pair (2282, 2342) around 2312. I am using the CIPCTLDF=(lowerpre=LCL upperpre=UCL) option to get the distribution-free confidence interval. The data I used can be found in the attachments. N=2568, percentile=90%, and alpha=0.05.

 

Method shown in the SAS procedure page:

The two-sided $100(1-\alpha )\% $ confidence limits for the $100p$th percentile are

\[  \begin{array}{lcl} \mbox{lower limit} &  = &  X_{(l)} \\ \mbox{upper limit} &  = &  X_{(u)} \end{array}  \]

where $X_{(j)}$ is the jth order statistic when the data values are arranged in increasing order:

\[  X_{(1)} \leq X_{(2)} \leq \ldots \leq X_{(n)}  \]

The lower rank $l$ and upper rank $u$ are integers that are symmetric (or nearly symmetric) around $\lfloor np \rfloor +1$, where $\lfloor np \rfloor $ is the integer part of $np$ and $n$ is the sample size. Furthermore, $l$ and $u$ are chosen so that $X_{(l)}$ and $X_{(u)}$ are as close to $X_{\lfloor np \rfloor +1}$ as possible while satisfying the coverage probability requirement,

\[  Q(u-1;n,p) - Q(l-1;n,p) \geq 1 - \alpha  \]

where $Q(k;n,p)$ is the cumulative binomial probability,

\[  Q(k;n,p) = \sum_{j=0}^{k} \binom{n}{j} p^{j} (1-p)^{n-j}  \]
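To make the documented rule concrete, here is a minimal stand-alone Python sketch of the symmetric search around $\lfloor np \rfloor + 1$ (illustrative only; not SAS's internal implementation). It sums the binomial CDF in log space because the binomial coefficients for n = 2568 overflow double precision:

```python
import math

def binom_cdf(k, n, p):
    """Cumulative binomial probability Q(k; n, p) = Pr(X <= k).
    Summed in log space: C(2568, k) overflows a double directly."""
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log(1 - p))
        total += math.exp(log_pmf)  # underflows harmlessly to 0 far from the mode
    return total

def symmetric_ci_ranks(n, p, alpha=0.05):
    """Widen (l, u) symmetrically around floor(n*p) + 1 until
    Q(u-1; n, p) - Q(l-1; n, p) >= 1 - alpha."""
    mid = math.floor(n * p) + 1
    for j in range(min(mid - 1, n - mid) + 1):
        l, u = mid - j, mid + j
        cover = binom_cdf(u - 1, n, p) - binom_cdf(l - 1, n, p)
        if cover >= 1 - alpha:
            return l, u, cover
    raise ValueError("no symmetric pair reaches the requested coverage")

l, u, cover = symmetric_ci_ranks(2568, 0.9)
print(l, u)  # 2282 2342
```

For n = 2568 and p = 0.9 this reproduces the symmetric pair (2282, 2342) with coverage about 0.9516, not the (2281, 2341) that PROC UNIVARIATE reports.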

12 REPLIES
ballardw
Super User

It would help to post the code for the procedure you used. Without that, we are guessing as to which procedure and options were actually specified.

swimmingfish4
Fluorite | Level 6

Sure. Below is the code I used to calculate the distribution-free confidence interval for the 90th percentile.

 

***read in csv file;

proc import out=testdata1 datafile="vec.csv" dbms=csv replace;
getnames=yes;
run;

***use the CIPCTLDF option to request distribution-free confidence limits for percentiles;
proc univariate data=testdata1 cipctldf;
var x;
output out=pctl1 pctlpts=90 pctlpre=p
cipctldf=(lowerpre=LCL upperpre=UCL);
run;
proc print noobs;run;endsas;

 

Results generated are shown below:

 

The SAS System

The UNIVARIATE Procedure
Variable: x

Moments

N 2568 Sum Weights 2568
Mean 237.121456 Sum Observations 608927.9
Std Deviation 2133.64526 Variance 4552442.11
Skewness 16.3191468 Kurtosis 309.929034
Uncorrected SS 1.18305E10 Corrected SS 1.16861E10
Coeff Variation 899.811133 Std Error Mean 42.1041308


Basic Statistical Measures

Location Variability

Mean 237.1215 Std Deviation 2134
Median 5.1000 Variance 4552442
Mode 0.0000 Range 50274
Interquartile Range 13.30000


Tests for Location: Mu0=0

Test -Statistic- -----p Value------

Student's t t 5.631786 Pr > |t| <.0001
Sign M 1259 Pr >= |M| <.0001
Signed Rank S 1585711 Pr >= |S| <.0001


Quantiles (Definition 5)

95% Confidence Limits -------Order Statistics-------
Level Quantile Distribution Free LCL Rank UCL Rank Coverage

100% Max 50274.3
99% 4739.8 3266.9 7294.7 2532 2552 95.19
95% 335.0 231.2 529.9 2419 2463 95.29
90% 85.3 71.4 102.2 2281 2341 95.15
75% Q3 15.8 13.7 18.5 1883 1970 95.26
50% Median 5.1 4.8 5.4 1235 1335 95.15
25% Q1 2.5 2.4 2.6 599 686 95.26
10% 1.4 1.3 1.5 228 288 95.15
5% 0.8 0.7 0.9 106 150 95.29
1% 0.0 0.0 0.0 17 37 95.19
0% Min 0.0


The SAS System

The UNIVARIATE Procedure
Variable: x

Extreme Observations

----Lowest---- -----Highest-----

Value Obs Value Obs

0 2427 27810.0 1558
0 2426 35257.3 2338
0 2425 36921.4 2339
0 2167 48214.6 2340
0 2019 50274.3 1666
The SAS System

p90 LCL90 UCL90

85.3 71.4 102.2

Rick_SAS
SAS Super FREQ

I assume that you computed the CIs with a program and got a different answer? Please post the program that you used to check the results.

swimmingfish4
Fluorite | Level 6

Below is an R program to calculate the distribution-free CI for percentiles. For the 90th percentile it returns a CI of (71.9, 103.6), corresponding to the pair of order statistics (2282, 2342), while SAS returns a CI of (71.4, 102.2), corresponding to the pair (2281, 2341).

 

 

calc_ci<- function(vec,percentile){

library(tidyverse)
vec_order<- sort(vec) ###sorted in ascending order
n<- length(vec)

mid<- floor(n*percentile)+1
###type 5: empirical distribution function with averaging
if (n*percentile==floor(n*percentile)) perc<- (vec_order[mid-1]+vec_order[mid])/2
else perc<- vec_order[mid] #exact version, percentile are calculated by choosing the order statistic

len<- min(mid,n-mid)
list<- data.frame(i=0:len) %>% mutate(lower=mid-i,upper=mid+i) %>%
mutate(prob=pbinom(q = upper-1,size = n, prob = percentile) - pbinom(q = lower-1,size = n, prob = percentile),
logic=prob>=1-alpha)
i_select<- min(list[list$logic,'i'])
u<- mid + i_select
l<- mid - i_select

ci_per<- data.frame("p"=perc,"lcl"=vec_order[l],"ucl"=vec_order[u])
return(ci_per)
}

 

calc_ci(vec,0.9)

ballardw
Super User

I don't speak R yet, but I am never surprised when different programs (SAS vs. R) return slightly different answers. There are so many things going into the background algorithms regarding precision, rounding, and order of operations.

Rick_SAS
SAS Super FREQ

I get the following error when I try to run your script:

   vec = 1:100
   calc_ci(vec,0.9)


Error in mutate_impl(.data, dots) :
Evaluation error: non-numeric argument to binary operator.

 

swimmingfish4
Fluorite | Level 6

Sorry, I forgot to set alpha to 0.05.

 

###set alpha to be 0.05

alpha<- 0.05

calc_ci<- function(vec,percentile){

library(tidyverse)
vec_order<- sort(vec) ###sorted in ascending order
n<- length(vec)

mid<- floor(n*percentile)+1
###type 5: empirical distribution function with averaging
if (n*percentile==floor(n*percentile)) perc<- (vec_order[mid-1]+vec_order[mid])/2
else perc<- vec_order[mid] #exact version, percentile are calculated by choosing the order statistic

len<- min(mid,n-mid)
list<- data.frame(i=0:len) %>% mutate(lower=mid-i,upper=mid+i) %>%
mutate(prob=pbinom(q = upper-1,size = n, prob = percentile) - pbinom(q = lower-1,size = n, prob = percentile),
logic=prob>=1-alpha)
i_select<- min(list[list$logic,'i'])
u<- mid + i_select
l<- mid - i_select

ci_per<- data.frame("p"=perc,"lcl"=vec_order[l],"ucl"=vec_order[u])
return(ci_per)
}

 

calc_ci(vec,0.9)

Rick_SAS
SAS Super FREQ

Although I do not have the answer to your question, I looked at this last night and have a few comments. I think there are two main issues:

1. The choice of the initial value for the lower and upper index.

2. The method by which the lower index is decremented and the upper index is incremented until the difference in binomial probability exceeds the target coverage probability. See the paragraph that contains Eqn 5.1.

 

I have attached a copy of the relevant pages from Hahn and Meeker (1991), which is the reference in the SAS doc. The description of the method of determining the interval is fairly vague, using terms like "nearly symmetrical" and "as close together as possible."

 

Your program uses floor(np)+1 as the initial index to start the incrementing/decrementing process. However, Hahn and Meeker propose using "integers l and u that are closest to p(n+1)."  I interpret that as l=floor(p*(n+1)) and u=ceil(p*(n+1)) although it is not clear to me what to do if l=u, which seems to be forbidden by Eqn 5.1. Do you decrement l? Increment u? I don't know.
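For the numbers in this thread, Hahn and Meeker's prescription gives distinct starting ranks, so at least the l=u ambiguity does not arise here (a quick check):

```python
import math

n, p = 2568, 0.9                      # sample size and percentile from this thread
x = p * (n + 1)                       # 2312.1
l0, u0 = math.floor(x), math.ceil(x)  # Hahn-Meeker starting ranks
print(l0, u0)                         # 2312 2313
```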

 

In your program, you follow Hahn and Meeker's sentence that says "For example, ...you can use l=i^{-} - j and u=i^{+} + j for j=1,2,3..., incrementing j until [Eqn 5.1] is satisfied." But their comment is only one way to  implement the goal to "choose the integers l and u symmetrically (or nearly symmetrically) around p*(n+1) and as close together as possible." For example, an alternative is to test at each step whether decrementing l or incrementing u (and leaving the other unchanged) would satisfy Eqn 5.1. If so, that would lead to values that are closer together than changing both bounds in lockstep. I don't know what PROC UNIVARIATE is doing, but I suspect it is doing something more complicated, given that your program gives different answers.
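As a purely illustrative sketch of such an asymmetric variant (no claim that this is what PROC UNIVARIATE does), one could extend whichever bound gains more probability at each step:

```python
import math

def log_binom_pmf(k, n, p):
    # log of the binomial probability mass function
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def binom_cdf(k, n, p):
    # Q(k; n, p), summed in log space to avoid overflow at n = 2568
    return sum(math.exp(log_binom_pmf(i, n, p)) for i in range(k + 1))

def greedy_ci_ranks(n, p, alpha=0.05):
    """Asymmetric search sketch: starting from l = u = floor(n*p) + 1,
    extend one bound at a time (the one that adds more probability)
    until Q(u-1; n, p) - Q(l-1; n, p) >= 1 - alpha."""
    l = u = math.floor(n * p) + 1
    cover = 0.0
    while cover < 1 - alpha and (l > 1 or u < n):
        gain_low = math.exp(log_binom_pmf(l - 1, n, p)) if l > 1 else -1.0
        gain_high = math.exp(log_binom_pmf(u, n, p)) if u < n else -1.0
        if gain_low >= gain_high:
            l -= 1  # decrementing l adds pmf(l-1) to the coverage
        else:
            u += 1  # incrementing u adds pmf(u) to the coverage
        cover = binom_cdf(u - 1, n, p) - binom_cdf(l - 1, n, p)
    return l, u, cover

l, u, cover = greedy_ci_ranks(2568, 0.9)
print(l, u, cover)  # the exact pair depends on this tie-breaking rule
```

Because it adds the largest remaining probability at each step, this variant can satisfy Eqn 5.1 with an interval no wider than the symmetric one, but the resulting pair depends on the tie-breaking choice.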

 

Incidentally, the SAS interval (2281,2341) has a smaller coverage probability than the interval  (2282,2342) that your program produces, so I guess that might indicate that UNIVARIATE is using a more sophisticated decrement/increment algorithm.

 

proc iml;
n = 2568;
p = 0.9;
mid = p*(n+1);
print mid;
l = 2281;  u = 2341;
dQSAS = cdf("binomial", u-1, p, n) - cdf("binomial", l-1, p, n);
print l u dQSAS;
l = 2282;  u = 2342;
dQAlt = cdf("binomial", u-1, p, n) - cdf("binomial", l-1, p, n);
print l u dQAlt;

Unfortunately, I will be out of the office for the next week, but please post any progress that you make. It is an interesting question.

swimmingfish4
Fluorite | Level 6

Thank you Rick for your detailed response. Here are some of my thoughts on the difference identified:

 

1. You mentioned that the choice of the initial values for the lower and upper index may differ. That's right. I did start with Hahn and Meeker's guidance, choosing l=floor(p*(n+1)) and u=ceil(p*(n+1)), but later I saw the SAS theory page for calculating percentiles in The UNIVARIATE Procedure (http://support.sas.com/documentation/cdl/en/procstat/67528/HTML/default/viewer.htm#procstat_univaria...), which states that

The lower rank $l$ and upper rank $u$ are integers that are symmetric (or nearly symmetric) around $\lfloor np \rfloor +1$, where $\lfloor np \rfloor $ is the integer part of $np$ and $n$ is the sample size. Furthermore, $l$ and $u$ are chosen so that $X_{(l)}$ and $X_{(u)}$ are as close to $X_{\lfloor np \rfloor +1}$ as possible while satisfying the coverage probability requirement,

\[  Q(u-1;n,p) - Q(l-1;n,p) \geq 1 - \alpha  \]
In order to align with what SAS does, I decided to choose floor(np)+1 as the initial index to start the incrementing/decrementing process.
 
2. In the example data case, yes, the SAS interval (2281,2341) does have a smaller coverage probability than the interval (2282,2342) that the R program produces. Here is my puzzle: assuming SAS implemented a more sophisticated algorithm, I don't know why the interval (2283,2343) is not chosen, since its coverage probability is the closest to 0.95 of the three.
 
pair of order statistics (2282,2342) with coverage probability 0.9515646 (R function result)
pair of order statistics (2281,2341) with coverage probability 0.9514861 (SAS result)
pair of order statistics (2283,2343) with coverage probability 0.9506712 (the smallest coverage probability that still satisfies the requirement)
 
Below is how I searched around the original symmetric CI in R:
l <- 2282
u <- 2342
alpha <- 0.05
percentile <- 0.9
n <- 2568
library(tidyverse)
tab <- tibble("l_candidate" = c(l, l+1, l-1, l, l, l+1, l+1),
              "u_candidate" = c(u, u+1, u-1, u+1, u-1, u, u)) %>%
  mutate(prob = pbinom(q = u_candidate - 1, size = n, prob = percentile) -
                pbinom(q = l_candidate - 1, size = n, prob = percentile),
         logic = prob >= 1 - alpha) %>%
  filter(logic) %>%
  filter(prob == min(prob))
tab
# A tibble: 1 x 4
l_candidate u_candidate prob logic
<dbl> <dbl> <dbl> <lgl>
1 2283 2343 0.951 TRUE
3. I am sure that SAS's algorithm does something different from what is stated in the theory page. I would greatly appreciate any insight you can provide into the hidden implementation on the SAS end.
Rick_SAS
SAS Super FREQ

> I am sure that SAS's algorithm does something different from what is stated in the theory page.

As I pointed out in my previous response, the description in the documentation and in Hahn/Meeker are both vague. There are several algorithms that would be consistent with the text.

 

I cannot provide any insight regarding the "hidden implementation," but if I think about it further, I will let you know.

 

Out of curiosity, what are you trying to accomplish? Is this a school project? A research paper?

swimmingfish4
Fluorite | Level 6

Thank you so much for your input. I am trying to create an R function that can calculate CIs for percentiles, and I want to make sure I get correct results, so I chose the SAS UNIVARIATE procedure as the gold standard.

 

AndersS
Pyrite | Level 9

Hi! I am looking into this problem now. Have you found out anything since the discussion in 2019?

 

I am trying to find initial parameters LLIM and ULIM for the method I describe in

Fast and Accurate Calculation of Descriptive Statistics of Very Large Sets of Data

 

Many thanks in advance!

/Br AndersS

Anders Sköllermo (Skollermo in English)

