I followed the exact methodology from the SAS 9.4 Procedures Guide: Statistical Procedures, Third Edition to calculate confidence intervals for percentiles. However, I found a few discrepancies when I compared my calculated results with what SAS returned.
For example, I want to calculate the CI for the 90th percentile of the results. The percentile rank is calculated as [np]+1 = [2568 x 0.9] + 1 = 2311 + 1 = 2312.
I start with the pair of order statistics (2311, 2313), then (2310, 2314), (2309, 2315), and so on. The optimal pair of order statistics is (2282, 2342), with coverage probability 0.9515646. However, SAS returned the pair (2281, 2341), and I don't know why this result is produced instead of the symmetric pair (2282, 2342) around 2312. I am using the CIPCTLDF=(lowerpre=LCL upperpre=UCL) option to get the distribution-free confidence interval. The data I used can be found in the attachments. N=2568, percentile=90%, and alpha=0.05.
Method shown in the SAS procedure page:
The two-sided $100(1-\alpha)\%$ confidence limits for the $100p$-th percentile are

$$\text{lower limit} = X_{(l)}, \qquad \text{upper limit} = X_{(u)},$$

where $X_{(j)}$ is the $j$th order statistic when the data values are arranged in increasing order: $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$.

The lower rank $l$ and upper rank $u$ are integers that are symmetric (or nearly symmetric) around $\lfloor np \rfloor + 1$, where $\lfloor np \rfloor$ is the integer part of $np$ and $n$ is the sample size. Furthermore, $l$ and $u$ are chosen so that $X_{(l)}$ and $X_{(u)}$ are as close to $X_{(\lfloor np \rfloor + 1)}$ as possible while satisfying the coverage probability requirement

$$Q(u-1; n, p) - Q(l-1; n, p) \ge 1 - \alpha,$$

where $Q(k; n, p)$ is the cumulative binomial probability

$$Q(k; n, p) = \sum_{i=0}^{k} \binom{n}{i} p^i (1-p)^{n-i}.$$
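To make the coverage requirement concrete, here is a quick R check for my case (the name coverage is mine; $Q(k; n, p)$ is just pbinom(k, n, p)):

# coverage probability of the interval defined by order-statistic ranks (l, u)
n <- 2568; p <- 0.9; alpha <- 0.05
coverage <- function(l, u) pbinom(u - 1, n, p) - pbinom(l - 1, n, p)
coverage(2282, 2342)               # symmetric pair around 2312; reproduces my 0.9515646
coverage(2281, 2341)               # the pair SAS returned
coverage(2282, 2342) >= 1 - alpha  # the coverage requirement, TRUE here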
It would help to post the code for the procedure you used. Without that we are guessing as to which procedure and options were actually specified.
Sure. Below is the code I used to calculate the distribution-free confidence interval for the 90th percentile.
***read in csv file;
proc import out=testdata1 datafile="vec.csv" dbms=csv replace;
getnames=yes;
run;
***use the CIPCTLDF option to request distribution-free confidence limits for percentiles;
proc univariate data=testdata1 cipctldf;
var x;
output out=pctl1 pctlpts=90 pctlpre=p
cipctldf=(lowerpre=LCL upperpre=UCL);
run;
proc print noobs; run;
endsas;
Results generated are shown below:
The UNIVARIATE Procedure
Variable: x
Moments
N 2568 Sum Weights 2568
Mean 237.121456 Sum Observations 608927.9
Std Deviation 2133.64526 Variance 4552442.11
Skewness 16.3191468 Kurtosis 309.929034
Uncorrected SS 1.18305E10 Corrected SS 1.16861E10
Coeff Variation 899.811133 Std Error Mean 42.1041308
Basic Statistical Measures
Location Variability
Mean 237.1215 Std Deviation 2134
Median 5.1000 Variance 4552442
Mode 0.0000 Range 50274
Interquartile Range 13.30000
Tests for Location: Mu0=0
Test -Statistic- -----p Value------
Student's t t 5.631786 Pr > |t| <.0001
Sign M 1259 Pr >= |M| <.0001
Signed Rank S 1585711 Pr >= |S| <.0001
Quantiles (Definition 5)
                           ---95% Confidence Limits---   -------Order Statistics-------
Level         Quantile        Distribution Free          LCL Rank   UCL Rank   Coverage
100% Max 50274.3
99% 4739.8 3266.9 7294.7 2532 2552 95.19
95% 335.0 231.2 529.9 2419 2463 95.29
90% 85.3 71.4 102.2 2281 2341 95.15
75% Q3 15.8 13.7 18.5 1883 1970 95.26
50% Median 5.1 4.8 5.4 1235 1335 95.15
25% Q1 2.5 2.4 2.6 599 686 95.26
10% 1.4 1.3 1.5 228 288 95.15
5% 0.8 0.7 0.9 106 150 95.29
1% 0.0 0.0 0.0 17 37 95.19
0% Min 0.0
Extreme Observations
----Lowest---- -----Highest-----
Value Obs Value Obs
0 2427 27810.0 1558
0 2426 35257.3 2338
0 2425 36921.4 2339
0 2167 48214.6 2340
0 2019 50274.3 1666
Output of the pctl1 dataset from PROC PRINT:
p90 LCL90 UCL90
85.3 71.4 102.2
I assume that you computed the CIs with a program and got a different answer? Please post the program that you used to check the results.
Below is an R program to calculate distribution-free CIs for percentiles. For the 90th percentile it returns a CI of (71.9, 103.6), corresponding to the pair of order statistics (2282, 2342), while SAS returns a CI of (71.4, 102.2), corresponding to the pair (2281, 2341).
calc_ci<- function(vec,percentile){
library(tidyverse)
vec_order<- sort(vec) ###sorted in ascending order
n<- length(vec)
mid<- floor(n*percentile)+1
###type 5: empirical distribution function with averaging
if (n*percentile==floor(n*percentile)) perc<- (vec_order[mid-1]+vec_order[mid])/2
else perc<- vec_order[mid] # exact version: the percentile is the chosen order statistic
len<- min(mid,n-mid)
list<- data.frame(i=0:len) %>% mutate(lower=mid-i,upper=mid+i) %>%
mutate(prob=pbinom(q = upper-1,size = n, prob = percentile) - pbinom(q = lower-1,size = n, prob = percentile),
logic=prob>=1-alpha)
i_select<- min(list[list$logic,'i'])
u<- mid + i_select
l<- mid - i_select
ci_per<- data.frame("p"=perc,"lcl"=vec_order[l],"ucl"=vec_order[u])
return(ci_per)
}
calc_ci(vec,0.9)
I don't speak R yet, but I am never surprised when different programs (SAS vs. R) return slightly different answers. There are so many things that go into the algorithms behind the scenes regarding precision, rounding, and order of operations.
I get the following error when I try to run your script:
vec = 1:100
calc_ci(vec,0.9)
Error in mutate_impl(.data, dots) :
Evaluation error: non-numeric argument to binary operator.
Sorry, I forgot to set alpha to 0.05.
###set alpha to be 0.05
alpha<- 0.05
calc_ci<- function(vec,percentile){
library(tidyverse)
vec_order<- sort(vec) ###sorted in ascending order
n<- length(vec)
mid<- floor(n*percentile)+1
###type 5: empirical distribution function with averaging
if (n*percentile==floor(n*percentile)) perc<- (vec_order[mid-1]+vec_order[mid])/2
else perc<- vec_order[mid] # exact version: the percentile is the chosen order statistic
len<- min(mid,n-mid)
list<- data.frame(i=0:len) %>% mutate(lower=mid-i,upper=mid+i) %>%
mutate(prob=pbinom(q = upper-1,size = n, prob = percentile) - pbinom(q = lower-1,size = n, prob = percentile),
logic=prob>=1-alpha)
i_select<- min(list[list$logic,'i'])
u<- mid + i_select
l<- mid - i_select
ci_per<- data.frame("p"=perc,"lcl"=vec_order[l],"ucl"=vec_order[u])
return(ci_per)
}
calc_ci(vec,0.9)
Although I do not have the answer to your question, I looked at this last night and have a few comments. I think there are two main issues:
1. The choice of the initial value for the lower and upper index.
2. The method by which the lower index is decremented and the upper index is incremented until the difference in binomial probability exceeds the target coverage probability. See the paragraph that contains Eqn 5.1.
I have attached a copy of the relevant pages from Hahn and Meeker (1991), which is the reference in the SAS doc. The description of the method of determining the interval is fairly vague, using terms like "nearly symmetrical" and "as close together as possible."
Your program uses floor(np)+1 as the initial index to start the incrementing/decrementing process. However, Hahn and Meeker propose using "integers l and u that are closest to p(n+1)." I interpret that as l=floor(p*(n+1)) and u=ceil(p*(n+1)), although it is not clear to me what to do if l=u, which seems to be forbidden by Eqn 5.1. Do you decrement l? Increment u? I don't know.
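For the n and p in this thread, that starting point is easy to compute (just the arithmetic, nothing SAS-specific):

n <- 2568; p <- 0.9
p * (n + 1)                                          # 2312.1
c(l = floor(p * (n + 1)), u = ceiling(p * (n + 1)))  # l = 2312, u = 2313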
In your program, you follow Hahn and Meeker's sentence that says "For example, ...you can use l=i^{-} - j and u=i^{+} + j for j=1,2,3..., incrementing j until [Eqn 5.1] is satisfied." But their comment is only one way to implement the goal to "choose the integers l and u symmetrically (or nearly symmetrically) around p*(n+1) and as close together as possible." For example, an alternative is to test at each step whether decrementing l or incrementing u (and leaving the other unchanged) would satisfy Eqn 5.1; if so, that would lead to values that are closer together than changing both bounds in lockstep (see the R sketch below). I don't know what PROC UNIVARIATE is doing, but I suspect it is doing something more complicated, given that your program gives different answers.
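Here is a minimal R sketch of that alternative search (my guess at one reasonable implementation, not a claim about what SAS actually does; there is no bounds checking, so a real version would keep 1 <= l < u <= n):

# Greedy alternative: widen one side at a time when that alone reaches the
# target coverage; otherwise widen both sides in lockstep.
find_ci_ranks <- function(n, p, alpha = 0.05) {
  coverage <- function(l, u) pbinom(u - 1, n, p) - pbinom(l - 1, n, p)
  l <- floor(p * (n + 1))
  u <- ceiling(p * (n + 1))
  if (l == u) u <- u + 1  # Eqn 5.1 needs l < u; whether to move l or u here is the ambiguity noted above
  while (coverage(l, u) < 1 - alpha) {
    if (coverage(l - 1, u) >= 1 - alpha) {
      l <- l - 1  # widening the lower side alone suffices
    } else if (coverage(l, u + 1) >= 1 - alpha) {
      u <- u + 1  # widening the upper side alone suffices
    } else {
      l <- l - 1  # otherwise widen both sides
      u <- u + 1
    }
  }
  c(l = l, u = u)
}
find_ci_ranks(2568, 0.9)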
Incidentally, the SAS interval (2281, 2341) has a smaller coverage probability than the interval (2282, 2342) that your program produces, which might indicate that UNIVARIATE is using a more sophisticated decrement/increment algorithm. Here is a PROC IML check of the two coverage probabilities:
proc iml;
n = 2568;
p = 0.9;
mid = p*(n+1);
print mid;
l = 2281; u = 2341;
dQSAS = cdf("binomial", u-1, p, n) - cdf("binomial", l-1, p, n);
print l u dQSAS;
l = 2282; u = 2342;
dQAlt = cdf("binomial", u-1, p, n) - cdf("binomial", l-1, p, n);
print l u dQAlt;
Unfortunately, I will be out of the office for the next week, but please post any progress that you make. It is an interesting question.
Thank you Rick for your detailed response. Here are some of my thoughts on the difference identified:
1. You mentioned that the choice of the initial value for the lower and upper index may differ. That's right. I did start with Hahn and Meeker's guidance, choosing l=floor(p*(n+1)) and u=ceil(p*(n+1)), but later I saw the SAS theory page for calculating percentiles under The UNIVARIATE Procedure (http://support.sas.com/documentation/cdl/en/procstat/67528/HTML/default/viewer.htm#procstat_univaria...), which states that
The lower rank $l$ and upper rank $u$ are integers that are symmetric (or nearly symmetric) around $\lfloor np \rfloor + 1$, where $\lfloor np \rfloor$ is the integer part of $np$ and $n$ is the sample size. Furthermore, $l$ and $u$ are chosen so that $X_{(l)}$ and $X_{(u)}$ are as close to $X_{(\lfloor np \rfloor + 1)}$ as possible while satisfying the coverage probability requirement $Q(u-1; n, p) - Q(l-1; n, p) \ge 1 - \alpha$.
To see which pair a rule like that would pick, I tested the candidate pairs around (2282, 2342):

l <- 2282
u <- 2342
alpha <- 0.05
percentile <- 0.9
n <- 2568
library(tidyverse)
tab <- tibble("l_candidate" = c(l, l+1, l-1, l, l, l+1, l+1),
              "u_candidate" = c(u, u+1, u-1, u+1, u-1, u, u)) %>%
  mutate(prob = pbinom(q = u_candidate - 1, size = n, prob = percentile) -
                pbinom(q = l_candidate - 1, size = n, prob = percentile),
         logic = prob >= 1 - alpha) %>%
  filter(logic) %>%
  filter(prob == min(prob))
tab
# A tibble: 1 x 4
l_candidate u_candidate prob logic
<dbl> <dbl> <dbl> <lgl>
1 2283 2343 0.951 TRUE
I am sure that SAS's algorithm did something different from what is stated in the theory page.
As I pointed out in my previous response, the descriptions in the documentation and in Hahn/Meeker are both vague. There are several algorithms that would be consistent with the text.
I cannot provide any insight regarding the "hidden implementation," but if I think about it further, I will let you know.
Out of curiosity, what are you trying to accomplish? Is this a school project? A research paper?
Thank you so much for your input. I am trying to create an R function that calculates CIs for percentiles, and I want to make sure I get the correct results, so the SAS UNIVARIATE procedure was chosen as the gold standard.
Hi! I am looking into this problem now. Have you found out anything since the discussion in 2019?
I am trying to find the initial parameters LLIM and ULIM for the method I describe in
Fast and Accurate Calculation of Descriptive Statistics of Very Large Sets of Data
Many thanks in advance!
/Br AndersS