discrepancies in distribution-free confidence interval for percentiles


Posted 02-11-2019 03:28 PM
(2165 views)

I followed the exact methodology from the SAS 9.4 Procedures Guide: Statistical Procedures, Third Edition to calculate confidence intervals for percentiles. However, I have found a few discrepancies when I compared my calculated results with what SAS returned.

For example, I want to calculate the CI for the 90th percentile of the results. The percentile rank is calculated as [np]+1, which gives 2312.

I start with the pair of order statistics (2311, 2313), then (2310, 2314), (2309, 2315), and so on. The optimal pair of order statistics is (2282, 2342), with coverage probability 0.9515646. However, SAS returned the pair of order statistics (2281, 2341). I don't know why this result is produced instead of the symmetric pair (2282, 2342) around 2312. I am using the CIPCTLDF=(lowerpre=LCL upperpre=UCL) option to get the distribution-free confidence interval. The data I used can be found in the attachments. N=2568, percentile=90%, and alpha=0.05.

Method shown in the SAS procedure page:

The two-sided 100(1 − α)% confidence limits for the 100p-th percentile are

Lower limit = X_l, Upper limit = X_u,

where X_j is the jth order statistic when the data values are arranged in increasing order. The lower rank l and upper rank u are integers that are symmetric (or nearly symmetric) around ⌊np⌋ + 1, where ⌊np⌋ is the integer part of np and n is the sample size. Furthermore, l and u are chosen so that X_l and X_u are as close to X_{⌊np⌋+1} as possible while satisfying the coverage probability requirement

Q(u − 1; n, p) − Q(l − 1; n, p) ≥ 1 − α,

where Q(k; n, p) is the cumulative binomial probability

Q(k; n, p) = Σ_{i=0}^{k} (n choose i) p^i (1 − p)^(n−i).
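As a cross-check of the symmetric search described above, here is a small Python sketch (Python is used here purely for illustration; the thread's own code is SAS and R) that expands the pair (⌊np⌋+1−i, ⌊np⌋+1+i) until the binomial coverage requirement is met. For n = 2568, p = 0.9, α = 0.05, it reproduces the symmetric pair (2282, 2342).

```python
from math import lgamma, log, exp

def binom_pmf(k, n, p):
    # binomial pmf computed in log space for numerical stability at large n
    return exp(lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))

def coverage(l, u, n, p):
    # Q(u-1; n, p) - Q(l-1; n, p) = P(l <= X <= u-1) for X ~ Binomial(n, p)
    return sum(binom_pmf(k, n, p) for k in range(l, u))

def symmetric_ci_ranks(n, p, alpha):
    # expand the pair symmetrically around mid = floor(n*p) + 1
    # until the coverage probability requirement is satisfied
    mid = int(n * p) + 1
    for i in range(min(mid, n - mid) + 1):
        l, u = mid - i, mid + i
        cov = coverage(l, u, n, p)
        if cov >= 1 - alpha:
            return l, u, cov
    return None

l, u, cov = symmetric_ci_ranks(2568, 0.9, 0.05)
print(l, u)  # 2282 2342, coverage ≈ 0.9515646
```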

12 REPLIES


Sure. Below is the code I used to calculate the distribution-free confidence interval for the 90th percentile.

```
*** read in csv file;
proc import out=testdata1 datafile="vec.csv" dbms=csv replace;
   getnames=yes;
run;

*** use the CIPCTLDF option to request distribution-free confidence limits for percentiles;
proc univariate data=testdata1 cipctldf;
   var x;
   output out=pctl1 pctlpts=90 pctlpre=p
          cipctldf=(lowerpre=LCL upperpre=UCL);
run;

proc print noobs;
run;
endsas;
```

Results generated are shown below:

```
The SAS System

The UNIVARIATE Procedure
Variable: x

                          Moments
N                      2568    Sum Weights              2568
Mean             237.121456    Sum Observations     608927.9
Std Deviation    2133.64526    Variance           4552442.11
Skewness         16.3191468    Kurtosis           309.929034
Uncorrected SS   1.18305E10    Corrected SS       1.16861E10
Coeff Variation  899.811133    Std Error Mean     42.1041308

           Basic Statistical Measures
    Location               Variability
Mean     237.1215    Std Deviation            2134
Median     5.1000    Variance              4552442
Mode       0.0000    Range                   50274
                     Interquartile Range  13.30000

          Tests for Location: Mu0=0
Test          -Statistic-    -----p Value------
Student's t   t  5.631786    Pr > |t|    <.0001
Sign          M      1259    Pr >= |M|   <.0001
Signed Rank   S   1585711    Pr >= |S|   <.0001

                     Quantiles (Definition 5)
                 95% Confidence Limits   -------Order Statistics-------
Level      Quantile  Distribution Free   LCL Rank  UCL Rank  Coverage
100% Max    50274.3
99%          4739.8    3266.9   7294.7       2532      2552     95.19
95%           335.0     231.2    529.9       2419      2463     95.29
90%            85.3      71.4    102.2       2281      2341     95.15
75% Q3         15.8      13.7     18.5       1883      1970     95.26
50% Median      5.1       4.8      5.4       1235      1335     95.15
25% Q1          2.5       2.4      2.6        599       686     95.26
10%             1.4       1.3      1.5        228       288     95.15
5%              0.8       0.7      0.9        106       150     95.29
1%              0.0       0.0      0.0         17        37     95.19
0% Min          0.0

           Extreme Observations
----Lowest----        -----Highest-----
Value     Obs         Value         Obs
    0    2427         27810.0      1558
    0    2426         35257.3      2338
    0    2425         36921.4      2339
    0    2167         48214.6      2340
    0    2019         50274.3      1666

 p90    LCL90    UCL90
85.3     71.4    102.2
```


Below is an R program to calculate the distribution-free CI for percentiles. It returns a CI of (71.9, 103.6) for the 90th percentile, which corresponds to the pair of order statistics (2282, 2342), while SAS returns a CI of (71.4, 102.2) with the corresponding pair of order statistics (2281, 2341).

```
calc_ci <- function(vec, percentile){
  library(tidyverse)
  vec_order <- sort(vec)   ### sorted in ascending order
  n <- length(vec)
  mid <- floor(n*percentile) + 1
  ### type 5: empirical distribution function with averaging
  if (n*percentile == floor(n*percentile)) perc <- (vec_order[mid-1] + vec_order[mid])/2
  else perc <- vec_order[mid]   # exact version: the percentile is the chosen order statistic
  len <- min(mid, n - mid)
  list <- data.frame(i = 0:len) %>%
    mutate(lower = mid - i, upper = mid + i) %>%
    mutate(prob = pbinom(q = upper - 1, size = n, prob = percentile) -
                  pbinom(q = lower - 1, size = n, prob = percentile),
           logic = prob >= 1 - alpha)
  i_select <- min(list[list$logic, 'i'])
  u <- mid + i_select
  l <- mid - i_select
  ci_per <- data.frame("p" = perc, "lcl" = vec_order[l], "ucl" = vec_order[u])
  return(ci_per)
}

calc_ci(vec, 0.9)
```


I get the following error when I try to run your script:

```
vec = 1:100
calc_ci(vec,0.9)
```

**Error in mutate_impl(.data, dots) : Evaluation error: non-numeric argument to binary operator.**


Sorry, I forgot to set alpha to 0.05.

```
### set alpha to be 0.05
alpha <- 0.05

calc_ci <- function(vec, percentile){
  library(tidyverse)
  vec_order <- sort(vec)   ### sorted in ascending order
  n <- length(vec)
  mid <- floor(n*percentile) + 1
  ### type 5: empirical distribution function with averaging
  if (n*percentile == floor(n*percentile)) perc <- (vec_order[mid-1] + vec_order[mid])/2
  else perc <- vec_order[mid]   # exact version: the percentile is the chosen order statistic
  len <- min(mid, n - mid)
  list <- data.frame(i = 0:len) %>%
    mutate(lower = mid - i, upper = mid + i) %>%
    mutate(prob = pbinom(q = upper - 1, size = n, prob = percentile) -
                  pbinom(q = lower - 1, size = n, prob = percentile),
           logic = prob >= 1 - alpha)
  i_select <- min(list[list$logic, 'i'])
  u <- mid + i_select
  l <- mid - i_select
  ci_per <- data.frame("p" = perc, "lcl" = vec_order[l], "ucl" = vec_order[u])
  return(ci_per)
}

calc_ci(vec, 0.9)
```


Although I do not have the answer to your question, I looked at this last night and have a few comments. I think there are two main issues:

1. The choice of the initial value for the lower and upper index.

2. The method by which the lower index is decremented and the upper index is incremented until the difference in binomial probability exceeds the target coverage probability. See the paragraph that contains Eqn 5.1.

I have attached a copy of the relevant pages from Hahn and Meeker (1991), which is the reference in the SAS doc. The description of the method of determining the interval is fairly vague, using terms like "nearly symmetrical" and "as close together as possible."

Your program uses floor(np)+1 as the initial index to start the incrementing/decrementing process. However, Hahn and Meeker propose using "integers l and u that are closest to p(n+1)." I interpret that as l=floor(p*(n+1)) and u=ceil(p*(n+1)) although it is not clear to me what to do if l=u, which seems to be forbidden by Eqn 5.1. Do you decrement l? Increment u? I don't know.

In your program, you follow Hahn and Meeker's sentence that says "For example, ...you can use l=i^{-} - j and u=i^{+} + j for j=1,2,3..., incrementing j until [Eqn 5.1] is satisfied." But their comment is only one way to implement the goal to "choose the integers l and u symmetrically (or nearly symmetrically) around p*(n+1) and as close together as possible." For example, an alternative is to test at each step whether decrementing l or incrementing u (and leaving the other unchanged) would satisfy Eqn 5.1. If so, that would lead to values that are closer together than changing both bounds in lockstep. I don't know what PROC UNIVARIATE is doing, but I suspect it is doing something more complicated, given that your program gives different answers.

Incidentally, the SAS interval (2281,2341) has a smaller coverage probability than the interval (2282,2342) that your program produces, so I guess that might indicate that UNIVARIATE is using a more sophisticated decrement/increment algorithm.

```
proc iml;
n = 2568;
p = 0.9;
mid = p*(n+1);
print mid;

l = 2281;  u = 2341;
dQSAS = cdf("binomial", u-1, p, n) - cdf("binomial", l-1, p, n);
print l u dQSAS;

l = 2282;  u = 2342;
dQAlt = cdf("binomial", u-1, p, n) - cdf("binomial", l-1, p, n);
print l u dQAlt;
```
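For anyone without SAS/IML at hand, the same comparison of the two pairs can be sketched in plain Python (illustration only, using a log-space binomial pmf instead of a library CDF function):

```python
from math import lgamma, log, exp

def binom_cdf(k, n, p):
    # Q(k; n, p): cumulative binomial probability P(X <= k), X ~ Binomial(n, p)
    def log_pmf(i):
        return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                + i * log(p) + (n - i) * log(1 - p))
    return sum(exp(log_pmf(i)) for i in range(k + 1))

n, p = 2568, 0.9

# coverage of the pair that SAS reports
dQSAS = binom_cdf(2341 - 1, n, p) - binom_cdf(2281 - 1, n, p)

# coverage of the symmetric pair from the R program
dQAlt = binom_cdf(2342 - 1, n, p) - binom_cdf(2282 - 1, n, p)

print(dQSAS, dQAlt)  # dQSAS ≈ 0.9514861, dQAlt ≈ 0.9515646
```

Both pairs satisfy the ≥ 0.95 requirement, and the SAS pair indeed has the slightly smaller coverage, consistent with the IML output.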

Unfortunately, I will be out of the office for the next week, but please post any progress that you make. It is an interesting question.


Thank you Rick for your detailed response. Here are some of my thoughts on the difference identified:

1. You mentioned that the choice of the initial value for the lower and upper index may differ. That's right. I did start with Hahn and Meeker's guidance, choosing l=floor(p*(n+1)) and u=ceil(p*(n+1)), but later on I saw the SAS theory page for calculating percentiles under The UNIVARIATE Procedure (http://support.sas.com/documentation/cdl/en/procstat/67528/HTML/default/viewer.htm#procstat_univaria...), which states that

The lower rank l and upper rank u are integers that are symmetric (or nearly symmetric) around ⌊np⌋ + 1, where ⌊np⌋ is the integer part of np and n is the sample size. Furthermore, l and u are chosen so that X_l and X_u are as close to X_{⌊np⌋+1} as possible while satisfying the coverage probability requirement.

In order to align with what SAS does, I decided to choose floor(np)+1 as the initial index to start the incrementing/decrementing process.

2. In the example data case, yes, the SAS interval (2281, 2341) did have a smaller coverage probability than the interval (2282, 2342) that the R program produces. Here is my puzzle: assuming SAS implemented a more sophisticated algorithm, I don't know why the interval (2283, 2343) is not chosen, since its coverage probability is closest to 0.95 among the candidates.

Pair of order statistics (2282, 2342) with coverage probability 0.9515646 (R function result).

Pair of order statistics (2281, 2341) with coverage probability 0.9514861 (SAS result).

Pair of order statistics (2283, 2343) with coverage probability 0.9506712 (minimal coverage probability among the valid candidates).

Below is how I searched around the original symmetric CI in R:

```
l <- 2282
u <- 2342
alpha <- 0.05
percentile <- 0.9
n <- 2568
library(tidyverse)
tab <- tibble("l_candidate" = c(l, l+1, l-1, l, l, l+1, l+1),
              "u_candidate" = c(u, u+1, u-1, u+1, u-1, u, u)) %>%
  mutate(prob = pbinom(q = u_candidate - 1, size = n, prob = percentile) -
                pbinom(q = l_candidate - 1, size = n, prob = percentile),
         logic = prob >= 1 - alpha) %>%
  filter(logic) %>%
  filter(prob == min(prob))
tab
```

```
# A tibble: 1 x 4
  l_candidate u_candidate  prob logic
        <dbl>       <dbl> <dbl> <lgl>
1        2283        2343 0.951 TRUE
```
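The same candidate search can be replicated outside of R. Here is a Python sketch (illustration only; same seven (l, u) candidates and the same coverage filter as the tibble search above) confirming that, among the candidates meeting the ≥ 0.95 requirement, (2283, 2343) has the smallest coverage:

```python
from math import lgamma, log, exp

def coverage(l, u, n, p):
    # Q(u-1; n, p) - Q(l-1; n, p) = P(l <= X <= u-1), X ~ Binomial(n, p)
    def pmf(k):
        return exp(lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                   + k * log(p) + (n - k) * log(1 - p))
    return sum(pmf(k) for k in range(l, u))

n, p, alpha = 2568, 0.9, 0.05
l, u = 2282, 2342

# the same seven candidate pairs tested in the R code
candidates = [(l, u), (l + 1, u + 1), (l - 1, u - 1), (l, u + 1),
              (l, u - 1), (l + 1, u), (l + 1, u)]

# keep pairs meeting the coverage requirement, then take the smallest coverage
valid = [(lc, uc, coverage(lc, uc, n, p)) for lc, uc in candidates
         if coverage(lc, uc, n, p) >= 1 - alpha]
best = min(valid, key=lambda t: t[2])
print(best[0], best[1])  # 2283 2343
```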

3. I am sure that SAS's algorithm does something different from what is stated in the theory page. It would be greatly appreciated if you could provide any insight into the hidden implementation on the SAS end.


> I am sure that SAS's algorithm did something different from what is stated in the theory page.

As I pointed out in my previous response, the description in the documentation and in Hahn/Meeker are both vague. There are several algorithms that would be consistent with the text.

I cannot provide any insight regarding the "hidden implementation," but if I think about it further, I will let you know.

Out of curiosity, what are you trying to accomplish? Is this a school project? A research paper?


Hi! I am looking into this problem now. Have you found out anything after the discussion in 2019?

I am trying to find initial parameters LLIM and ULIM for the method I describe in

Fast and Accurate Calculation of Descriptive Statistics of Very Large Sets of Data

Many thanks in advance!

/Br AndersS

Anders Sköllermo (Skollermo in English)
