About swimmingfish4

swimmingfish4 · ‎02-26-2019

Thank you so much for your input. I am trying to create an R function that is able to calculate the CI of percentiles and I wanna make sure I get the correct results. So SAS univariate procedure is chosen to be the gold standard.

swimmingfish4 · ‎02-26-2019

Thank you Rick for your detailed response. Here are some of my thoughts on the difference identified: 1. You mentioned the choice of the intiial value for the lower and upper index may differ. That's right. I did started with Hahn and Meeker's guidance choosing l=floor(p*(n+1)) and u=ceil(p*(n+1)) but later on I saw the SAS theory page for calculating percentiles under The Univariate Procedure (http://support.sas.com/documentation/cdl/en/procstat/67528/HTML/default/viewer.htm#procstat_univariate_details14.htm), which states that The lower rank l and upper rank u are integers that are symmetric (or nearly symmetric) around , where is the integer part of and n is the sample size. Furthermore,landuare chosen so that and are as close to as possible while satisfying the coverage probability requirement, In order to align with what SAS does, I decided to choose floor(np)+1 as the initial index to start the incrementing/decrementing process. 2. In the example data case, yes, the SAS interval (2281,2341) did have a smaller coverage probability than the interval (2282,2342) that R program produces. Here is my puzzle. Assuming SAS implemented a more sophisticated algorithm then I don't know why the interval of (2283,2343) is not chosen since it coverage probability is closest to 0.95 compared to the other. pair of order statistics (2282,2342) with coverage probability of 0.9515646 (R function result) pair of order statistics (2281,2341) with coverage probability of 0.9514861 (SAS result). pair of order statistics (2283,2343) with coverage probability of 0.9506712 (optimal minimal coverage probability). Below is how I searched around the original symmetric CI in R: l<- 2282 u<- 2342 alpha<- 0.05 percentile<- 0.9 n<- 2568 library(tidyverse) tab<- tibble("l_candidate"=c(l,l+1,l-1,l,l,l+1,l+1),"u_candidate"=c(u,u+1,u-1,u+1,u-1,u,u)) %>% mutate(prob=pbinom(q = u_candidate-1,size = n, prob = percentile) - pbinom(q = l_candidate-1,size = n, prob = percentile), logic=prob>=1-alpha) %>% filter(logic) %>% filter(prob==min(prob)) tab # A tibble: 1 x 4 l_candidate u_candidate prob logic <dbl> <dbl> <dbl> <lgl> 1 2283 2343 0.951 TRUE 3. I am sure that SAS's algorithm did something different from what is stated in the theory page. It would be greatly appreciated if you can provide any insight regarding the hidden implementation from SAS end.

swimmingfish4 · ‎02-12-2019

sorry, I forgot to set alpha to be 0.05. ###set alpha to be 0.05 alpha<- 0.05 calc_ci<- function(vec,percentile){ library(tidyverse) vec_order<- sort(vec) ###sorted in ascending order n<- length(vec) mid<- floor(n*percentile)+1 ###type 5: empirical distribution function with averaging if (n*percentile==floor(n*percentile)) perc<- (vec_order[mid-1]+vec_order[mid])/2 else perc<- vec_order[mid] #exact version, percentile are calculated by choosing the order statistic len<- min(mid,n-mid) list<- data.frame(i=0:len) %>% mutate(lower=mid-i,upper=mid+i) %>% mutate(prob=pbinom(q = upper-1,size = n, prob = percentile) - pbinom(q = lower-1,size = n, prob = percentile), logic=prob>=1-alpha) i_select<- min(list[list$logic,'i']) u<- mid + i_select l<- mid - i_select ci_per<- data.frame("p"=perc,"lcl"=vec_order[l],"ucl"=vec_order[u]) return(ci_per) } calc_ci(vec,0.9)

swimmingfish4 · ‎02-12-2019

Below is an R program to calculate distribution-free CI for percentiles. It returns an CI of (71.9,103.6), which corresponds to the pair of order statistics (2282,2342), for 90th percentile. While SAS returns an CI of (71.4,102.2) with corresponding pair of order statistics of (2281,2341). calc_ci<- function(vec,percentile){ library(tidyverse) vec_order<- sort(vec) ###sorted in ascending order n<- length(vec) mid<- floor(n*percentile)+1 ###type 5: empirical distribution function with averaging if (n*percentile==floor(n*percentile)) perc<- (vec_order[mid-1]+vec_order[mid])/2 else perc<- vec_order[mid] #exact version, percentile are calculated by choosing the order statistic len<- min(mid,n-mid) list<- data.frame(i=0:len) %>% mutate(lower=mid-i,upper=mid+i) %>% mutate(prob=pbinom(q = upper-1,size = n, prob = percentile) - pbinom(q = lower-1,size = n, prob = percentile), logic=prob>=1-alpha) i_select<- min(list[list$logic,'i']) u<- mid + i_select l<- mid - i_select ci_per<- data.frame("p"=perc,"lcl"=vec_order[l],"ucl"=vec_order[u]) return(ci_per) } calc_ci(vec,0.9)

swimmingfish4 · ‎02-11-2019

Sure. Below is the code part I used to calculate the distribution-free confidence interval for 90%tile. ***read in csv file; proc import out=testdata1 datafile="vec.csv" dbms=csv replace; getnames=yes; run; ***use the CIPCTLDF option to request distribution-free confidence limits for percentiles; proc univariate data=testdata1 cipctldf; var x; output out=pctl1 pctlpts=90 pctlpre=p cipctldf=(lowerpre=LCL upperpre=UCL); run; proc print noobs;run;endsas; Results generated are shown below: The SAS System The UNIVARIATE Procedure Variable: x Moments N 2568 Sum Weights 2568 Mean 237.121456 Sum Observations 608927.9 Std Deviation 2133.64526 Variance 4552442.11 Skewness 16.3191468 Kurtosis 309.929034 Uncorrected SS 1.18305E10 Corrected SS 1.16861E10 Coeff Variation 899.811133 Std Error Mean 42.1041308 Basic Statistical Measures Location Variability Mean 237.1215 Std Deviation 2134 Median 5.1000 Variance 4552442 Mode 0.0000 Range 50274 Interquartile Range 13.30000 Tests for Location: Mu0=0 Test -Statistic- -----p Value------ Student's t t 5.631786 Pr > |t| <.0001 Sign M 1259 Pr >= |M| <.0001 Signed Rank S 1585711 Pr >= |S| <.0001 Quantiles (Definition 5) 95% Confidence Limits -------Order Statistics------- Level Quantile Distribution Free LCL Rank UCL Rank Coverage 100% Max 50274.3 99% 4739.8 3266.9 7294.7 2532 2552 95.19 95% 335.0 231.2 529.9 2419 2463 95.29 90% 85.3 71.4 102.2 2281 2341 95.15 75% Q3 15.8 13.7 18.5 1883 1970 95.26 50% Median 5.1 4.8 5.4 1235 1335 95.15 25% Q1 2.5 2.4 2.6 599 686 95.26 10% 1.4 1.3 1.5 228 288 95.15 5% 0.8 0.7 0.9 106 150 95.29 1% 0.0 0.0 0.0 17 37 95.19 0% Min 0.0 The SAS System The UNIVARIATE Procedure Variable: x Extreme Observations ----Lowest---- -----Highest----- Value Obs Value Obs 0 2427 27810.0 1558 0 2426 35257.3 2338 0 2425 36921.4 2339 0 2167 48214.6 2340 0 2019 50274.3 1666 The SAS System p90 LCL90 UCL90 85.3 71.4 102.2

swimmingfish4 · ‎02-11-2019

I followed the exact methodology from SAS 9.4 procedures guide: statistical procedures, third edition to calculate confidence interval for percentiles. However I have found a few discrepancies when I compared my calculated resutls with what was returned from SAS. For example, I want to caculate the CI for 90%tile of the results. The percentile calculated using [np]+1 and it returned 2312. I start with the par of order statistics: (2311,2313) and then (2310,2314), (2309,2315) and so on. The optimal pair of statistics is (2282,2342) with coverage probability of 0.9515646. However SAS returned the pair of the order statistics is (2281,2341). I don't know why this result is procuded instead of the symmetric pair of (2282,2342) around 2312. I am using CIPCTLDF=(lowerpre=LCL upperpre=UCL) options to get the distrbution-free confidence interval. The data I used can be found in the attachments. N=2568 and percentile=90% and alpha=0.05. Method shown in the SAS procedure page: The two-sided confidence limits for the th percentile are where is the jth order statistic when the data values are arranged in increasing order: The lower rank l and upper rank u are integers that are symmetric (or nearly symmetric) around , where is the integer part of and nis the sample size. Furthermore, l and u are chosen so that and are as close to as possible while satisfying the coverage probability requirement, where is the cumulative binomial probability

Online Status	Offline
Date Last Visited	‎04-18-2019 03:16 PM

Re: discrepancies in distribution-free confidence interval for percent...

Re: discrepancies in distribution-free confidence interval for percent...

Re: discrepancies in distribution-free confidence interval for percent...

Re: discrepancies in distribution-free confidence interval for percent...

Re: discrepancies in distribution-free confidence interval for percent...

discrepancies in distribution-free confidence interval for percentiles

Re: discrepancies in distribution-free confidence interval for percent...

Re: discrepancies in distribution-free confidence interval for percent...

discrepancies in distribution-free confidence interval for percentiles

Re: discrepancies in distribution-free confidence interval for percent...

Re: discrepancies in distribution-free confidence interval for percent...

Re: discrepancies in distribution-free confidence interval for percent...

Re: discrepancies in distribution-free confidence interval for percent...

Re: discrepancies in distribution-free confidence interval for percent...

discrepancies in distribution-free confidence interval for percentiles