Solved: Calculating bootstrapped 95% CI for 99th percentile of a variable

sophiec · Posted 06-06-2024 11:30 AM

Hello,

I tried using the bootstrapping method described by Rick Wicklin here to calculate the 95% CI around my statistic of interest (the 99th percentile of a variable).

My sample size is about 3200 participants. I am planning to use 2000 replicates. However, when I ran the code (below), it returned CI limits that do not surround the initial 99th percentile. For example, the 99th percentile of the variable was 0.13 and the CI limits generated were 11.6 and 42.9. Any thoughts on what I did wrong here? I added p99 to step 2, which was not in the original example, but is my statistic of interest.

Thanks,

Sophie

proc means data=mydata p99; 
var myvar; 
run; 

%let NumSamples = 2000;       /* number of bootstrap resamples */
/* 1. Generate many bootstrap samples */
proc surveyselect data=mydata NOPRINT seed=1
     out=BootSSFreq(rename=(Replicate=SampleID))
     method=urs              /* resample with replacement */
     samprate=1              /* each bootstrap sample has N observations */
     /*outhits*/ 			 /* OUTHITS option to suppress the frequency var */
     reps=&NumSamples;       /* generate NumSamples bootstrap resamples */
run;

/* 2. Compute the statistic for each bootstrap sample */
proc means data=BootSSFreq p99 noprint;
   by SampleID;
   freq NumberHits;
   var myvar;
   output out=OutStats skew=Skewness;  /* approx sampling distribution */
run;

/* 3. Use approx sampling distribution to make statistical inferences */
proc univariate data=OutStats noprint;
   var Skewness;
   output out=Pctl pctlpre =CI95_
          pctlpts =2.5  97.5       /* compute 95% bootstrap confidence interval */
          pctlname=Lower Upper;
run;
 
proc print data=Pctl noobs; run;

ballardw · Posted 06-06-2024 11:58 AM

Rick's article finds the interval around the SKEWNESS of a variable, and you copied his Proc Means code, which still asks for that same statistic in the OUTPUT statement:

proc means data=BootSSFreq p99 noprint;
   by SampleID;
   freq NumberHits;
   var myvar;
   output out=OutStats skew=Skewness;  /* approx sampling distribution */
run;

I think you want P99= in the OUTPUT statement, and then use that variable in Proc Univariate.

sophiec · Posted 06-06-2024 11:50 AM

I tried updating the code as follows, as I think what I was calculating before was the 95% CI for the skewness of the variable and not the 99th percentile.

If this is correct, can I calculate bootstrapped CIs for subgroups of my data (for example, by sex)?

Thanks,

Sophie


%let NumSamples = 2000;       /* number of bootstrap resamples */
/* 1. Generate many bootstrap samples */
proc surveyselect data=cric NOPRINT seed=1
     out=BootSSFreq(rename=(Replicate=SampleID))
     method=urs              /* resample with replacement */
     samprate=1              /* each bootstrap sample has N observations */
     outhits                 /* OUTHITS writes one row per hit */
     reps=&NumSamples;       /* generate NumSamples bootstrap resamples */
run;

/* 2. Compute the statistic for each bootstrap sample */
proc means data=BootSSFreq p99 noprint;
   by SampleID;
   freq NumberHits;          /* caution: OUTHITS already writes one row per hit, so FREQ NumberHits may over-weight repeated rows */
   var myvar;
   output out=OutStats p99=percentile;  /* approx sampling distribution */
run;

/* 3. Use approx sampling distribution to make statistical inferences */
proc univariate data=OutStats noprint;
   var percentile;
   output out=Pctl pctlpre =CI95_
          pctlpts =2.5  97.5       /* compute 95% bootstrap confidence interval */
          pctlname=Lower Upper;
run;

proc print data=Pctl noobs; run;

ballardw · Posted 06-06-2024 12:05 PM

Use CLASS Varname; in both Proc Means and Proc Univariate to get subgroups. However, depending on the distribution of values in the subgroup data, you may need to change the sample size in Surveyselect so that the subgroup samples are large enough.
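
As a rough sketch of the CLASS suggestion above, the three steps might look like the following. The names mydata, myvar and sex are placeholders rather than anything from the actual data, and OUTHITS is left off so that FREQ NumberHits carries the resampling counts. The NWAY option is my addition: it keeps only the subgroup rows in the Proc Means output.

```sas
/* Sketch only: mydata, myvar and sex are placeholder names */
%let NumSamples = 2000;

/* 1. Resample with replacement; NumberHits records how often each row was drawn */
proc surveyselect data=mydata noprint seed=1
     out=BootSSFreq(rename=(Replicate=SampleID))
     method=urs samprate=1
     reps=&NumSamples;
run;

/* 2. 99th percentile for each bootstrap sample and subgroup */
proc means data=BootSSFreq noprint nway;   /* NWAY drops the overall (_TYPE_=0) rows */
   by SampleID;
   class sex;
   freq NumberHits;
   var myvar;
   output out=OutStats p99=P99;
run;

/* 3. Percentile interval for each subgroup */
proc univariate data=OutStats noprint;
   class sex;
   var P99;
   output out=Pctl pctlpre=CI95_
          pctlpts=2.5 97.5
          pctlname=Lower Upper;
run;

proc print data=Pctl noobs; run;
```

Because the whole data set is resampled, the subgroup sizes vary from replicate to replicate. Resampling within each subgroup (for example with a STRATA statement in Surveyselect on sorted data) is an alternative that this sketch does not show.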

sophiec · Posted 06-06-2024 12:48 PM

Thank you! I realized that shortly after posting and updated the code (see above). However, I'm finding that the calculated 99th percentile is not centered within the 95% CI that it calculates.

For example, see the following 99th percentiles with 95% CI:

overall sample: 102.6 (55.0, 229.0)

subgroup 1: 55.0 (17.0, 617.7)

subgroup 2: 86.0 (36.7, 252.9)

subgroup 3: 187.4 (57.9, 291.8)

Thanks again!

Mike_N · Posted 06-06-2024 03:46 PM

I can't tell you for sure why you are getting that behavior. However, I will point out that, in general, bootstrapping is known to provide poor estimates of the sampling distribution of extreme order statistics. Textbook examples are the minimum and maximum. I suspect that will also be true for the 99th percentile, so I would encourage you to think carefully about whether to even use a bootstrap here.
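
If a non-bootstrap interval is acceptable, one option that may be worth comparing against is the distribution-free confidence limits for quantiles that Proc Univariate can print with the CIPCTLDF option. This is a different technique from the bootstrap discussed in this thread, and the sketch below assumes the placeholder names mydata and myvar.

```sas
/* Sketch only: mydata and myvar are placeholder names */
ods select Quantiles;                     /* show only the quantile table */
proc univariate data=mydata cipctldf;     /* order-statistic based limits, 95% by default */
   var myvar;
run;
```

The Quantiles table should then include a row for the 99% quantile with lower and upper confidence limits, which can be set beside the bootstrap limits as a sanity check.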
