Hello,
I tried using the bootstrapping method described by Rick Wicklin here to calculate the 95% CI around my statistic of interest (the 99th percentile of a variable).
My sample size is about 3200 participants. I am planning to use 2000 replicates. However, when I ran the code (below), it returned a CI that does not contain the initial 99th percentile. For example, the 99th percentile of the variable was 0.13 and the CI limits generated were 11.6 and 42.9. Any thoughts on what I did wrong here? I added p99 to step 2, which was not in the original example, but is my statistic of interest.
Thanks,
Sophie
proc means data=mydata p99;
var myvar;
run;
%let NumSamples = 2000; /* number of bootstrap resamples */
/* 1. Generate many bootstrap samples */
proc surveyselect data=mydata NOPRINT seed=1
out=BootSSFreq(rename=(Replicate=SampleID))
method=urs /* resample with replacement */
samprate=1 /* each bootstrap sample has N observations */
/*outhits*/ /* OUTHITS option to suppress the frequency var */
reps=&NumSamples; /* generate NumSamples bootstrap resamples */
run;
/* 2. Compute the statistic for each bootstrap sample */
proc means data=BootSSFreq p99 noprint;
by SampleID;
freq NumberHits;
var myvar;
output out=OutStats skew=Skewness; /* approx sampling distribution */
run;
/* 3. Use approx sampling distribution to make statistical inferences */
proc univariate data=OutStats noprint;
var Skewness;
output out=Pctl pctlpre =CI95_
pctlpts =2.5 97.5 /* compute 95% bootstrap confidence interval */
pctlname=Lower Upper;
run;
proc print data=Pctl noobs; run;
Rick's article computes the interval around the SKEWNESS of a variable, and you copied his PROC MEANS code, which requests that same statistic:
proc means data=BootSSFreq p99 noprint;
by SampleID;
freq NumberHits;
var myvar;
output out=OutStats skew=Skewness; /* approx sampling distribution */
run;
I think you want P99= in the OUTPUT statement, and then use that variable in PROC UNIVARIATE.
I tried updating the code as follows, as I think what I was calculating before was the 95% CI for the skewness of the variable and not the 99th percentile.
If this is correct, can I calculate bootstrapped CIs for subgroups of my data (for example, by sex)?
Thanks,
Sophie
%let NumSamples = 2000; /* number of bootstrap resamples */
/* 1. Generate many bootstrap samples */
proc surveyselect data=cric NOPRINT seed=1
out=BootSSFreq(rename=(Replicate=SampleID))
method=urs /* resample with replacement */
samprate=1 /* each bootstrap sample has N observations */
/*outhits*/ /* leave OUTHITS off so that FREQ NumberHits in step 2 counts each row once */
reps=&NumSamples; /* generate NumSamples bootstrap resamples */
run;
/* 2. Compute the statistic for each bootstrap sample */
proc means data=BootSSFreq p99 noprint;
by SampleID;
freq NumberHits;
var myvar;
output out=OutStats p99=percentile; /* approx sampling distribution */
run;
/* 3. Use approx sampling distribution to make statistical inferences */
proc univariate data=OutStats noprint;
var percentile;
output out=Pctl pctlpre =CI95_
pctlpts =2.5 97.5 /* compute 95% bootstrap confidence interval */
pctlname=Lower Upper;
run;
proc print data=Pctl noobs; run;
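Because step 2 weights each row by NumberHits, it can be worth confirming that the resampled data set has the layout that FREQ statement expects: one row per distinct selected observation, with NumberHits carrying the count. The following is a minimal sanity-check sketch, not part of the original article; the data set name Check and the variables Rows and TotalHits are hypothetical names introduced here.

```sas
/* Sanity check (sketch): summarize NumberHits within each bootstrap sample. */
/* Check, Rows, and TotalHits are hypothetical names used for illustration.  */
proc means data=BootSSFreq noprint;
   by SampleID;
   var NumberHits;
   output out=Check n=Rows sum=TotalHits;
run;

/* Look at the first few replicates */
proc print data=Check(obs=5) noobs;
   var SampleID Rows TotalHits;
run;
```

With METHOD=URS and SAMPRATE=1, TotalHits should equal the number of observations in the input data for every replicate, and Rows should be smaller than that. If Rows already equals the input sample size, the rows have been expanded (as OUTHITS does), and a FREQ NumberHits statement would then weight repeated observations more than once.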
Use CLASS Varname; in both PROC MEANS and PROC UNIVARIATE to get subgroups. However, depending on the distribution of values within each subgroup, you may need to change the sample size in SURVEYSELECT so that each subgroup's samples are large enough.
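The CLASS suggestion can be sketched as follows. This is an illustration under assumptions, not tested against the poster's data: sex is a hypothetical subgroup variable in mydata, and Pctl99 is a hypothetical name for the output statistic. NWAY keeps only the per-subgroup rows in OutStats, so the overall row does not get mixed into the subgroup intervals.

```sas
%let NumSamples = 2000;           /* number of bootstrap resamples */

/* 1. Resample with replacement; NumberHits records how often each row was drawn */
proc surveyselect data=mydata noprint seed=1
     out=BootSSFreq(rename=(Replicate=SampleID))
     method=urs samprate=1
     reps=&NumSamples;
run;

/* 2. 99th percentile per bootstrap sample and per subgroup */
/*    (sex is a hypothetical subgroup variable)             */
proc means data=BootSSFreq noprint nway;
   by SampleID;
   class sex;
   freq NumberHits;
   var myvar;
   output out=OutStats p99=Pctl99;
run;

/* 3. Percentile-based 95% interval within each subgroup */
proc univariate data=OutStats noprint;
   class sex;
   var Pctl99;
   output out=Pctl pctlpre=CI95_
          pctlpts=2.5 97.5
          pctlname=Lower Upper;
run;

proc print data=Pctl noobs; run;
```

Note that this resamples the whole data set, so the number of observations from each subgroup varies from one replicate to the next. With about 3200 participants in total, the 99th percentile of a subgroup may rest on only a handful of observations in the upper tail.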
Thank you! I realized that shortly after posting and updated the code (see above). However, I'm finding that the calculated 99th percentile is not centered within the 95% CI that it calculates.
For example, see the following 99th percentiles with 95% CI:
overall sample: 102.6 (55.0, 229.0)
subgroup 1: 55.0 (17.0, 617.7)
subgroup 2: 86.0 (36.7, 252.9)
subgroup 3: 187.4 (57.9, 291.8)
Thanks again!
I can't tell you for sure why you are getting that behavior. However, I will point out that, in general, bootstrapping is known to provide poor estimates of the sampling distribution of extreme order statistics. Textbook examples are the minimum and maximum. I suspect that will also be true for the 99th percentile, so I would encourage you to think carefully about whether to even use a bootstrap here.
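One alternative worth considering, offered here as a sketch rather than something the thread settled on, is the distribution-free confidence limit for quantiles that PROC UNIVARIATE can compute directly from order statistics with the CIPCTLDF option. It uses the original data set and no resampling; mydata and myvar are the names from the question.

```sas
/* Distribution-free 95% confidence limits for the standard quantiles, */
/* including the 99th percentile, based on order statistics            */
proc univariate data=mydata cipctldf(alpha=0.05);
   var myvar;
   ods select Quantiles;   /* show only the quantiles table */
run;
```

Intervals for an upper percentile of a right-skewed variable are usually asymmetric, so an estimate that sits closer to one limit than the other is not by itself a sign of a coding error.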