How can I calculate a confidence interval for a binomial distributed proportion without using a dataset?
Is there a comman equivalent to the Stata cii n Y command?
I have the sample size and probability of success, and I need to calculate the CI.
Thank you.
If your sample size is big ,and can use Normal Asymptoic distribution to calculate it . Otherwise , you need calculated the quantitle of binomial distribution.
here is the code from Rick's blog .
http://blogs.sas.com/content/iml/2015/10/28/simulation-exact-tables.html
p = mean(x); /* Probability of success */ se = sqrt(p*(1-p) / (n-1)); /* standard error,n is the sample size */ z = quantile("Normal", 1-alpha/2); /* two-sided,if n is samll, try Binomial */ LowerCL = p - z*se; UpperCL = p + z*se;
If your sample size is big ,and can use Normal Asymptoic distribution to calculate it . Otherwise , you need calculated the quantitle of binomial distribution.
here is the code from Rick's blog .
http://blogs.sas.com/content/iml/2015/10/28/simulation-exact-tables.html
p = mean(x); /* Probability of success */ se = sqrt(p*(1-p) / (n-1)); /* standard error,n is the sample size */ z = quantile("Normal", 1-alpha/2); /* two-sided,if n is samll, try Binomial */ LowerCL = p - z*se; UpperCL = p + z*se;
great, thank you!
You can approximate the solution by creating a dataset that has the properties and then using proc freq.
I've included the code below and for comparisons @Ksharp s code so you can test your results.
Hope this helps.
data have;
n=2000;
p=0.7;
type='X';
X=floor(p*n); output;
type='Y'; x=n-x; output;
run;
data ci;
alpha=0.05;
n=2000;
p=0.7; /* Probability of success */
se = sqrt(p*(1-p) / (n-1)); /* standard error,n is the sample size */
z = quantile("Normal", 1-(alpha/2)); /* two-sided,if n is samll, try Binomial */
LowerCL = p - z*se;
UpperCL = p + z*se;
output;
run;
proc freq data=have;
table type/binomial (p=0.7);
weight x;
run;
thank you. This is what I was considering doing. It seems it might be a bit more straightforward.
It should be noted that the CI computed by the formula suggested is not identical to the similar CI computed by PROC FREQ. The latter is the well-known Wald CI with denominator n in the radicand, whereas the formula copied from @Rick_SAS's blog uses denominator n-1.
The standard error of point estimator p (sample proportion of successes) of binomial parameter P is sqrt(P*(1-P)/n) (Fleiss et al., Statistical Methods for Rates and Proportions, 3rd ed., p. 26). That's where denominator n comes from.
Denominator n-1 could be justified by the fact that n*p*(1-p)/(n-1) is the uniform minimum variance unbiased estimator of P*(1-P) (Lehmann/Casella, Theory of Point Estimation, 2nd ed., p. 100).
Obviously, with denominator n-1 the CI is wider, but only slightly so if n is large.
Just for curiosity I computed coverage probabilities for this CI ("CI2") and compared them to those of the Wald CI ("CI1") and of the Wald CI with continuity correction ("CI3"), p -/+ (z*sqrt(p*(1-p)/n)+1/(2*n)) (Fleiss et al., p. 29; z=probit(0.975)), all for confidence level 0.95. The investigation was restricted to 17<=n<=200 and P=k/10000 with integers k such that nP>=5 and n(1-P)>=5 (criterion from Fleiss et al., p. 26), a total of 1,590,280 pairs (n, P).
Some results:
So, at least for 17<=n<=200 the CI with denominator n-1 has advantages over the standard Wald CI. In many of these cases a continuity correction makes sense.
However, there are CIs for the binomial proportion with even better coverage properties without being overly conservative (as the Clopper-Pearson CI): see Brown et al. (2001), Interval Estimation for a Binomial Proportion, where also some of the above results on CI1 (and many more) can be found.
Below (under the spoiler tag) is part of the code I used for my investigations.
%macro binci(n);
/* Compute the three 95% CIs for all possible outcomes of n Bernoulli samples */
data ci95;
do k=0 to &n;
p0=k/&n;
z=probit(0.975);
d=z*sqrt(p0*(1-p0)/&n);
pL1=p0-d;
pU1=p0+d;
d=d+1/(2*&n);
pL3=p0-d;
pU3=p0+d;
d=z*sqrt(p0*(1-p0)/(&n-1));
pL2=p0-d;
pU2=p0+d;
output;
end;
drop p0 z d;
run;
/* Determine the range of true parameter p (times 10000) in which np>=5 and n(1-p)>=5 */
%let iL=%sysfunc(ceil(50000/&n));
%let iU=%eval(10000-&iL);
/* Calculate coverage probabilities depending on p, incrementing p by 0.0001 */
data tmp;
set ci95;
do i=&iL to &iU;
p=i/10000;
w1=(pL1 <= p <= pU1)*pdf('binom',k,p,&n);
w2=(pL2 <= p <= pU2)*pdf('binom',k,p,&n);
w3=(pL3 <= p <= pU3)*pdf('binom',k,p,&n);
output;
end;
drop i;
run;
proc summary data=tmp nway;
class p;
var w1-w3;
output out=covprob sum=;
run;
/* Compute descriptive statistics and create plot for coverage probabilities */
proc means data=covprob mean std min q1 median q3 max;
var w1-w3;
run;
goptions reset=all;
symbol1 v=x c=red h=0.9;
symbol2 v=plus c=orange h=0.6;
symbol3 v=dot c=green h=0.3;
axis1 order=(0.85 to 1 by 0.005)
label=(angle=90 'Coverage probability');
legend1 label=('CI') value=('Wald' 'Wald, but with denominator n-1' 'Wald with continuity correction')
across=1 mode=protect position=(top right inside) shape=symbol(3, .9);
title "Coverage Probabilities of 95% Confidence Intervals for Binomial Parameter p when n=&n, np>=5 and n(1-p)>=5";
proc gplot data=covprob;
plot w1*p=1 w2*p=2 w3*p=3 / overlay vaxis=axis1 vref=0.95 legend=legend1;
run;
quit;
title;
%mend binci;
%binci(17) /* n= 17, p=0.2942, 0.2943, 0.2944, ..., 0.7057, 0.7058 */
%binci(25) /* n= 25, p=0.2, 0.2001, 0.2002, ..., 0.7999, 0.8 */
%binci(50) /* n= 50, p=0.1, 0.1001, 0.1002, ..., 0.8999, 0.9 */
%binci(100) /* n=100, p=0.05, 0.0501, 0.0502, ..., 0.9499, 0.95 */
@FreelanceReinh and @Ksharp Sorry about that n-1 in the denominator. I suspect I copied and pasted that code from some earlier program that computed CIs for a different distribution. I intended to use the same formula that PROC FREQ uses. I will have to edit the blog post to correct that error. I suggest that the OP use the (correct!) Wald formula.
By the way, I wrote a blog post about the normal approximation to the binomial distribution, which is the basis for the CI formula. The experts might enjoy reading it.
After further thought, I no longer think there was an error in my original blog post. My blog post was about estimating a p-value for a Monte Carlo simulation. As shown in this PROC FREQ documentation, the standard error for the Monte Carlo estimate of a binomial proportion has an N-1 in the denominator. It hardly matters when you are doing 10,000 Monte Carlo simulations, but N-1 is the correct denominator. (The OP for this problem should still use N, since that is correct for the SE for the usual estimator p_hat = k/N.
thank you, everoyone. it's good to know there are options.
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.