🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jlajla
Obsidian | Level 7

How can I calculate a confidence interval for a binomial distributed proportion without using a dataset?

 

Is there a command equivalent to the Stata cii n Y command?

 

I have the sample size and probability of success, and I need to calculate the CI.

 

Thank you.

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

If your sample size is big, you can use the asymptotic normal distribution to calculate it. Otherwise, you need to calculate the quantile of the binomial distribution.

Here is the code from Rick's blog:

http://blogs.sas.com/content/iml/2015/10/28/simulation-exact-tables.html

 

p = mean(x);                        /* probability of success */
se = sqrt(p*(1-p) / (n-1));         /* standard error; n is the sample size */
z = quantile("Normal", 1-alpha/2);  /* two-sided; if n is small, try the binomial quantile */
LowerCL = p - z*se;
UpperCL = p + z*se;

 


8 REPLIES
Ksharp
Super User

If your sample size is big, you can use the asymptotic normal distribution to calculate it. Otherwise, you need to calculate the quantile of the binomial distribution.

Here is the code from Rick's blog:

http://blogs.sas.com/content/iml/2015/10/28/simulation-exact-tables.html

 

p = mean(x);                        /* probability of success */
se = sqrt(p*(1-p) / (n-1));         /* standard error; n is the sample size */
z = quantile("Normal", 1-alpha/2);  /* two-sided; if n is small, try the binomial quantile */
LowerCL = p - z*se;
UpperCL = p + z*se;

 

jlajla
Obsidian | Level 7

Great, thank you!

Reeza
Super User

You can approximate the solution by creating a dataset that has the properties and then using proc freq. 

 

I've included the code below and, for comparison, @Ksharp's code so you can test your results.

 

Hope this helps.

 

data have;
n=2000;
p=0.7;
type='X';
X=floor(p*n); output;
type='Y'; x=n-x; output;

run;

data ci;
alpha=0.05;
 n=2000;
 p=0.7;                          /* probability of success */
   se = sqrt(p*(1-p) / (n-1));   /* standard error; n is the sample size */
   z = quantile("Normal", 1-(alpha/2)); /* two-sided; if n is small, try the binomial quantile */
   LowerCL = p - z*se;
   UpperCL = p + z*se;
  output;
 run;

proc freq data=have;
table type/binomial (p=0.7);
weight x;
run;
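Depending on your SAS release, PROC FREQ can also request alternative binomial confidence intervals directly through the CL= suboption of BINOMIAL; the keyword list below (WILSON, EXACT) is an assumption based on the SAS 9.4 documentation, so check the PROC FREQ documentation for your version before relying on it:

```sas
/* Sketch: request the Wilson (score) and exact Clopper-Pearson CIs
   in addition to the default Wald interval. CL= keyword support
   varies by SAS release. */
proc freq data=have;
   tables type / binomial (p=0.7 CL=(WILSON EXACT));
   weight x;
run;
```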
 
jlajla
Obsidian | Level 7

Thank you. This is what I was considering doing. It seems this might be a bit more straightforward.

FreelanceReinh
Jade | Level 19

It should be noted that the CI computed by the formula suggested is not identical to the similar CI computed by PROC FREQ. The latter is the well-known Wald CI with denominator n in the radicand, whereas the formula copied from @Rick_SAS's blog uses denominator n-1.

 

The standard error of point estimator p (sample proportion of successes) of binomial parameter P is sqrt(P*(1-P)/n) (Fleiss et al., Statistical Methods for Rates and Proportions, 3rd ed., p. 26). That's where denominator n comes from.

 

Denominator n-1 could be justified by the fact that n*p*(1-p)/(n-1) is the uniform minimum variance unbiased estimator of P*(1-P) (Lehmann/Casella, Theory of Point Estimation, 2nd ed., p. 100).

 

Obviously, with denominator n-1 the CI is wider, but only slightly so if n is large.

 

Just out of curiosity, I computed coverage probabilities for this CI ("CI2") and compared them to those of the Wald CI ("CI1") and of the Wald CI with continuity correction ("CI3"), p -/+ (z*sqrt(p*(1-p)/n)+1/(2*n)) (Fleiss et al., p. 29; z=probit(0.975)), all for confidence level 0.95. The investigation was restricted to 17<=n<=200 and P=k/10000 with integers k such that nP>=5 and n(1-P)>=5 (criterion from Fleiss et al., p. 26), a total of 1,590,280 pairs (n, P).

 

Some results:

  • Among the investigated 184 values of n, the proportion of cases (i.e. values of p) with coverage probabilities (CP) >=0.95 varied between 36.6% and 84.1% for CI3 (>=50% for each n>=24), but reached only 0.3 - 24.5% for CI1 and 4.5 - 28.1% for CI2.
  • Ranges of minimum coverage probabilities were [0.873, 0.894] for CI1, [0.874, 0.899] for CI2 and still only [0.896, 0.914] for CI3, in spite of the restrictions nP>=5 and n(1-P)>=5.
  • CP for CI1 and CI2 were equal in 83.6 - 94.8% of cases ("CP1=CP2").
  • For each n (14, 15, ..., 200) there were >1.5% cases with CP1 < CP2 <= 0.95. For some n, e.g. n=34, this proportion exceeded 10%.
  • The proportion of cases with CP1 < 0.95 < CP2 ranged from 1.9% to 11.1%. In most (>85%) of these cases CP2 was closer to 0.95 than CP1.
  • Not a single case was observed in which 0.95 <= CP1 < CP2.
  • In many cases (26.8 - 55.9%) CP2 was equal to CP3.
  • Proportions of cases with CP2 < 0.95 < CP3 and CP3 closer to 0.95 had a similar range: 22.4 - 50.7%.
  • The range of proportions of cases with 0.95 < CP2 < CP3 was only 0 - 8.0%.
  • For only 16 values of n the proportion of cases with either CP1 or CP2 in ]0.94, 0.96[ and CP3 outside this interval was greater than the proportion of cases showing the opposite behavior.
  • Overall, the proportions of cases with CP in ]0.94, 0.96[ were 61.3%, 65.8% and 71.6%, respectively (CP1, CP2, CP3).

 

So, at least for 17<=n<=200 the CI with denominator n-1 has advantages over the standard Wald CI. In many of these cases a continuity correction makes sense.

 

However, there are CIs for the binomial proportion with even better coverage properties that are not overly conservative (as the Clopper-Pearson CI is): see Brown et al. (2001), Interval Estimation for a Binomial Proportion, where some of the above results on CI1 (and many more) can also be found.
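One of the intervals recommended by Brown et al. is the Wilson (score) interval, which can be computed in a plain DATA step. The sketch below is a minimal illustration, not tested against PROC FREQ output; the values n=2000 and k=1400 are hypothetical, chosen to match the n=2000, p=0.7 example earlier in this thread:

```sas
/* Sketch: 95% Wilson (score) CI for a binomial proportion.
   Inputs n (sample size) and k (number of successes) are
   placeholders for the poster's own values. */
data wilson;
  n = 2000; k = 1400; alpha = 0.05;
  phat  = k / n;                          /* sample proportion */
  z     = quantile("Normal", 1-alpha/2);  /* two-sided critical value */
  denom = 1 + z**2/n;
  center    = (phat + z**2/(2*n)) / denom;
  halfwidth = z * sqrt(phat*(1-phat)/n + z**2/(4*n**2)) / denom;
  LowerCL = center - halfwidth;
  UpperCL = center + halfwidth;
run;

proc print data=wilson;
  var phat LowerCL UpperCL;
run;
```

Unlike the Wald interval, the Wilson interval is never degenerate at phat=0 or phat=1 and its limits always stay within [0, 1].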

 

Below (under the spoiler tag) is part of the code I used for my investigations.

Spoiler
%macro binci(n);

/* Compute the three 95% CIs for all possible outcomes of n Bernoulli samples */

data ci95;
do k=0 to &n;
  p0=k/&n;
  z=probit(0.975);
  d=z*sqrt(p0*(1-p0)/&n);
  pL1=p0-d;
  pU1=p0+d;
  d=d+1/(2*&n);
  pL3=p0-d;
  pU3=p0+d;
  d=z*sqrt(p0*(1-p0)/(&n-1));
  pL2=p0-d;
  pU2=p0+d;
  output;
end;
drop p0 z d;
run;

/* Determine the range of true parameter p (times 10000) in which np>=5 and n(1-p)>=5 */

%let iL=%sysfunc(ceil(50000/&n));
%let iU=%eval(10000-&iL);

/* Calculate coverage probabilities depending on p, incrementing p by 0.0001 */

data tmp;
set ci95;
do i=&iL to &iU;
  p=i/10000;
  w1=(pL1 <= p <= pU1)*pdf('binom',k,p,&n);
  w2=(pL2 <= p <= pU2)*pdf('binom',k,p,&n);
  w3=(pL3 <= p <= pU3)*pdf('binom',k,p,&n);
  output;
end;
drop i;
run;

proc summary data=tmp nway;
class p;
var w1-w3;
output out=covprob sum=;
run;

/* Compute descriptive statistics and create plot for coverage probabilities */

proc means data=covprob mean std min q1 median q3 max;
var w1-w3;
run;

goptions reset=all;

symbol1 v=x    c=red    h=0.9;
symbol2 v=plus c=orange h=0.6;
symbol3 v=dot  c=green  h=0.3;
axis1 order=(0.85 to 1 by 0.005)
      label=(angle=90 'Coverage probability');

legend1 label=('CI') value=('Wald' 'Wald, but with denominator n-1' 'Wald with continuity correction')
        across=1 mode=protect position=(top right inside) shape=symbol(3, .9);

title "Coverage Probabilities of 95% Confidence Intervals for Binomial Parameter p when n=&n, np>=5 and n(1-p)>=5";

proc gplot data=covprob;
plot w1*p=1 w2*p=2 w3*p=3 / overlay vaxis=axis1 vref=0.95 legend=legend1;
run;
quit;

title;

%mend binci;

%binci(17)  /* n= 17, p=0.2942, 0.2943, 0.2944, ..., 0.7057, 0.7058 */
%binci(25)  /* n= 25, p=0.2,    0.2001, 0.2002, ..., 0.7999, 0.8    */
%binci(50)  /* n= 50, p=0.1,    0.1001, 0.1002, ..., 0.8999, 0.9    */
%binci(100) /* n=100, p=0.05,   0.0501, 0.0502, ..., 0.9499, 0.95   */
Rick_SAS
SAS Super FREQ

@FreelanceReinh and @Ksharp Sorry about that n-1 in the denominator.  I suspect I copied and pasted that code from some earlier program that computed CIs for a different distribution.  I intended to use the same formula that PROC FREQ uses. I will have to edit the blog post to correct that error.  I suggest that the OP use the (correct!) Wald formula.

 

By the way, I wrote a blog post about the normal approximation to the binomial distribution, which is the basis for the CI formula.  The experts might enjoy reading it.

Rick_SAS
SAS Super FREQ

After further thought, I no longer think there was an error in my original blog post.  My blog post was about estimating a p-value for a Monte Carlo simulation. As shown in this PROC FREQ documentation, the standard error for the Monte Carlo estimate of a binomial proportion has an N-1 in the denominator.  It hardly matters when you are doing 10,000 Monte Carlo simulations, but N-1 is the correct denominator.  (The OP for this problem should still use N, since that is correct for the SE of the usual estimator p_hat = k/N.)

jlajla
Obsidian | Level 7

Thank you, everyone. It's good to know there are options.


Discussion stats
  • 8 replies
  • 2379 views
  • 3 likes
  • 5 in conversation