Help using Base SAS procedures

Binomial Confidence Intervals without dataset

Accepted Solution Solved
Reply
Contributor
Posts: 23
Accepted Solution

Binomial Confidence Intervals without dataset

[ Edited ]

How can I calculate a confidence interval for a binomial distributed proportion without using a dataset?

 

Is there a comman equivalent to the Stata cii n Y command?

 

I have the sample size and probability of success, and I need to calculate the CI.

 

Thank you.


Accepted Solutions
Solution
‎02-01-2016 09:02 PM
Super User
Posts: 10,020

Re: Binomial Confidence Intervals without dataset

If your sample size is big ,and can use Normal Asymptoic distribution to calculate it . Otherwise , you need calculated the quantitle of binomial distribution.

here is the code from Rick's blog . 

http://blogs.sas.com/content/iml/2015/10/28/simulation-exact-tables.html

 

p = mean(x);                       /* Probability of success */
   se = sqrt(p*(1-p) / (n-1));  /* standard error,n is the sample size  */
   z = quantile("Normal", 1-alpha/2); /* two-sided,if n is samll, try Binomial */
   LowerCL = p - z*se;
   UpperCL = p + z*se;

 

View solution in original post


All Replies
Solution
‎02-01-2016 09:02 PM
Super User
Posts: 10,020

Re: Binomial Confidence Intervals without dataset

If your sample size is big ,and can use Normal Asymptoic distribution to calculate it . Otherwise , you need calculated the quantitle of binomial distribution.

here is the code from Rick's blog . 

http://blogs.sas.com/content/iml/2015/10/28/simulation-exact-tables.html

 

p = mean(x);                       /* Probability of success */
   se = sqrt(p*(1-p) / (n-1));  /* standard error,n is the sample size  */
   z = quantile("Normal", 1-alpha/2); /* two-sided,if n is samll, try Binomial */
   LowerCL = p - z*se;
   UpperCL = p + z*se;

 

Contributor
Posts: 23

Re: Binomial Confidence Intervals without dataset

great, thank you!

Super User
Posts: 19,770

Re: Binomial Confidence Intervals without dataset

You can approximate the solution by creating a dataset that has the properties and then using proc freq. 

 

I've included the code below and for comparisons @Ksharp s code so you can test your results.

 

Hope this helps.

 

data have;
n=2000;
p=0.7;
type='X';
X=floor(p*n); output;
type='Y'; x=n-x; output;

run;

data ci;
alpha=0.05;
 n=2000;
 p=0.7;                     /* Probability of success */
   se = sqrt(p*(1-p) / (n-1));  /* standard error,n is the sample size  */
   z = quantile("Normal", 1-(alpha/2)); /* two-sided,if n is samll, try Binomial */
   LowerCL = p - z*se;
   UpperCL = p + z*se;
  output;
 run;

proc freq data=have;
table type/binomial (p=0.7);
weight x;
run;
 
Contributor
Posts: 23

Re: Binomial Confidence Intervals without dataset

thank you. This is what I was considering doing. It seems it might be a bit more straightforward.

Trusted Advisor
Posts: 1,117

Re: Binomial Confidence Intervals without dataset

It should be noted that the CI computed by the formula suggested is not identical to the similar CI computed by PROC FREQ. The latter is the well-known Wald CI with denominator n in the radicand, whereas the formula copied from @Rick_SAS's blog uses denominator n-1.

 

The standard error of point estimator p (sample proportion of successes) of binomial parameter P is sqrt(P*(1-P)/n) (Fleiss et al., Statistical Methods for Rates and Proportions, 3rd ed., p. 26). That's where denominator n comes from.

 

Denominator n-1 could be justified by the fact that n*p*(1-p)/(n-1) is the uniform minimum variance unbiased estimator of P*(1-P) (Lehmann/Casella, Theory of Point Estimation, 2nd ed., p. 100).

 

Obviously, with denominator n-1 the CI is wider, but only slightly so if n is large.

 

Just for curiosity I computed coverage probabilities for this CI ("CI2") and compared them to those of the Wald CI ("CI1") and of the Wald CI with continuity correction ("CI3"), p -/+ (z*sqrt(p*(1-p)/n)+1/(2*n)) (Fleiss et al., p. 29; z=probit(0.975)), all for confidence level 0.95. The investigation was restricted to 17<=n<=200 and P=k/10000 with integers k such that nP>=5 and n(1-P)>=5 (criterion from Fleiss et al., p. 26), a total of 1,590,280 pairs (n, P).

 

Some results:

  • Among the investigated 184 values of n, the proportion of cases (i.e. values of p) with coverage probabilities (CP) >=0.95 varied between 36.6% and 84.1% for CI3 (>=50% for each n>=24), but reached only 0.3 - 24.5% for CI1 and 4.5 - 28.1% for CI2.
  • Ranges of minimum coverage probabilities were [0.873, 0.894] for CI1, [0.874, 0.899] for CI2 and still only [0.896, 0.914] for CI3, in spite of the restrictions nP>=5 and n(1-P)>=5.
  • CP for CI1 and CI2 were equal in 83.6 - 94.8% of cases ("CP1=CP2").
  • For each n (14, 15, ..., 200) there were >1.5% cases with CP1 < CP2 <= 0.95. For some n, e.g. n=34, this proportion exceeded 10%.
  • The proportion of cases with CP1 < 0.95 < CP2 ranged from 1.9% to 11.1%. In most (>85%) of these cases CP2 was closer to 0.95 than CP1.
  • Not a single case was observed in which 0.95 <= CP1 < CP2.
  • In many cases (26.8 - 55.9%) CP2 was equal to CP3.
  • Proportions of cases with CP2 < 0.95 < CP3 and CP3 closer to 0.95 had a similar range: 22.4 - 50.7%.
  • The range of proportions of cases with 0.95 < CP2 < CP3 was only 0 - 8.0%.
  • For only 16 values of n the proportion of cases with either CP1 or CP2 in ]0.94, 0.96[ and CP3 outside this interval was greater than the proportion of cases showing the opposite behavior.
  • Overall, the proportions of cases with CP in ]0.94, 0.96[ were 61.3%, 65.8% and 71.6%, respectively (CP1, CP2, CP3).

 

So, at least for 17<=n<=200 the CI with denominator n-1 has advantages over the standard Wald CI. In many of these cases a continuity correction makes sense.

 

However, there are CIs for the binomial proportion with even better coverage properties without being overly conservative (as the Clopper-Pearson CI): see Brown et al. (2001), Interval Estimation for a Binomial Proportion, where also some of the above results on CI1 (and many more) can be found.

 

Below (under the spoiler tag) is part of the code I used for my investigations.

Spoiler
%macro binci(n);

/* Compute the three 95% CIs for all possible outcomes of n Bernoulli samples */

data ci95;
do k=0 to &n;
  p0=k/&n;
  z=probit(0.975);
  d=z*sqrt(p0*(1-p0)/&n);
  pL1=p0-d;
  pU1=p0+d;
  d=d+1/(2*&n);
  pL3=p0-d;
  pU3=p0+d;
  d=z*sqrt(p0*(1-p0)/(&n-1));
  pL2=p0-d;
  pU2=p0+d;
  output;
end;
drop p0 z d;
run;

/* Determine the range of true parameter p (times 10000) in which np>=5 and n(1-p)>=5 */

%let iL=%sysfunc(ceil(50000/&n));
%let iU=%eval(10000-&iL);

/* Calculate coverage probabilities depending on p, incrementing p by 0.0001 */

data tmp;
set ci95;
do i=&iL to &iU;
  p=i/10000;
  w1=(pL1 <= p <= pU1)*pdf('binom',k,p,&n);
  w2=(pL2 <= p <= pU2)*pdf('binom',k,p,&n);
  w3=(pL3 <= p <= pU3)*pdf('binom',k,p,&n);
  output;
end;
drop i;
run;

proc summary data=tmp nway;
class p;
var w1-w3;
output out=covprob sum=;
run;

/* Compute descriptive statistics and create plot for coverage probabilities */

proc means data=covprob mean std min q1 median q3 max;
var w1-w3;
run;

goptions reset=all;

symbol1 v=x    c=red    h=0.9;
symbol2 v=plus c=orange h=0.6;
symbol3 v=dot  c=green  h=0.3;
axis1 order=(0.85 to 1 by 0.005)
      label=(angle=90 'Coverage probability');

legend1 label=('CI') value=('Wald' 'Wald, but with denominator n-1' 'Wald with continuity correction')
        across=1 mode=protect position=(top right inside) shape=symbol(3, .9);

title "Coverage Probabilities of 95% Confidence Intervals for Binomial Parameter p when n=&n, np>=5 and n(1-p)>=5";

proc gplot data=covprob;
plot w1*p=1 w2*p=2 w3*p=3 / overlay vaxis=axis1 vref=0.95 legend=legend1;
run;
quit;

title;

%mend binci;

%binci(17)  /* n= 17, p=0.2942, 0.2943, 0.2944, ..., 0.7057, 0.7058 */
%binci(25)  /* n= 25, p=0.2,    0.2001, 0.2002, ..., 0.7999, 0.8    */
%binci(50)  /* n= 50, p=0.1,    0.1001, 0.1002, ..., 0.8999, 0.9    */
%binci(100) /* n=100, p=0.05,   0.0501, 0.0502, ..., 0.9499, 0.95   */
SAS Super FREQ
Posts: 3,752

Re: Binomial Confidence Intervals without dataset

Posted in reply to FreelanceReinhard

@FreelanceReinhard and @Ksharp Sorry about that n-1 in the denominator.  I suspect I copied and pasted that code from some earlier program that computed CIs for a different distribution.  I intended to use the same formula that PROC FREQ uses. I will have to edit the blog post to correct that error.  I suggest that the OP use the (correct!) Wald formula.

 

By the way, I wrote a blog post about the normal approximation to the binomial distribution, which is the basis for the CI formula.  The experts might enjoy reading it.

SAS Super FREQ
Posts: 3,752

Re: Binomial Confidence Intervals without dataset

After further thought, I no longer think there was an error in my original blog post.  My blog post was about estimating a p-value for a Monte Carlo simulation. As shown in this PROC FREQ documentation, the standard error for the Monte Carlo estimate of a binomial proportion has an N-1 in the denominator.  It hardly matters when you are doing 10,000 Monte Carlo simulations, but N-1 is the correct denominator.  (The OP for this problem should still use N, since that is correct for the SE for the usual estimator p_hat = k/N.

Contributor
Posts: 23

Re: Binomial Confidence Intervals without dataset

thank you, everoyone. it's good to know there are options.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 522 views
  • 3 likes
  • 5 in conversation