🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jlajla
Obsidian | Level 7

How can I calculate a confidence interval for a binomial distributed proportion without using a dataset?

 

Is there a command equivalent to the Stata cii n Y command?

 

I have the sample size and probability of success, and I need to calculate the CI.

 

Thank you.

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

If your sample size is big, you can use the asymptotic normal distribution to calculate it. Otherwise, you need to calculate the quantile of the binomial distribution.

Here is the code from Rick's blog:

http://blogs.sas.com/content/iml/2015/10/28/simulation-exact-tables.html

 

p = mean(x);                        /* probability of success */
se = sqrt(p*(1-p) / (n-1));         /* standard error; n is the sample size */
z = quantile("Normal", 1-alpha/2);  /* two-sided; if n is small, try the binomial quantile */
LowerCL = p - z*se;
UpperCL = p + z*se;

 


8 REPLIES
Ksharp
Super User

If your sample size is big, you can use the asymptotic normal distribution to calculate it. Otherwise, you need to calculate the quantile of the binomial distribution.

Here is the code from Rick's blog:

http://blogs.sas.com/content/iml/2015/10/28/simulation-exact-tables.html

 

p = mean(x);                        /* probability of success */
se = sqrt(p*(1-p) / (n-1));         /* standard error; n is the sample size */
z = quantile("Normal", 1-alpha/2);  /* two-sided; if n is small, try the binomial quantile */
LowerCL = p - z*se;
UpperCL = p + z*se;

 

jlajla
Obsidian | Level 7

Great, thank you!

Reeza
Super User

You can approximate the solution by creating a dataset that has the properties and then using proc freq. 

 

I've included the code below and, for comparison, @Ksharp's code so you can test your results.

 

Hope this helps.

 

data have;
n=2000;
p=0.7;
type='X';
X=floor(p*n); output;
type='Y'; x=n-x; output;

run;

data ci;
alpha=0.05;
 n=2000;
 p=0.7;                          /* probability of success */
   se = sqrt(p*(1-p) / (n-1));   /* standard error; n is the sample size */
   z = quantile("Normal", 1-(alpha/2)); /* two-sided; if n is small, try the binomial quantile */
   LowerCL = p - z*se;
   UpperCL = p + z*se;
  output;
 run;

proc freq data=have;
table type/binomial (p=0.7);
weight x;
run;
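Depending on your SAS release, PROC FREQ can also request alternative binomial confidence intervals directly through the CL= suboption of BINOMIAL; the keyword list below (WILSON, EXACT) is an assumption based on the SAS 9.4 documentation, so check the PROC FREQ documentation for your version before relying on it:

```sas
/* Sketch: request the Wilson (score) and exact Clopper-Pearson CIs
   in addition to the default Wald interval. CL= keyword support
   varies by SAS release. */
proc freq data=have;
   tables type / binomial (p=0.7 CL=(WILSON EXACT));
   weight x;
run;
```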
 
jlajla
Obsidian | Level 7

Thank you. This is what I was considering doing. It seems this might be a bit more straightforward.

FreelanceReinh
Jade | Level 19

It should be noted that the CI computed by the formula suggested is not identical to the similar CI computed by PROC FREQ. The latter is the well-known Wald CI with denominator n in the radicand, whereas the formula copied from @Rick_SAS's blog uses denominator n-1.

 

The standard error of point estimator p (sample proportion of successes) of binomial parameter P is sqrt(P*(1-P)/n) (Fleiss et al., Statistical Methods for Rates and Proportions, 3rd ed., p. 26). That's where denominator n comes from.

 

Denominator n-1 could be justified by the fact that n*p*(1-p)/(n-1) is the uniform minimum variance unbiased estimator of P*(1-P) (Lehmann/Casella, Theory of Point Estimation, 2nd ed., p. 100).

 

Obviously, with denominator n-1 the CI is wider, but only slightly so if n is large.

 

Just out of curiosity, I computed coverage probabilities for this CI ("CI2") and compared them to those of the Wald CI ("CI1") and of the Wald CI with continuity correction ("CI3"), p -/+ (z*sqrt(p*(1-p)/n)+1/(2*n)) (Fleiss et al., p. 29; z=probit(0.975)), all for confidence level 0.95. The investigation was restricted to 17<=n<=200 and P=k/10000 with integers k such that nP>=5 and n(1-P)>=5 (criterion from Fleiss et al., p. 26), a total of 1,590,280 pairs (n, P).

 

Some results:

  • Among the investigated 184 values of n, the proportion of cases (i.e. values of p) with coverage probabilities (CP) >=0.95 varied between 36.6% and 84.1% for CI3 (>=50% for each n>=24), but reached only 0.3 - 24.5% for CI1 and 4.5 - 28.1% for CI2.
  • Ranges of minimum coverage probabilities were [0.873, 0.894] for CI1, [0.874, 0.899] for CI2 and still only [0.896, 0.914] for CI3, in spite of the restrictions nP>=5 and n(1-P)>=5.
  • CP for CI1 and CI2 were equal in 83.6 - 94.8% of cases ("CP1=CP2").
  • For each n (14, 15, ..., 200) there were >1.5% cases with CP1 < CP2 <= 0.95. For some n, e.g. n=34, this proportion exceeded 10%.
  • The proportion of cases with CP1 < 0.95 < CP2 ranged from 1.9% to 11.1%. In most (>85%) of these cases CP2 was closer to 0.95 than CP1.
  • Not a single case was observed in which 0.95 <= CP1 < CP2.
  • In many cases (26.8 - 55.9%) CP2 was equal to CP3.
  • Proportions of cases with CP2 < 0.95 < CP3 and CP3 closer to 0.95 had a similar range: 22.4 - 50.7%.
  • The range of proportions of cases with 0.95 < CP2 < CP3 was only 0 - 8.0%.
  • For only 16 values of n the proportion of cases with either CP1 or CP2 in ]0.94, 0.96[ and CP3 outside this interval was greater than the proportion of cases showing the opposite behavior.
  • Overall, the proportions of cases with CP in ]0.94, 0.96[ were 61.3%, 65.8% and 71.6%, respectively (CP1, CP2, CP3).

 

So, at least for 17<=n<=200 the CI with denominator n-1 has advantages over the standard Wald CI. In many of these cases a continuity correction makes sense.

 

However, there are CIs for the binomial proportion with even better coverage properties that are not overly conservative (as the Clopper-Pearson CI is): see Brown et al. (2001), Interval Estimation for a Binomial Proportion, where some of the above results on CI1 (and many more) can also be found.
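One of the intervals recommended by Brown et al. is the Wilson (score) interval, which can be computed in a plain DATA step. The sketch below is a minimal illustration, not tested against PROC FREQ output; the values n=2000 and k=1400 are hypothetical, chosen to match the n=2000, p=0.7 example earlier in this thread:

```sas
/* Sketch: 95% Wilson (score) CI for a binomial proportion.
   Inputs n (sample size) and k (number of successes) are
   placeholders for the poster's own values. */
data wilson;
  n = 2000; k = 1400; alpha = 0.05;
  phat  = k / n;                          /* sample proportion */
  z     = quantile("Normal", 1-alpha/2);  /* two-sided critical value */
  denom = 1 + z**2/n;
  center    = (phat + z**2/(2*n)) / denom;
  halfwidth = z * sqrt(phat*(1-phat)/n + z**2/(4*n**2)) / denom;
  LowerCL = center - halfwidth;
  UpperCL = center + halfwidth;
run;

proc print data=wilson;
  var phat LowerCL UpperCL;
run;
```

Unlike the Wald interval, the Wilson interval is never degenerate at phat=0 or phat=1 and its limits always stay within [0, 1].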

 

Below (under the spoiler tag) is part of the code I used for my investigations.

Spoiler
%macro binci(n);

/* Compute the three 95% CIs for all possible outcomes of n Bernoulli samples */

data ci95;
do k=0 to &n;
  p0=k/&n;
  z=probit(0.975);
  d=z*sqrt(p0*(1-p0)/&n);
  pL1=p0-d;
  pU1=p0+d;
  d=d+1/(2*&n);
  pL3=p0-d;
  pU3=p0+d;
  d=z*sqrt(p0*(1-p0)/(&n-1));
  pL2=p0-d;
  pU2=p0+d;
  output;
end;
drop p0 z d;
run;

/* Determine the range of true parameter p (times 10000) in which np>=5 and n(1-p)>=5 */

%let iL=%sysfunc(ceil(50000/&n));
%let iU=%eval(10000-&iL);

/* Calculate coverage probabilities depending on p, incrementing p by 0.0001 */

data tmp;
set ci95;
do i=&iL to &iU;
  p=i/10000;
  w1=(pL1 <= p <= pU1)*pdf('binom',k,p,&n);
  w2=(pL2 <= p <= pU2)*pdf('binom',k,p,&n);
  w3=(pL3 <= p <= pU3)*pdf('binom',k,p,&n);
  output;
end;
drop i;
run;

proc summary data=tmp nway;
class p;
var w1-w3;
output out=covprob sum=;
run;

/* Compute descriptive statistics and create plot for coverage probabilities */

proc means data=covprob mean std min q1 median q3 max;
var w1-w3;
run;

goptions reset=all;

symbol1 v=x    c=red    h=0.9;
symbol2 v=plus c=orange h=0.6;
symbol3 v=dot  c=green  h=0.3;
axis1 order=(0.85 to 1 by 0.005)
      label=(angle=90 'Coverage probability');

legend1 label=('CI') value=('Wald' 'Wald, but with denominator n-1' 'Wald with continuity correction')
        across=1 mode=protect position=(top right inside) shape=symbol(3, .9);

title "Coverage Probabilities of 95% Confidence Intervals for Binomial Parameter p when n=&n, np>=5 and n(1-p)>=5";

proc gplot data=covprob;
plot w1*p=1 w2*p=2 w3*p=3 / overlay vaxis=axis1 vref=0.95 legend=legend1;
run;
quit;

title;

%mend binci;

%binci(17)  /* n= 17, p=0.2942, 0.2943, 0.2944, ..., 0.7057, 0.7058 */
%binci(25)  /* n= 25, p=0.2,    0.2001, 0.2002, ..., 0.7999, 0.8    */
%binci(50)  /* n= 50, p=0.1,    0.1001, 0.1002, ..., 0.8999, 0.9    */
%binci(100) /* n=100, p=0.05,   0.0501, 0.0502, ..., 0.9499, 0.95   */
Rick_SAS
SAS Super FREQ

@FreelanceReinh and @Ksharp Sorry about that n-1 in the denominator.  I suspect I copied and pasted that code from some earlier program that computed CIs for a different distribution.  I intended to use the same formula that PROC FREQ uses. I will have to edit the blog post to correct that error.  I suggest that the OP use the (correct!) Wald formula.

 

By the way, I wrote a blog post about the normal approximation to the binomial distribution, which is the basis for the CI formula.  The experts might enjoy reading it.

Rick_SAS
SAS Super FREQ

After further thought, I no longer think there was an error in my original blog post.  My blog post was about estimating a p-value for a Monte Carlo simulation. As shown in this PROC FREQ documentation, the standard error for the Monte Carlo estimate of a binomial proportion has an N-1 in the denominator.  It hardly matters when you are doing 10,000 Monte Carlo simulations, but N-1 is the correct denominator.  (The OP for this problem should still use N, since that is correct for the SE of the usual estimator p_hat = k/N.)

jlajla
Obsidian | Level 7

Thank you, everyone. It's good to know there are options.


Discussion stats
  • 8 replies
  • 2379 views
  • 3 likes
  • 5 in conversation