Hi,
I am wondering how to complete 95% CI for a categorical variable such as race or income bracket.
I know how to do it for a binary variable:
proc univariate data=data cibasic(alpha=0.05); var binary; run;
But how do I do it for a categorical variable such as race?
So that I get the following
Race
White X, 95% CI x1-x2
Black Y, 95% CI Y1-Y2
Asian Z, 95% cI Z1-Z2
and so on.
THanks so much in advance,
Louise
/* Simulation to determine coverage probabilities of the following three confidence regions for
the parameter vector (p1, p2, p3) of a Mult(100, p1, p2, p3) (multinomial) distribution with
p1=0.1, p2=0.18, p3=0.72 (example parameter values taken from SAS Usage Note 32609)
(1) 95% Wald confidence intervals for the individual probabilities (as computed by PROC CATMOD
in SAS Usage Note 32609, but the same as available from PROC FREQ except for the [trivial]
limitation of confidence bounds to [0, 1] in PROC FREQ)
(2) 98.333...% Wald confidence intervals as above (Bonferroni adjustment: (1-0.05/3)*100%), as
also suggested in R.G. Miller, Simultaneous Statistical Inference, p. 216, formula 10
(3) Bonferroni-adjusted CIs suggested in formula 14, p. 217, ibid., based on chi-square statistics
*/
%let n=100; /* parameters of multinomial distribution */
%let p1=0.1;
%let p2=0.18;
%let p3=0.72;
%let c=3; /* number of categories */
%let a=0.05; /* alpha */
data sim;
call streaminit(271828);
z=probit(1-&a/2); /* normal quantile without Bonferroni adjustment */
b=probit(1-&a/(2*&c)); /* normal quantile with Bonferroni adjustment */
x=cinv(1-&a/&c,1); /* chi-square quantile, df=1, with Bonferroni adjustment */
array n[3]; /* samples of the multinomial distribution */
array lcl[3,3]; /* lower confidence limits (indices e.g. [2,3] for interval 2, param. p3) */
array ucl[3,3]; /* upper confidence limits (indices as above) */
array y[3]; /* 0-1 indicators for "(p1,p2,p3) contained in conf. region (1, 2, 3)" */
do i=1 to 1e7; /* sample size for the simulation */
n1=0; n2=0; n3=0; /* initialization */
do j=1 to &n; /* generate sample from multinomial distribution */
v=rand('table', &p1, &p2, &p3);
n[v]+1;
end;
do k=1 to 3; /* calculate CIs for p1, p2, p3 */
p=n[k]/&n; /* point estimate of probability "pk" (k=1, 2, 3) */
s=sqrt(p*(1-p)/&n); /* approx. standard error of the above point estimate */
lcl[1,k]=p-z*s; /* CI 1 */
ucl[1,k]=p+z*s;
lcl[2,k]=p-b*s; /* CI 2 */
ucl[2,k]=p+b*s;
lcl[3,k]=(x+2*n[k]-sqrt(x*(x+4*n[k]*(&n-n[k])/&n)))/(2*(&n+x)); /* CI 3 */
ucl[3,k]=(x+2*n[k]+sqrt(x*(x+4*n[k]*(&n-n[k])/&n)))/(2*(&n+x));
do m=1 to 3; /* check if confidence regions cover the true parameter vector */
y[m]=(lcl[m,1] <= &p1 <= ucl[m,1] &
lcl[m,2] <= &p2 <= ucl[m,2] &
lcl[m,3] <= &p3 <= ucl[m,3]);
end;
end;
output;
end;
keep y:;
run;
/* Estimate coverage probabilities of the three confidence regions */
ods select BinomialCLs;
proc freq data=sim;
tables y: / binomial(level='1' cl=exact);
run;
/* Results:
Point est. Clopper-Pearson 95% CI
(1) 0.8551 0.8549 0.8554
(2) 0.9402 0.9401 0.9404
(3) 0.9633 0.9632 0.9634
*/
(Actually, this could be coded more elegantly by restricting the calculations to the finitely many [<<1E7] possible outcomes of the particular multinomial distribution and weighing the results by the corresponding probabilities. But the run time of the above code was only less than 1 minute.)
So, for the particular multinomial distribution Mult(100, 0.1, 0.18, 0.72), only the third confidence region achieves the nominal 95% confidence level. The second comes close, but the first fails, as expected.
Conclusion: The PROC CATMOD approach presented in SAS Usage Note 32609 does not seem to provide substantial benefits over using PROC FREQ (here with Bonferroni adjustment):
ods exclude BinomialTest;
proc freq data=a;
weight count / zeros;
tables y / binomial (level='1') alpha=0.01666666666667;
tables y / binomial (level='2') alpha=0.01666666666667;
tables y / binomial (level='3') alpha=0.01666666666667;
run;
There are alternative confidence regions in the literature which can be implemented easily in a data step.
PROC UNIVARIATE only analyzes continuous variables. For discrete variables, you have to use other procedures such as FREQ, LOGISTIC, GENMOD, and CATMOD.
It sounds like you want the confidence intervals for a multinomial variable. An internet search reveals the following possibilities:
If you describe the source of your data, we might be able to make particular recommendations.
Hi Louise,
I see, Rick was faster. I also had a look at that SAS Usage Note 32609. Depending on whether you have a SAS/IML license, there is the "MULTINOM" or the PROC CATMOD approach.
In either case one could add SAS code to automate the process to some extent, so that you don't have to enter, e.g., the response statement for PROC CATMOD manually.
Also, I haven't checked yet if these CIs are identical to those in R.G. Miller, Simultaneous Statistical Inference, p. 217, eqn. 14. (Not sure if this is relevant to you.) Both refer to the Bonferroni adjustment.
Edit: I've checked this now. No, the definition in that classic reference is very different. Those CIs are not symmetric about the point estimate. Please see my next post for more details.
I have investigated the CIs proposed in SAS Usage Note 32609 with a primary focus on the PROC CATMOD approach as I don't have SAS/IML licensed.
It turned out that the CIs produced by PROC CATMOD (at least in the example given) are merely the same that could be produced by PROC FREQ using the BINOMIAL option of the TABLES statement: the simple, well-known approximate CI for a single proportion. I've observed only three marginal differences:
Given this comparison, I don't know why that SAS Usage Note suggests PROC CATMOD to perform a typical task for PROC FREQ.
Moreover, there seems to be an error in Example 2 of the document: It says that "Bonferroni adjusted 95% confidence intervals" are calculated using the "MULTINOM" module and they call this module with the appropriate parameter ("S") for simultaneous CIs, but the results presented in the table contain only unadjusted 95% CIs! These are identical to those computed with PROC CATMOD subsequently.
Both PROC FREQ (TABLES statement) and PROC CATMOD (MODEL statement) have the common ALPHA= option, so that the Bonferroni adjustment can be achieved by increasing the nominal confidence level (from 95% to 98.333...% in the example with 3 categories).
To demonstrate the difference between adjusted and unadjusted CI in terms of coverage probability and to compare these CIs to that (from R.G. Miller's monograph) mentioned in my previous post, I've performed a quick (run time <1 min) simulation. It involves 10 million random samples from one particular multinomial distribution. Please see my next post.
/* Simulation to determine coverage probabilities of the following three confidence regions for
the parameter vector (p1, p2, p3) of a Mult(100, p1, p2, p3) (multinomial) distribution with
p1=0.1, p2=0.18, p3=0.72 (example parameter values taken from SAS Usage Note 32609)
(1) 95% Wald confidence intervals for the individual probabilities (as computed by PROC CATMOD
in SAS Usage Note 32609, but the same as available from PROC FREQ except for the [trivial]
limitation of confidence bounds to [0, 1] in PROC FREQ)
(2) 98.333...% Wald confidence intervals as above (Bonferroni adjustment: (1-0.05/3)*100%), as
also suggested in R.G. Miller, Simultaneous Statistical Inference, p. 216, formula 10
(3) Bonferroni-adjusted CIs suggested in formula 14, p. 217, ibid., based on chi-square statistics
*/
%let n=100; /* parameters of multinomial distribution */
%let p1=0.1;
%let p2=0.18;
%let p3=0.72;
%let c=3; /* number of categories */
%let a=0.05; /* alpha */
data sim;
call streaminit(271828);
z=probit(1-&a/2); /* normal quantile without Bonferroni adjustment */
b=probit(1-&a/(2*&c)); /* normal quantile with Bonferroni adjustment */
x=cinv(1-&a/&c,1); /* chi-square quantile, df=1, with Bonferroni adjustment */
array n[3]; /* samples of the multinomial distribution */
array lcl[3,3]; /* lower confidence limits (indices e.g. [2,3] for interval 2, param. p3) */
array ucl[3,3]; /* upper confidence limits (indices as above) */
array y[3]; /* 0-1 indicators for "(p1,p2,p3) contained in conf. region (1, 2, 3)" */
do i=1 to 1e7; /* sample size for the simulation */
n1=0; n2=0; n3=0; /* initialization */
do j=1 to &n; /* generate sample from multinomial distribution */
v=rand('table', &p1, &p2, &p3);
n[v]+1;
end;
do k=1 to 3; /* calculate CIs for p1, p2, p3 */
p=n[k]/&n; /* point estimate of probability "pk" (k=1, 2, 3) */
s=sqrt(p*(1-p)/&n); /* approx. standard error of the above point estimate */
lcl[1,k]=p-z*s; /* CI 1 */
ucl[1,k]=p+z*s;
lcl[2,k]=p-b*s; /* CI 2 */
ucl[2,k]=p+b*s;
lcl[3,k]=(x+2*n[k]-sqrt(x*(x+4*n[k]*(&n-n[k])/&n)))/(2*(&n+x)); /* CI 3 */
ucl[3,k]=(x+2*n[k]+sqrt(x*(x+4*n[k]*(&n-n[k])/&n)))/(2*(&n+x));
do m=1 to 3; /* check if confidence regions cover the true parameter vector */
y[m]=(lcl[m,1] <= &p1 <= ucl[m,1] &
lcl[m,2] <= &p2 <= ucl[m,2] &
lcl[m,3] <= &p3 <= ucl[m,3]);
end;
end;
output;
end;
keep y:;
run;
/* Estimate coverage probabilities of the three confidence regions */
ods select BinomialCLs;
proc freq data=sim;
tables y: / binomial(level='1' cl=exact);
run;
/* Results:
Point est. Clopper-Pearson 95% CI
(1) 0.8551 0.8549 0.8554
(2) 0.9402 0.9401 0.9404
(3) 0.9633 0.9632 0.9634
*/
(Actually, this could be coded more elegantly by restricting the calculations to the finitely many [<<1E7] possible outcomes of the particular multinomial distribution and weighing the results by the corresponding probabilities. But the run time of the above code was only less than 1 minute.)
So, for the particular multinomial distribution Mult(100, 0.1, 0.18, 0.72), only the third confidence region achieves the nominal 95% confidence level. The second comes close, but the first fails, as expected.
Conclusion: The PROC CATMOD approach presented in SAS Usage Note 32609 does not seem to provide substantial benefits over using PROC FREQ (here with Bonferroni adjustment):
ods exclude BinomialTest;
proc freq data=a;
weight count / zeros;
tables y / binomial (level='1') alpha=0.01666666666667;
tables y / binomial (level='2') alpha=0.01666666666667;
tables y / binomial (level='3') alpha=0.01666666666667;
run;
There are alternative confidence regions in the literature which can be implemented easily in a data step.
If anyone is curious about how to simulate multinomial data in SAS, there are several options
Thank you so much everyone for your help!
I adapted the code to account for every single race value in my "race variable" and "missing" so that my 95% CI would account for missing values as well.
proc freq data=data;
tables race: / missing binomial(level='1' cl=exact);
tables race: / missing binomial(level='2' cl=exact);
tables race: / missing binomial(level='3' cl=exact);
tables race: / missing binomial(level='4' cl=exact);
tables race: / missing binomial(level='5' cl=exact);
tables race: / missing binomial(level='6' cl=exact);
run;
This is fine if your focus is on the individual confidence intervals rather than a simultaneous confidence region.
Adding reference for the future: compute simultaneous confidence intervals for multinomial proportions in SAS.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.