Programming the statistical procedures from SAS

95% CI for Categorical Variables

Accepted Solution Solved
Reply
Occasional Contributor LLW
Occasional Contributor
Posts: 15
Accepted Solution

95% CI for Categorical Variables

Hi,

 

I am wondering how to complete 95% CI for a categorical variable such as race or income bracket.

I know how to do it for a binary variable:

 

proc univariate data=data  cibasic(alpha=0.05);   var binary; run;

 

But how do I do it for a categorical variable such as race?

 

So that I get the following

 

Race

White X, 95% CI x1-x2

Black Y, 95% CI Y1-Y2

Asian Z, 95% cI Z1-Z2

and so on.

 

THanks so much in advance,

Louise

 

 


Accepted Solutions
Solution
‎02-12-2016 11:00 PM
Trusted Advisor
Posts: 1,114

Re: 95% CI for Categorical Variables

/* Simulation to determine coverage probabilities of the following three confidence regions for
   the parameter vector (p1, p2, p3) of a Mult(100, p1, p2, p3) (multinomial) distribution with
   p1=0.1, p2=0.18, p3=0.72 (example parameter values taken from SAS Usage Note 32609)

   (1) 95% Wald confidence intervals for the individual probabilities (as computed by PROC CATMOD
       in SAS Usage Note 32609, but the same as available from PROC FREQ except for the [trivial]
       limitation of confidence bounds to [0, 1] in PROC FREQ)
   (2) 98.333...% Wald confidence intervals as above (Bonferroni adjustment: (1-0.05/3)*100%), as
       also suggested in R.G. Miller, Simultaneous Statistical Inference, p. 216, formula 10
   (3) Bonferroni-adjusted CIs suggested in formula 14, p. 217, ibid., based on chi-square statistics
*/

%let n=100;   /* parameters of multinomial distribution */
%let p1=0.1;
%let p2=0.18;
%let p3=0.72;
%let c=3;     /* number of categories */
%let a=0.05;  /* alpha */

data sim;
call streaminit(271828);
z=probit(1-&a/2);       /* normal quantile without Bonferroni adjustment */ 
b=probit(1-&a/(2*&c));  /* normal quantile with Bonferroni adjustment */
x=cinv(1-&a/&c,1);      /* chi-square quantile, df=1, with Bonferroni adjustment */
array n[3];             /* samples of the multinomial distribution */
array lcl[3,3];         /* lower confidence limits (indices e.g. [2,3] for interval 2, param. p3) */
array ucl[3,3];         /* upper confidence limits (indices as above) */
array y[3];             /* 0-1 indicators for "(p1,p2,p3) contained in conf. region (1, 2, 3)" */
do i=1 to 1e7;          /* sample size for the simulation */
  n1=0; n2=0; n3=0;     /* initialization */
  do j=1 to &n;         /* generate sample from multinomial distribution */
    v=rand('table', &p1, &p2, &p3);
    n[v]+1;
  end;
  do k=1 to 3;          /* calculate CIs for p1, p2, p3 */
    p=n[k]/&n;          /* point estimate of probability "pk" (k=1, 2, 3) */
    s=sqrt(p*(1-p)/&n); /* approx. standard error of the above point estimate */
    lcl[1,k]=p-z*s;     /* CI 1 */
    ucl[1,k]=p+z*s;
    lcl[2,k]=p-b*s;     /* CI 2 */
    ucl[2,k]=p+b*s;
    lcl[3,k]=(x+2*n[k]-sqrt(x*(x+4*n[k]*(&n-n[k])/&n)))/(2*(&n+x)); /* CI 3 */
    ucl[3,k]=(x+2*n[k]+sqrt(x*(x+4*n[k]*(&n-n[k])/&n)))/(2*(&n+x));
    do m=1 to 3;        /* check if confidence regions cover the true parameter vector */
      y[m]=(lcl[m,1] <= &p1 <= ucl[m,1] &
            lcl[m,2] <= &p2 <= ucl[m,2] &
            lcl[m,3] <= &p3 <= ucl[m,3]);
    end;
  end;
  output;
end;
keep y:;
run;

/* Estimate coverage probabilities of the three confidence regions */

ods select BinomialCLs;
proc freq data=sim;
tables y: / binomial(level='1' cl=exact);
run;

/* Results:

     Point est.    Clopper-Pearson 95% CI
                   
 (1) 0.8551        0.8549         0.8554
 (2) 0.9402        0.9401         0.9404
 (3) 0.9633        0.9632         0.9634
*/

(Actually, this could be coded more elegantly by restricting the calculations to the finitely many [<<1E7] possible outcomes of the particular multinomial distribution and weighing the results by the corresponding probabilities. But the run time of the above code was only less than 1 minute.)

 

So, for the particular multinomial distribution Mult(100, 0.1, 0.18, 0.72), only the third confidence region achieves the nominal 95% confidence level. The second comes close, but the first fails, as expected.

 

Conclusion: The PROC CATMOD approach presented in SAS Usage Note 32609 does not seem to provide substantial benefits over using PROC FREQ (here with Bonferroni adjustment):

ods exclude BinomialTest;
proc freq data=a;
weight count / zeros;
tables y / binomial (level='1') alpha=0.01666666666667;
tables y / binomial (level='2') alpha=0.01666666666667;
tables y / binomial (level='3') alpha=0.01666666666667;
run;

There are alternative confidence regions in the literature which can be implemented easily in a data step.

 

View solution in original post


All Replies
SAS Super FREQ
Posts: 3,310

Re: 95% CI for Categorical Variables

PROC UNIVARIATE only analyzes continuous variables.  For discrete variables, you have to use other procedures such as FREQ, LOGISTIC, GENMOD, and CATMOD.

 

It sounds like you want the confidence intervals for a multinomial variable. An internet search reveals the following possibilities:

If you describe the source of your data, we might be able to make particular recommendations.

Trusted Advisor
Posts: 1,114

Re: 95% CI for Categorical Variables

[ Edited ]

Hi Louise,

 

I see, Rick was faster. I also had a look at that SAS Usage Note 32609. Depending on whether you have a SAS/IML license, there is the "MULTINOM" or the PROC CATMOD approach.

 

In either case one could add SAS code to automate the process to some extent, so that you don't have to enter, e.g., the response statement for PROC CATMOD manually.

 

Also, I haven't checked yet if these CIs are identical to those in R.G. Miller, Simultaneous Statistical Inference, p. 217, eqn. 14. (Not sure if this is relevant to you.) Both refer to the Bonferroni adjustment.

Edit: I've checked this now. No, the definition in that classic reference is very different. Those CIs are not symmetric about the point estimate. Please see my next post for more details.

Trusted Advisor
Posts: 1,114

Re: 95% CI for Categorical Variables

I have investigated the CIs proposed in SAS Usage Note 32609 with a primary focus on the PROC CATMOD approach as I don't have SAS/IML licensed.

 

It turned out that the CIs produced by PROC CATMOD (at least in the example given) are merely the same that could be produced by PROC FREQ using the BINOMIAL option of the TABLES statement: the simple, well-known approximate CI for a single proportion. I've observed only three marginal differences:

  1. PROC FREQ, unlike PROC CATMOD, limits the confidence bounds to [0, 1].
  2. Both procedures have difficulties with zero frequencies when using default settings.
    PROC FREQ has the ZEROS option of the WEIGHT statement to produce the degenerate interval [0, 0] in this case.
    There doesn't seem to exist an equivalent option in PROC CATMOD.
  3. PROC FREQ is much faster (but appears to be still slower than calculating the CI in a data step).

Given this comparison, I don't know why that SAS Usage Note suggests PROC CATMOD to perform a typical task for PROC FREQ.

Moreover, there seems to be an error in Example 2 of the document: It says that "Bonferroni adjusted 95% confidence intervals" are calculated using the "MULTINOM" module and they call this module with the appropriate parameter ("S") for simultaneous CIs, but the results presented in the table contain only unadjusted 95% CIs! These are identical to those computed with PROC CATMOD subsequently.

 

Both PROC FREQ (TABLES statement) and PROC CATMOD (MODEL statement) have the common ALPHA= option, so that the Bonferroni adjustment can be achieved by increasing the nominal confidence level (from 95% to 98.333...% in the example with 3 categories).

 

To demonstrate the difference between adjusted and unadjusted CI in terms of coverage probability and to compare these CIs to that (from R.G. Miller's monograph) mentioned in my previous post, I've performed a quick (run time <1 min) simulation. It involves 10 million random samples from one particular multinomial distribution. Please see my next post.

Solution
‎02-12-2016 11:00 PM
Trusted Advisor
Posts: 1,114

Re: 95% CI for Categorical Variables

/* Simulation to determine coverage probabilities of the following three confidence regions for
   the parameter vector (p1, p2, p3) of a Mult(100, p1, p2, p3) (multinomial) distribution with
   p1=0.1, p2=0.18, p3=0.72 (example parameter values taken from SAS Usage Note 32609)

   (1) 95% Wald confidence intervals for the individual probabilities (as computed by PROC CATMOD
       in SAS Usage Note 32609, but the same as available from PROC FREQ except for the [trivial]
       limitation of confidence bounds to [0, 1] in PROC FREQ)
   (2) 98.333...% Wald confidence intervals as above (Bonferroni adjustment: (1-0.05/3)*100%), as
       also suggested in R.G. Miller, Simultaneous Statistical Inference, p. 216, formula 10
   (3) Bonferroni-adjusted CIs suggested in formula 14, p. 217, ibid., based on chi-square statistics
*/

%let n=100;   /* parameters of multinomial distribution */
%let p1=0.1;
%let p2=0.18;
%let p3=0.72;
%let c=3;     /* number of categories */
%let a=0.05;  /* alpha */

data sim;
call streaminit(271828);
z=probit(1-&a/2);       /* normal quantile without Bonferroni adjustment */ 
b=probit(1-&a/(2*&c));  /* normal quantile with Bonferroni adjustment */
x=cinv(1-&a/&c,1);      /* chi-square quantile, df=1, with Bonferroni adjustment */
array n[3];             /* samples of the multinomial distribution */
array lcl[3,3];         /* lower confidence limits (indices e.g. [2,3] for interval 2, param. p3) */
array ucl[3,3];         /* upper confidence limits (indices as above) */
array y[3];             /* 0-1 indicators for "(p1,p2,p3) contained in conf. region (1, 2, 3)" */
do i=1 to 1e7;          /* sample size for the simulation */
  n1=0; n2=0; n3=0;     /* initialization */
  do j=1 to &n;         /* generate sample from multinomial distribution */
    v=rand('table', &p1, &p2, &p3);
    n[v]+1;
  end;
  do k=1 to 3;          /* calculate CIs for p1, p2, p3 */
    p=n[k]/&n;          /* point estimate of probability "pk" (k=1, 2, 3) */
    s=sqrt(p*(1-p)/&n); /* approx. standard error of the above point estimate */
    lcl[1,k]=p-z*s;     /* CI 1 */
    ucl[1,k]=p+z*s;
    lcl[2,k]=p-b*s;     /* CI 2 */
    ucl[2,k]=p+b*s;
    lcl[3,k]=(x+2*n[k]-sqrt(x*(x+4*n[k]*(&n-n[k])/&n)))/(2*(&n+x)); /* CI 3 */
    ucl[3,k]=(x+2*n[k]+sqrt(x*(x+4*n[k]*(&n-n[k])/&n)))/(2*(&n+x));
    do m=1 to 3;        /* check if confidence regions cover the true parameter vector */
      y[m]=(lcl[m,1] <= &p1 <= ucl[m,1] &
            lcl[m,2] <= &p2 <= ucl[m,2] &
            lcl[m,3] <= &p3 <= ucl[m,3]);
    end;
  end;
  output;
end;
keep y:;
run;

/* Estimate coverage probabilities of the three confidence regions */

ods select BinomialCLs;
proc freq data=sim;
tables y: / binomial(level='1' cl=exact);
run;

/* Results:

     Point est.    Clopper-Pearson 95% CI
                   
 (1) 0.8551        0.8549         0.8554
 (2) 0.9402        0.9401         0.9404
 (3) 0.9633        0.9632         0.9634
*/

(Actually, this could be coded more elegantly by restricting the calculations to the finitely many [<<1E7] possible outcomes of the particular multinomial distribution and weighing the results by the corresponding probabilities. But the run time of the above code was only less than 1 minute.)

 

So, for the particular multinomial distribution Mult(100, 0.1, 0.18, 0.72), only the third confidence region achieves the nominal 95% confidence level. The second comes close, but the first fails, as expected.

 

Conclusion: The PROC CATMOD approach presented in SAS Usage Note 32609 does not seem to provide substantial benefits over using PROC FREQ (here with Bonferroni adjustment):

ods exclude BinomialTest;
proc freq data=a;
weight count / zeros;
tables y / binomial (level='1') alpha=0.01666666666667;
tables y / binomial (level='2') alpha=0.01666666666667;
tables y / binomial (level='3') alpha=0.01666666666667;
run;

There are alternative confidence regions in the literature which can be implemented easily in a data step.

 

SAS Super FREQ
Posts: 3,310

Re: 95% CI for Categorical Variables

If anyone is curious about how to simulate multinomial data in SAS, there are several options

Occasional Contributor LLW
Occasional Contributor
Posts: 15

Re: 95% CI for Categorical Variables

Thank you so much everyone for your help!

 

I adapted the code to account for every single race value in my "race variable" and "missing" so that my 95% CI would account for missing values as well.

 

proc freq data=data;
tables race: / missing binomial(level='1' cl=exact);
tables race: / missing binomial(level='2' cl=exact);
tables race: / missing binomial(level='3' cl=exact);
tables race: / missing binomial(level='4' cl=exact);
tables race: / missing binomial(level='5' cl=exact);
tables race: / missing binomial(level='6' cl=exact);
run;

Trusted Advisor
Posts: 1,114

Re: 95% CI for Categorical Variables

This is fine if your focus is on the individual confidence intervals rather than a simultaneous confidence region.

SAS Super FREQ
Posts: 3,310

Re: 95% CI for Categorical Variables

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 708 views
  • 8 likes
  • 3 in conversation