Statistical Procedures

Programming the statistical procedures from SAS
BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
LLW
Fluorite | Level 6 LLW
Fluorite | Level 6

Hi,

 

I am wondering how to complete 95% CI for a categorical variable such as race or income bracket.

I know how to do it for a binary variable:

 

proc univariate data=data  cibasic(alpha=0.05);   var binary; run;

 

But how do I do it for a categorical variable such as race?

 

So that I get the following

 

Race

White X, 95% CI x1-x2

Black Y, 95% CI Y1-Y2

Asian Z, 95% cI Z1-Z2

and so on.

 

THanks so much in advance,

Louise

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19
/* Simulation to determine coverage probabilities of the following three confidence regions for
   the parameter vector (p1, p2, p3) of a Mult(100, p1, p2, p3) (multinomial) distribution with
   p1=0.1, p2=0.18, p3=0.72 (example parameter values taken from SAS Usage Note 32609)

   (1) 95% Wald confidence intervals for the individual probabilities (as computed by PROC CATMOD
       in SAS Usage Note 32609, but the same as available from PROC FREQ except for the [trivial]
       limitation of confidence bounds to [0, 1] in PROC FREQ)
   (2) 98.333...% Wald confidence intervals as above (Bonferroni adjustment: (1-0.05/3)*100%), as
       also suggested in R.G. Miller, Simultaneous Statistical Inference, p. 216, formula 10
   (3) Bonferroni-adjusted CIs suggested in formula 14, p. 217, ibid., based on chi-square statistics
*/

%let n=100;   /* parameters of multinomial distribution */
%let p1=0.1;
%let p2=0.18;
%let p3=0.72;
%let c=3;     /* number of categories */
%let a=0.05;  /* alpha */

data sim;
call streaminit(271828);
z=probit(1-&a/2);       /* normal quantile without Bonferroni adjustment */ 
b=probit(1-&a/(2*&c));  /* normal quantile with Bonferroni adjustment */
x=cinv(1-&a/&c,1);      /* chi-square quantile, df=1, with Bonferroni adjustment */
array n[3];             /* samples of the multinomial distribution */
array lcl[3,3];         /* lower confidence limits (indices e.g. [2,3] for interval 2, param. p3) */
array ucl[3,3];         /* upper confidence limits (indices as above) */
array y[3];             /* 0-1 indicators for "(p1,p2,p3) contained in conf. region (1, 2, 3)" */
do i=1 to 1e7;          /* sample size for the simulation */
  n1=0; n2=0; n3=0;     /* initialization */
  do j=1 to &n;         /* generate sample from multinomial distribution */
    v=rand('table', &p1, &p2, &p3);
    n[v]+1;
  end;
  do k=1 to 3;          /* calculate CIs for p1, p2, p3 */
    p=n[k]/&n;          /* point estimate of probability "pk" (k=1, 2, 3) */
    s=sqrt(p*(1-p)/&n); /* approx. standard error of the above point estimate */
    lcl[1,k]=p-z*s;     /* CI 1 */
    ucl[1,k]=p+z*s;
    lcl[2,k]=p-b*s;     /* CI 2 */
    ucl[2,k]=p+b*s;
    lcl[3,k]=(x+2*n[k]-sqrt(x*(x+4*n[k]*(&n-n[k])/&n)))/(2*(&n+x)); /* CI 3 */
    ucl[3,k]=(x+2*n[k]+sqrt(x*(x+4*n[k]*(&n-n[k])/&n)))/(2*(&n+x));
    do m=1 to 3;        /* check if confidence regions cover the true parameter vector */
      y[m]=(lcl[m,1] <= &p1 <= ucl[m,1] &
            lcl[m,2] <= &p2 <= ucl[m,2] &
            lcl[m,3] <= &p3 <= ucl[m,3]);
    end;
  end;
  output;
end;
keep y:;
run;

/* Estimate coverage probabilities of the three confidence regions */

ods select BinomialCLs;
proc freq data=sim;
tables y: / binomial(level='1' cl=exact);
run;

/* Results:

     Point est.    Clopper-Pearson 95% CI
                   
 (1) 0.8551        0.8549         0.8554
 (2) 0.9402        0.9401         0.9404
 (3) 0.9633        0.9632         0.9634
*/

(Actually, this could be coded more elegantly by restricting the calculations to the finitely many [<<1E7] possible outcomes of the particular multinomial distribution and weighing the results by the corresponding probabilities. But the run time of the above code was only less than 1 minute.)

 

So, for the particular multinomial distribution Mult(100, 0.1, 0.18, 0.72), only the third confidence region achieves the nominal 95% confidence level. The second comes close, but the first fails, as expected.

 

Conclusion: The PROC CATMOD approach presented in SAS Usage Note 32609 does not seem to provide substantial benefits over using PROC FREQ (here with Bonferroni adjustment):

ods exclude BinomialTest;
proc freq data=a;
weight count / zeros;
tables y / binomial (level='1') alpha=0.01666666666667;
tables y / binomial (level='2') alpha=0.01666666666667;
tables y / binomial (level='3') alpha=0.01666666666667;
run;

There are alternative confidence regions in the literature which can be implemented easily in a data step.

 

View solution in original post

8 REPLIES 8
Rick_SAS
SAS Super FREQ

PROC UNIVARIATE only analyzes continuous variables.  For discrete variables, you have to use other procedures such as FREQ, LOGISTIC, GENMOD, and CATMOD.

 

It sounds like you want the confidence intervals for a multinomial variable. An internet search reveals the following possibilities:

If you describe the source of your data, we might be able to make particular recommendations.

FreelanceReinh
Jade | Level 19

Hi Louise,

 

I see, Rick was faster. I also had a look at that SAS Usage Note 32609. Depending on whether you have a SAS/IML license, there is the "MULTINOM" or the PROC CATMOD approach.

 

In either case one could add SAS code to automate the process to some extent, so that you don't have to enter, e.g., the response statement for PROC CATMOD manually.

 

Also, I haven't checked yet if these CIs are identical to those in R.G. Miller, Simultaneous Statistical Inference, p. 217, eqn. 14. (Not sure if this is relevant to you.) Both refer to the Bonferroni adjustment.

Edit: I've checked this now. No, the definition in that classic reference is very different. Those CIs are not symmetric about the point estimate. Please see my next post for more details.

FreelanceReinh
Jade | Level 19

I have investigated the CIs proposed in SAS Usage Note 32609 with a primary focus on the PROC CATMOD approach as I don't have SAS/IML licensed.

 

It turned out that the CIs produced by PROC CATMOD (at least in the example given) are merely the same that could be produced by PROC FREQ using the BINOMIAL option of the TABLES statement: the simple, well-known approximate CI for a single proportion. I've observed only three marginal differences:

  1. PROC FREQ, unlike PROC CATMOD, limits the confidence bounds to [0, 1].
  2. Both procedures have difficulties with zero frequencies when using default settings.
    PROC FREQ has the ZEROS option of the WEIGHT statement to produce the degenerate interval [0, 0] in this case.
    There doesn't seem to exist an equivalent option in PROC CATMOD.
  3. PROC FREQ is much faster (but appears to be still slower than calculating the CI in a data step).

Given this comparison, I don't know why that SAS Usage Note suggests PROC CATMOD to perform a typical task for PROC FREQ.

Moreover, there seems to be an error in Example 2 of the document: It says that "Bonferroni adjusted 95% confidence intervals" are calculated using the "MULTINOM" module and they call this module with the appropriate parameter ("S") for simultaneous CIs, but the results presented in the table contain only unadjusted 95% CIs! These are identical to those computed with PROC CATMOD subsequently.

 

Both PROC FREQ (TABLES statement) and PROC CATMOD (MODEL statement) have the common ALPHA= option, so that the Bonferroni adjustment can be achieved by increasing the nominal confidence level (from 95% to 98.333...% in the example with 3 categories).

 

To demonstrate the difference between adjusted and unadjusted CI in terms of coverage probability and to compare these CIs to that (from R.G. Miller's monograph) mentioned in my previous post, I've performed a quick (run time <1 min) simulation. It involves 10 million random samples from one particular multinomial distribution. Please see my next post.

FreelanceReinh
Jade | Level 19
/* Simulation to determine coverage probabilities of the following three confidence regions for
   the parameter vector (p1, p2, p3) of a Mult(100, p1, p2, p3) (multinomial) distribution with
   p1=0.1, p2=0.18, p3=0.72 (example parameter values taken from SAS Usage Note 32609)

   (1) 95% Wald confidence intervals for the individual probabilities (as computed by PROC CATMOD
       in SAS Usage Note 32609, but the same as available from PROC FREQ except for the [trivial]
       limitation of confidence bounds to [0, 1] in PROC FREQ)
   (2) 98.333...% Wald confidence intervals as above (Bonferroni adjustment: (1-0.05/3)*100%), as
       also suggested in R.G. Miller, Simultaneous Statistical Inference, p. 216, formula 10
   (3) Bonferroni-adjusted CIs suggested in formula 14, p. 217, ibid., based on chi-square statistics
*/

%let n=100;   /* parameters of multinomial distribution */
%let p1=0.1;
%let p2=0.18;
%let p3=0.72;
%let c=3;     /* number of categories */
%let a=0.05;  /* alpha */

data sim;
call streaminit(271828);
z=probit(1-&a/2);       /* normal quantile without Bonferroni adjustment */ 
b=probit(1-&a/(2*&c));  /* normal quantile with Bonferroni adjustment */
x=cinv(1-&a/&c,1);      /* chi-square quantile, df=1, with Bonferroni adjustment */
array n[3];             /* samples of the multinomial distribution */
array lcl[3,3];         /* lower confidence limits (indices e.g. [2,3] for interval 2, param. p3) */
array ucl[3,3];         /* upper confidence limits (indices as above) */
array y[3];             /* 0-1 indicators for "(p1,p2,p3) contained in conf. region (1, 2, 3)" */
do i=1 to 1e7;          /* sample size for the simulation */
  n1=0; n2=0; n3=0;     /* initialization */
  do j=1 to &n;         /* generate sample from multinomial distribution */
    v=rand('table', &p1, &p2, &p3);
    n[v]+1;
  end;
  do k=1 to 3;          /* calculate CIs for p1, p2, p3 */
    p=n[k]/&n;          /* point estimate of probability "pk" (k=1, 2, 3) */
    s=sqrt(p*(1-p)/&n); /* approx. standard error of the above point estimate */
    lcl[1,k]=p-z*s;     /* CI 1 */
    ucl[1,k]=p+z*s;
    lcl[2,k]=p-b*s;     /* CI 2 */
    ucl[2,k]=p+b*s;
    lcl[3,k]=(x+2*n[k]-sqrt(x*(x+4*n[k]*(&n-n[k])/&n)))/(2*(&n+x)); /* CI 3 */
    ucl[3,k]=(x+2*n[k]+sqrt(x*(x+4*n[k]*(&n-n[k])/&n)))/(2*(&n+x));
    do m=1 to 3;        /* check if confidence regions cover the true parameter vector */
      y[m]=(lcl[m,1] <= &p1 <= ucl[m,1] &
            lcl[m,2] <= &p2 <= ucl[m,2] &
            lcl[m,3] <= &p3 <= ucl[m,3]);
    end;
  end;
  output;
end;
keep y:;
run;

/* Estimate coverage probabilities of the three confidence regions */

ods select BinomialCLs;
proc freq data=sim;
tables y: / binomial(level='1' cl=exact);
run;

/* Results:

     Point est.    Clopper-Pearson 95% CI
                   
 (1) 0.8551        0.8549         0.8554
 (2) 0.9402        0.9401         0.9404
 (3) 0.9633        0.9632         0.9634
*/

(Actually, this could be coded more elegantly by restricting the calculations to the finitely many [<<1E7] possible outcomes of the particular multinomial distribution and weighing the results by the corresponding probabilities. But the run time of the above code was only less than 1 minute.)

 

So, for the particular multinomial distribution Mult(100, 0.1, 0.18, 0.72), only the third confidence region achieves the nominal 95% confidence level. The second comes close, but the first fails, as expected.

 

Conclusion: The PROC CATMOD approach presented in SAS Usage Note 32609 does not seem to provide substantial benefits over using PROC FREQ (here with Bonferroni adjustment):

ods exclude BinomialTest;
proc freq data=a;
weight count / zeros;
tables y / binomial (level='1') alpha=0.01666666666667;
tables y / binomial (level='2') alpha=0.01666666666667;
tables y / binomial (level='3') alpha=0.01666666666667;
run;

There are alternative confidence regions in the literature which can be implemented easily in a data step.

 

Rick_SAS
SAS Super FREQ

If anyone is curious about how to simulate multinomial data in SAS, there are several options

LLW
Fluorite | Level 6 LLW
Fluorite | Level 6

Thank you so much everyone for your help!

 

I adapted the code to account for every single race value in my "race variable" and "missing" so that my 95% CI would account for missing values as well.

 

proc freq data=data;
tables race: / missing binomial(level='1' cl=exact);
tables race: / missing binomial(level='2' cl=exact);
tables race: / missing binomial(level='3' cl=exact);
tables race: / missing binomial(level='4' cl=exact);
tables race: / missing binomial(level='5' cl=exact);
tables race: / missing binomial(level='6' cl=exact);
run;

FreelanceReinh
Jade | Level 19

This is fine if your focus is on the individual confidence intervals rather than a simultaneous confidence region.

sas-innovate-wordmark-2025-midnight.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 8 replies
  • 11716 views
  • 10 likes
  • 3 in conversation