Solved: Re: 95% CONFIDENCE INTERVALS for categorical variables

Sakshi13 · Posted 10-16-2018 11:14 PM

Hello,

I am trying to calculate 95% CI for categorical variables.

My data is:

ID         1         2             3         4
Race       Asian     white         black     multiracial
Ethnicity  hispanic  non hispanic  hispanic  hispanic

I need the output as:

Race:       Asian         n, n%, 95% CI
            White         n, n%, 95% CI
            Black         n, n%, 95% CI
            Multiracial   n, n%, 95% CI
Ethnicity:  Hispanic      n, n%, 95% CI
            Non hispanic  n, n%, 95% CI

Please suggest.

I did try PROC FREQ:

by race;
tabels race/binomial;
run;

But it does not give me the required results.

Please suggest, thank you.

Rick_SAS · Posted 10-17-2018 09:31 AM

For binary categories, you can use the BINOMIAL option and specify the level that you want the CI for. (The CI for the other level is 1 minus this CI, with the endpoints swapped; alternatively, you can add another TABLES statement, as shown in the comment below.)

proc freq data=sashelp.class;
   tables sex / binomial(level='F');
   /* The CI for level='M' is 1 minus the CI for 'F'; or use another TABLES statement:
   tables sex / binomial(level='M');
   */
run;

The CIs for multinomial proportions are more challenging because you have to distinguish between individual CIs and simultaneous CIs. Read FreelanceReinhard's suggestions (simulation and Bonferroni adjustments), or see the implementation I have provided for computing simultaneous confidence intervals for multinomial proportions.
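
As a rough sketch of the Bonferroni idea mentioned above: with k categories, each individual binomial CI can be requested at level alpha/k, so that the k intervals together have at least 95% coverage. The example assumes the sashelp.cars data set and its three-level Origin variable purely for illustration; ALPHA= is the TABLES statement option that sets the confidence level. This is neither the simulation approach nor the implementation referred to above, only the simplest adjustment.

```sas
/* Bonferroni-adjusted individual CIs: a sketch, not the linked implementation. */
/* Assumption: sashelp.cars, where Origin has k = 3 levels (Asia, Europe, USA). */
/* 0.05 / 3 = 0.016667, so the three intervals jointly cover at least 95%.      */
proc freq data=sashelp.cars;
   tables origin / binomial(level='Asia')   alpha=0.016667;
   tables origin / binomial(level='Europe') alpha=0.016667;
   tables origin / binomial(level='USA')    alpha=0.016667;
run;
```

Bonferroni intervals tend to be conservative (wider than necessary), which is one reason dedicated simultaneous methods exist.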


PaigeMiller · Posted 10-17-2018 12:50 AM

Maybe the problem is that you have misspelled TABLES.

"It does not give the required results": tell us or show us what you did get. If there is an error message, show us the relevant portions of the SAS log by clicking on {i} and then pasting the relevant part of the log into that window.

Otherwise, it would really help if you showed us a portion of your data as it exists, by using this method https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...

--
Paige Miller

Sakshi13 · Posted 10-17-2018 11:48 AM

Hello, this is my sample data:

data WORK.EXAMPLE;
infile datalines dsd truncover;   /* DSD: values are comma-delimited */
input ID:32. RACE:$99. VETRNSTAT:$9. GNDR:$18.;
label ID="ID" RACE="RACE" VETRNSTAT="VETRNSTAT" GNDR="GNDR";
datalines;
.
1,ASIAN,YES,male
2,ASIAN,YES,female
3,black or african american,YES,female
4,black or african american,YES,female
5,american indian or alaskan native,NO,male
6,other multi racial,YES,male
7,other multi racial,NO,female
8,american indian or alaskan native,YES,female
;;;;

Below is the output after sorting the data by race and running PROC FREQ with BY race. For each race, it treats that race as the only race and then computes the CI. For example, for American Indians the sample size is 3, when it should be the total sample size of 10. Is this the correct way of computing the CI? There is no error in the log.

SAS Output

The SAS System

The FREQ Procedure

RACE=american indian or alaskan native

RACE
RACE                               Frequency  Percent  Cumulative Frequency  Cumulative Percent
american indian or alaskan native          3   100.00                     3              100.00

Binomial Proportion
RACE = american indian or alaskan native
Proportion              1.0000
ASE                     0.0000
95% Lower Conf Limit    1.0000
95% Upper Conf Limit    1.0000

Exact Conf Limits
95% Lower Conf Limit    0.2924
95% Upper Conf Limit    1.0000

Test of H0: Proportion = 0.5
ASE under H0            0.2887
Z                       1.7321
One-sided Pr > Z        0.0416
Two-sided Pr > |Z|      0.0833
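
The proportion of 1.0000 above arises because the BY statement splits the data into one group per race, so within each group that race is the only level present. A minimal sketch of one alternative follows: count each level against the whole sample and form an individual Wald 95% CI per level in a DATA step. It assumes the WORK.EXAMPLE data set posted above; the output data set names (race_counts, race_ci) and the macro variable total are illustrative. These are individual intervals, not simultaneous ones.

```sas
/* Sketch: n, percent and an individual Wald 95% CI for every level of RACE, */
/* computed against the whole sample (no BY statement).                      */
/* Assumption: WORK.EXAMPLE as posted above; output names are illustrative.  */
proc freq data=work.example noprint;
   tables race / out=work.race_counts;    /* one row per level: COUNT, PERCENT */
run;

proc sql noprint;
   select sum(count) into :total trimmed  /* nonmissing sample size */
   from work.race_counts
   where not missing(race);
quit;

data work.race_ci;
   set work.race_counts;
   where not missing(race);
   p     = count / &total;                /* sample proportion       */
   ase   = sqrt(p * (1 - p) / &total);    /* asymptotic std error    */
   z     = probit(1 - 0.05/2);            /* normal quantile, ~1.96  */
   lower = max(0, p - z * ase);           /* truncate to [0, 1]      */
   upper = min(1, p + z * ase);
run;

proc print data=work.race_ci noobs;
   var race count percent p lower upper;
   format p lower upper 6.4;
run;
```

With samples this small the Wald interval is a poor approximation, so exact or adjusted limits may be preferable.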

Ksharp · Posted 10-17-2018 09:15 AM

I remember that @Rick_SAS wrote a blog post about confidence intervals for the multinomial distribution.

Sakshi13 · Posted 10-17-2018 12:14 PM

I recoded my race variable as 1, 2, 3, 4, and then used this code:

proc freq data=sugary.combined;
tables race1/binomial(level="1");
tables race1/binomial(level="2");
tables race1/binomial(level="3");
tables race1/binomial(level="4");
run;

There is no error in the log, and I got a CI for every race. Is this wrong, given that race is not binomial? I am not able to copy the output; how do I do that? But did you get the idea?

SAS Output

The SAS System

The FREQ Procedure

race1  Frequency  Percent  Cumulative Frequency  Cumulative Percent
1              2    22.22                     2               22.22
2              2    22.22                     4               44.44
3              3    33.33                     7               77.78
4              2    22.22                     9              100.00
Frequency Missing = 1

Binomial Proportion
race1 = 1
Proportion              0.2222
ASE                     0.1386
95% Lower Conf Limit    0.0000
95% Upper Conf Limit    0.4938

Exact Conf Limits
95% Lower Conf Limit    0.0281
95% Upper Conf Limit    0.6001

Test of H0: Proportion = 0.5
ASE under H0            0.1667
Z                      -1.6667
One-sided Pr < Z        0.0478
Two-sided Pr > |Z|      0.0956

Sample Size = 9
Frequency Missing = 1

race1  Frequency  Percent  Cumulative Frequency  Cumulative Percent
1              2    22.22                     2               22.22
2              2    22.22                     4               44.44
3              3    33.33                     7               77.78
4              2    22.22                     9              100.00
Frequency Missing = 1

Binomial Proportion
race1 = 2
Proportion              0.2222
ASE                     0.1386
95% Lower Conf Limit    0.0000
95% Upper Conf Limit    0.4938

Exact Conf Limits
95% Lower Conf Limit    0.0281
95% Upper Conf Limit    0.6001

Test of H0: Proportion = 0.5
ASE under H0            0.1667
Z                      -1.6667
One-sided Pr < Z        0.0478
Two-sided Pr > |Z|      0.0956

Sample Size = 9
Frequency Missing = 1

race1  Frequency  Percent  Cumulative Frequency  Cumulative Percent
1              2    22.22                     2               22.22
2              2    22.22                     4               44.44
3              3    33.33                     7               77.78
4              2    22.22                     9              100.00
Frequency Missing = 1

Binomial Proportion
race1 = 3
Proportion              0.3333
ASE                     0.1571
95% Lower Conf Limit    0.0254
95% Upper Conf Limit    0.6413

Exact Conf Limits
95% Lower Conf Limit    0.0749
95% Upper Conf Limit    0.7007

Test of H0: Proportion = 0.5
ASE under H0            0.1667
Z                      -1.0000
One-sided Pr < Z        0.1587
Two-sided Pr > |Z|      0.3173

Sample Size = 9
Frequency Missing = 1

race1  Frequency  Percent  Cumulative Frequency  Cumulative Percent
1              2    22.22                     2               22.22
2              2    22.22                     4               44.44
3              3    33.33                     7               77.78
4              2    22.22                     9              100.00
Frequency Missing = 1

Binomial Proportion
race1 = 4
Proportion              0.2222
ASE                     0.1386
95% Lower Conf Limit    0.0000
95% Upper Conf Limit    0.4938

Exact Conf Limits
95% Lower Conf Limit    0.0281
95% Upper Conf Limit    0.6001

Test of H0: Proportion = 0.5
ASE under H0            0.1667
Z                      -1.6667
One-sided Pr < Z        0.0478
Two-sided Pr > |Z|      0.0956

Sample Size = 9
Frequency Missing = 1
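
To see where the limits in this output come from, here is a small sketch that recomputes them for race1 = 1 (x = 2 of n = 9). It assumes the usual Wald formula, truncated to [0, 1], and the Clopper-Pearson form of the exact limits based on beta quantiles; the variable names are illustrative. The printed values should agree with the table above (0.2222, 0.1386, 0.0000, 0.4938, 0.0281, 0.6001).

```sas
/* Sketch: recompute the limits shown for race1 = 1 (x = 2 of n = 9). */
/* The beta-quantile form below is valid for 0 < x < n.               */
data _null_;
   x = 2;  n = 9;  alpha = 0.05;
   p   = x / n;                               /* sample proportion     */
   ase = sqrt(p * (1 - p) / n);               /* asymptotic std error  */
   z   = probit(1 - alpha/2);
   wald_lower = max(0, p - z * ase);          /* Wald, truncated at 0  */
   wald_upper = min(1, p + z * ase);
   /* Clopper-Pearson (exact) limits from beta quantiles */
   exact_lower = quantile('BETA', alpha/2,     x,     n - x + 1);
   exact_upper = quantile('BETA', 1 - alpha/2, x + 1, n - x);
   put p= 6.4 ase= 6.4 wald_lower= 6.4 wald_upper= 6.4
       exact_lower= 6.4 exact_upper= 6.4;
run;
```

Each interval here is an individual binomial interval for one level against all others; none of them accounts for the fact that four intervals are being reported at once.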

ballardw · Posted 10-17-2018 11:10 AM

Are you looking for a confidence interval of N, the N% or something else?

Example starting data does help.

Sakshi13 · Posted 10-17-2018 12:01 PM

I am looking for the CI for N

StatDave · Posted 10-17-2018 11:13 AM

See Example 2 in this note.

StatDave · Posted 10-17-2018 03:03 PM

Note that what you are asking for is confidence intervals for a multinomial distribution. Treating it as separate binary distributions will not yield correct results.

Even if you want confidence intervals on the counts (N_i) instead of the probabilities, you can still start with Example 2 in this note. The parameter estimates table gives the estimates of the probabilities and their standard errors. From these you can get the estimates of the counts by multiplying the estimated probabilities by the total sample size (N). The standard error for the estimated count is the total sample size times the probability standard error. You can then form a 95% confidence interval. Using the data in Example 2 from the note:

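/* Data from Example 2 of the note: observed counts for three categories (N = 100) */
data a;
   input y count;
   datalines;
1 10
2 18
3 72
;

/* Estimate the probabilities of categories 1 and 2 and their standard errors */
proc catmod data=a;
   response 1 0 0, 0 1 0;
   weight count;
   model y= ;
   ods output estimates=pe;
run; quit;

/* Scale to counts: N = 100 * probability, standard error = 100 * stderr */
data ci;
   set pe;
   N = 100*estimate;
   lower = N - probit(1-.05/2)*(100*stderr);
   upper = N + probit(1-.05/2)*(100*stderr);
run;

proc print data=ci;
   var n lower upper;
run;

/* Repeat for the probability of category 3 */
proc catmod data=a;
   response 0 0 1;
   weight count;
   model y= ;
   ods output estimates=pe;
run; quit;

data ci;
   set pe;
   N = 100*estimate;
   lower = N - probit(1-.05/2)*(100*stderr);
   upper = N + probit(1-.05/2)*(100*stderr);
run;

proc print data=ci;
   var n lower upper;
run;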
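
As a cross-check on the arithmetic described above, the count intervals can also be sketched in a single DATA step. This assumes that the standard error of each estimated probability equals sqrt(p(1-p)/N), which is an assumption of this sketch rather than something stated in the note; the data set name count_ci and its variable names are illustrative. For y = 1, p = 0.10 and the standard error is sqrt(0.10 * 0.90 / 100) = 0.03, so the count interval is 10 ± 1.96 * 3, or roughly 4.12 to 15.88.

```sas
/* Sketch: count intervals from the closed-form standard error.          */
/* Assumption: stderr of each estimated probability is sqrt(p(1-p)/N).   */
data count_ci;
   input y count;
   total = 100;                        /* 10 + 18 + 72                   */
   p     = count / total;              /* estimated probability          */
   se    = sqrt(p * (1 - p) / total);  /* std error of the probability   */
   z     = probit(1 - 0.05/2);
   lower = count - z * total * se;     /* std error of count = N * se    */
   upper = count + z * total * se;
   datalines;
1 10
2 18
3 72
;

proc print data=count_ci noobs;
   var y count lower upper;
run;
```

If the PROC CATMOD standard errors differ from this closed form, the procedure's own estimates are the ones to trust.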

Sakshi13 · Posted 10-17-2018 04:15 PM

Thank you so much for the quick response 🙂
