DD6410
Calcite | Level 5

Hi,

 

I have a dataset with the following columns:

 

Member ID   Center   Mother_Race
1           0        African American
2           0        White
3           1        White
4           0        Hispanic
5           1        Other
6           1        Hispanic
7           0        Not Reported
Etc.

 

 

 

I want to test the difference in column proportions between each level of the race variable (5 level categorical) by the center variable (dichotomous). I've run a chi-square test, which gives me the overall p-value:

 

proc freq data=have;
title "Chi Square Test";
tables mother_race*center / chisq measures
       plots=(freqplot(twoway=groupvertical scale=percent));
run;

 

Now I want to test the difference in column percentages for each level of the race variable in order to produce a table like this:

 

 

Mother Race     Center=0 (%)   Center=1 (%)
Hispanic        50.0           46.9
Black           26.3**         39.8
White           18.6*          8.3
Other           2.7            2.4
Not Reported    2.4            2.6
Total           100            100

 

I've tried doing this through proc logistic (below) but I lose one of the race levels to a reference group.

 

proc logistic data=have descending;
 class center / param=ref;
 model mother_race = center / link=glogit;
 run;

 

Is there a popular/accepted way to test the levels of a categorical variable across a binary variable? I feel like there must be an easy solution, but I'm new to SAS (9.4) and can't seem to figure it out. If it matters, the "center" variable is not balanced in terms of sample size (1,100 vs. 18,000). Thanks for any guidance.

 

Daniel

3 REPLIES
PGStats
Opal | Level 21

A binomial test for each mother race?

 

proc sort data=have; by mother_race; run;

proc freq data=have;
title "Binomial Tests for each Mother Race";
by mother_race;
tables center / binomial;
run;

 

PG
DD6410
Calcite | Level 5

Thank you, PG. The binomial test seems to be testing the difference in row proportions for each race level. (For example, the race level "White" has 3,354 cases in center=0 and 93 cases in center=1, so the binomial test is testing the difference between 97.3% and 2.7%.)

 

Instead, I'd like to test the difference in column proportions. Among all center cases, 8.3% are white. Among all non-center cases, 18.6% are white. This is the difference I'd like to test. Thanks for your help.
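
For concreteness, the comparison described above is a two-sample test of proportions. A rough worked sketch follows, using the White counts quoted above (3,354 and 93) and the approximate group sizes from the original post (18,000 and 1,100). Those totals are rounded, so they are an assumption here and the resulting percentages and statistic are only approximate.

```latex
\[
\hat p_0 = \frac{3354}{18000} \approx 0.186, \qquad
\hat p_1 = \frac{93}{1100} \approx 0.085, \qquad
\hat p = \frac{3354 + 93}{18000 + 1100} \approx 0.180
\]
\[
z = \frac{\hat p_0 - \hat p_1}
         {\sqrt{\hat p\,(1 - \hat p)\left(\frac{1}{18000} + \frac{1}{1100}\right)}}
  \approx \frac{0.1018}{0.01194} \approx 8.5
\]
```

With the pooled standard error, z squared equals the Pearson chi-square statistic of the 2x2 table of White vs. not White by center, so a chi-square test on that collapsed table addresses the same hypothesis.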

sld
Rhodochrosite | Level 12

I'll put forth this code suggestion. I'm not 100% confident (Edit, OK, 95% confident) in its validity, but others are welcome to critique. (Edit: I just noticed that you provided the actual sample sizes. You can swap them in the code below.)

 

/*  Make up some data */
data have;
    /* Total sample size for center=0 is 200 */
    mrace="H"; center=0; count=round(0.500*200, 1); output;
    mrace="B"; center=0; count=round(0.263*200, 1); output;
    mrace="W"; center=0; count=round(0.186*200, 1); output;
    mrace="O"; center=0; count=round(0.027*200, 1); output;
    mrace="N"; center=0; count=round(0.024*200, 1); output;
    /* Total sample size for center=1 is 240 */
    mrace="H"; center=1; count=round(0.469*240, 1); output;
    mrace="B"; center=1; count=round(0.398*240, 1); output;
    mrace="W"; center=1; count=round(0.083*240, 1); output;
    mrace="O"; center=1; count=round(0.024*240, 1); output;
    mrace="N"; center=1; count=round(0.026*240, 1); output;
    run;
/*  Check column totals */
proc means data=have sum;
    var count;
    by center;
    run;
/*  Create offset variable */
data have;
    set have;
    if center=0 then total=200;
    else if center=1 then total=241; /* effect of rounding to integers */
    total_log = log(total);
    grand_total_log = log(200+241);
    run;
/*  Chi-square test of homogeneity of proportions */
proc freq data=have;
    table mrace*center / chisq;
    weight count;
    run;
/*  Log-linear model approach */
/*  Define proportions on column totals using offset */
proc genmod data=have;
    class center mrace;
    model count = center mrace center*mrace / dist=poisson type3 offset=total_log ;
    lsmeans center*mrace / ilink;
    lsmestimate center*mrace
        "B diff" 1 0 0 0 0 -1 0 0 0 0,
        "H diff" 0 1 0 0 0 0 -1 0 0 0,
        "N diff" 0 0 1 0 0 0 0 -1 0 0,
        "O diff" 0 0 0 1 0 0 0 0 -1 0,
        "W diff" 0 0 0 0 1 0 0 0 0 -1
        / adjust=simulate(seed=12345);
    run;
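
A short sketch of why the offset turns the Poisson model into a model for column proportions. The notation is introduced here for illustration: mu_ij is the expected count for race i in center j, n_j is the column total, and eta_ij is the linear predictor without the offset. It also assumes the LS-means are formed from the model parameters alone, without the offset term.

```latex
\[
\log \mu_{ij} = \log n_j + \eta_{ij}
\quad\Longrightarrow\quad
\pi_{ij} = \frac{\mu_{ij}}{n_j} = \exp(\eta_{ij})
\]
\[
\eta_{i0} - \eta_{i1} = \log\frac{\pi_{i0}}{\pi_{i1}},
\qquad
H_0:\ \eta_{i0} - \eta_{i1} = 0 \iff \pi_{i0} = \pi_{i1}
\]
```

Under that reading, the ILINK option reports the estimated proportions, and each LSMESTIMATE row tests a log ratio of proportions against zero, which is equivalent to testing equal proportions for that race level.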

An alternative approach, which I think is what @PGStats had in mind, is to create a subset of the data with one level of mother_race versus all others combined. You'd then need to consider some form of Type I error control for the family of 5 tests, which you could do with the MULTTEST procedure. For example, for mrace=B:

 

/*  Create a variable with levels (B=1, notB=0) */
data dsB;
    set have;
    B = (mrace="B");
    run;
proc sort data=dsB;
    by center B;
proc means data=dsB noprint;
    by center B;
    var count;
    output out=testB sum=count;
    run;
data testB;
    set testB;
    if center=0 then total=200;
    else if center=1 then total=241;
    total_log = log(total);
    grand_total_log = log(200+241);
    run;
/*  Chi-square test of homogeneity of proportions */
proc freq data=testB;
    table B*center / chisq; 
    weight count;
    run;
/*  Log-linear model approach */
/*  Define proportions on column totals */
proc genmod data=testB;
    class center B;
    model count = center B center*B / dist=poisson type3 offset=total_log ; 
    lsmeans center*B / ilink;
    lsmestimate center*B "B diff" 0 1 0 -1 ; /* p value here is based on z-test, not chi-sq */
    run;

Ideally in this case, the test for the center*B interaction (generated by the MODEL statement) would match the test of equal B proportions between center levels (generated by the LSMESTIMATE statement). The two tests tell the same story (yay!), but the test statistics are not the same, so the p-values are not the same. I don't know if there's a way to get LSMESTIMATE to produce a chi-square test; I was not successful with what I tried. But the z-test might be good enough, especially if the sample sizes are large.
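
One way the MULTTEST step mentioned above might be sketched, assuming the made-up have data set from the first code block (variables mrace, center, count). The data set names one_vs_rest, chisq_out and pvals and the variable is_level are placeholders introduced here. The sketch relies on the ChiSq ODS table from PROC FREQ and on PROC MULTTEST reading raw p-values from a variable named raw_p; check both against your SAS release before relying on it.

```sas
/* Stack one copy of the count data per race level,            */
/* flagging that level (1) versus all other levels (0)         */
data one_vs_rest;
    set have;
    length level $1;
    do level = "B", "H", "N", "O", "W";
        is_level = (mrace = level);
        output;
    end;
    run;
proc sort data=one_vs_rest;
    by level;
    run;
/* One 2x2 chi-square test per race level; capture the results */
ods output ChiSq=chisq_out;
proc freq data=one_vs_rest;
    by level;
    tables is_level*center / chisq;
    weight count;
    run;
/* Keep the Pearson chi-square p-value for each level          */
data pvals;
    set chisq_out;
    where Statistic = "Chi-Square";
    raw_p = Prob;
    keep level raw_p;
    run;
/* Adjust the family of 5 p-values                             */
proc multtest inpvalues=pvals holm bonferroni;
    run;
```

The five tests are not independent because the proportions within a center sum to 1. Holm and Bonferroni are used here since they do not require independence.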

 

 

 

 
