Test difference in proportions for each level of categorical variable across binary variable

DD6410 · Posted 12-13-2016 05:34 PM

Hi,

I have a dataset with the following columns:

Member ID	Center	Mother_Race
1	0	African American
2	0	White
3	1	White
4	0	Hispanic
5	1	Other
6	1	Hispanic
7	0	Not Reported
Etc.

I want to test the difference in column proportions between each level of the race variable (5 level categorical) by the center variable (dichotomous). I've run a chi-square test, which gives me the overall p-value:

proc freq data=have;
    title "Chi Square Test";
    tables mother_race*center / chisq measures
           plots=(freqplot(twoway=groupvertical scale=percent));
run;

Now I want to test the difference in column percentages for each level of the race variable in order to produce a table like this:

	Center=0	Center=1
Mother Race	%	%
Hispanic	50.0	46.9
Black	26.3**	39.8
White	18.6*	8.3
Other	2.7	2.4
Not Reported	2.4	2.6
Total	100	100

I've tried doing this through proc logistic (below) but I lose one of the race levels to a reference group.

proc logistic data=have descending;
    class center / param=ref;
    model mother_race = center / link=glogit;
run;

Is there a popular/accepted way to test the levels of a categorical variable across a binary variable? I feel like there must be an easy solution, but I'm new to SAS (9.4) and can't seem to figure it out. If it matters, the "center" variable is not balanced in terms of sample size (1,100 vs. 18,000). Thanks for any guidance.

Daniel

PGStats · Posted 12-13-2016 06:27 PM

A binomial test for each mother race?

proc sort data=have; by mother_race; run;

proc freq data=have;
title "Binomial Tests for each Mother Race";
by mother_race;
tables center / binomial;
run;

PG

DD6410 · Posted 12-14-2016 09:58 AM

Thank you, PG. The binomial test seems to be testing the difference in row proportions for each race level. (For example, the race level "White" has 3,354 cases in center=0 and 93 cases in center=1, so the binomial test is testing the difference between 97.3% and 2.7%.)

Instead, I'd like to test the difference in column proportions. Among all center cases, 8.3% are white. Among all non-center cases, 18.6% are white. This is the difference I'd like to test. Thanks for your help.
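
For a single level, the column-proportion comparison described here can be sketched as a 2x2 test of that level versus all others combined. This is only a minimal sketch: it assumes the raw data set `have` with the variables `center` and `mother_race` from the first post, and the names `white_vs_rest` and `white` are made up for illustration.

```sas
/* Sketch only: assumes HAVE holds one row per member with center (0/1) and mother_race */
data white_vs_rest;
    set have;
    white = (mother_race = "White");   /* 1 = White, 0 = any other level */
run;

proc freq data=white_vs_rest;
    /* Rows = center, so the row percentages are the share of White mothers within each center */
    /* CHISQ tests whether that share is equal across centers; */
    /* RISKDIFF reports the two row proportions and their difference */
    tables center*white / chisq riskdiff nocol nopercent;
run;
```

With `white` coded 0/1, the "Column 2" risk in the RISKDIFF output is the proportion with white=1 in each center. Repeating this for each of the five levels gives five tests, so some adjustment for multiple comparisons is worth considering.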

sld · Posted 12-14-2016 12:54 PM

I'll put forth this code suggestion. I'm not 100% confident (Edit, OK, 95% confident) in its validity, but others are welcome to critique. (Edit: I just noticed that you provided the actual sample sizes. You can swap them in the code below.)

/*  Make up some data */
data have;
    /* Total sample size for center=0 is 200 */
    mrace="H"; center=0; count=round(0.500*200, 1); output;
    mrace="B"; center=0; count=round(0.263*200, 1); output;
    mrace="W"; center=0; count=round(0.186*200, 1); output;
    mrace="O"; center=0; count=round(0.027*200, 1); output;
    mrace="N"; center=0; count=round(0.024*200, 1); output;
    /* Nominal total for center=1 is 240; the rounded counts below sum to 241 */
    mrace="H"; center=1; count=round(0.469*240, 1); output;
    mrace="B"; center=1; count=round(0.398*240, 1); output;
    mrace="W"; center=1; count=round(0.083*240, 1); output;
    mrace="O"; center=1; count=round(0.024*240, 1); output;
    mrace="N"; center=1; count=round(0.026*240, 1); output;
    run;
/*  Check column totals */
proc means data=have sum;
    var count;
    by center;
    run;
/*  Create offset variable */
data have;
    set have;
    if center=0 then total=200;
    else if center=1 then total=241; /* effect of rounding to integers */
    total_log = log(total);
    grand_total_log = log(200+241);
    run;
/*  Chi-square test of homogeneity of proportions */
proc freq data=have;
    table mrace*center / chisq; 
    weight count;
    run;
/*  Log-linear model approach */
/*  Define proportions on column totals using offset */
proc genmod data=have;
    class center mrace;
    model count = center mrace center*mrace / dist=poisson type3 offset=total_log ;
    lsmeans center*mrace / ilink;
    lsmestimate center*mrace "B diff" 1 0 0 0 0 -1 0 0 0 0,
                             "H diff" 0 1 0 0 0 0 -1 0 0 0,
                             "N diff" 0 0 1 0 0 0 0 -1 0 0,
                             "O diff" 0 0 0 1 0 0 0 0 -1 0,
                             "W diff" 0 0 0 0 1 0 0 0 0 -1
        / adjust=simulate(seed=12345);
    run;
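
As a note on what the offset is doing in the GENMOD step above, here is the model written out. The notation is illustrative and not part of the code: n_{cr} is the count for center c and race level r, N_c is the center total, and \eta_{cr} is the linear predictor without the offset. It assumes the LS-means are computed on that linear predictor, excluding the offset, which is what the code relies on.

```latex
\begin{aligned}
\log \mathrm{E}[n_{cr}] &= \log N_c + \eta_{cr}
  \quad\Longrightarrow\quad
  e^{\eta_{cr}} = \frac{\mathrm{E}[n_{cr}]}{N_c} = \pi_{r \mid c}, \\
\eta_{0r} - \eta_{1r} &= \log \frac{\pi_{r \mid 0}}{\pi_{r \mid 1}} .
\end{aligned}
```

Under that assumption, each LSMESTIMATE row estimates the log of the ratio of the two column proportions for one race level, and it is zero exactly when the proportions are equal. The ILINK option on LSMEANS back-transforms \eta_{cr} to the proportion scale.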

An alternative approach, which I think is what @PGStats had in mind, is to collapse mother_race into one level versus all others combined and test that 2x2 table. You'd then need to consider some form of Type I error control for the family of 5 tests, which you could do with the MULTTEST procedure. For example, for mrace=B:

/*  Create a variable with levels (B=1, notB=0) */
data dsB;
    set have;
    B = (mrace="B");
    run;
proc sort data=dsB;
    by center B;
    run;
proc means data=dsB noprint;
    by center B;
    var count;
    output out=testB sum=count;
    run;
data testB;
    set testB;
    if center=0 then total=200;
    else if center=1 then total=241;
    total_log = log(total);
    grand_total_log = log(200+241);
    run;
/*  Chi-square test of homogeneity of proportions */
proc freq data=testB;
    table B*center / chisq; 
    weight count;
    run;
/*  Log-linear model approach */
/*  Define proportions on column totals */
proc genmod data=testB;
    class center B;
    model count = center B center*B / dist=poisson type3 offset=total_log ; 
    lsmeans center*B / ilink;
    lsmestimate center*B "B diff" 0 1 0 -1 ; /* p value here is based on z-test, not chi-sq */
    run;

Ideally, the test for center*B (generated by the MODEL statement) would match the test that the B proportions are equal between center levels (generated by the LSMESTIMATE statement). The two tests tell the same story (yay!), but the test statistics are not the same, so the p-values are not the same either. I don't know if there's a way to get LSMESTIMATE to produce a chi-square test; I was not successful with what I tried. But the z-test might be good enough, especially if sample sizes are big enough.
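
For the Type I error control mentioned above, here is a rough sketch of how all five one-versus-rest chi-square tests might be run and then adjusted with MULTTEST. It assumes the summarized `have` data set from the made-up data step (variables `mrace`, `center`, `count`); the names `onevsrest`, `level`, `is_level`, `chisq_all`, and `pvals` are made up for illustration, and Holm and Bonferroni are just two of the available adjustments.

```sas
/* Stack five one-versus-rest copies of the summarized data */
data onevsrest;
    set have;
    length level $1;
    do level = "B", "H", "N", "O", "W";
        is_level = (mrace = level);   /* 1 = this level, 0 = all others */
        output;
    end;
run;

proc sort data=onevsrest;
    by level;
run;

/* One 2x2 chi-square test per level; capture the test tables via ODS */
ods output ChiSq=chisq_all;
proc freq data=onevsrest;
    by level;
    tables is_level*center / chisq;
    weight count;
run;

/* Keep the Pearson chi-square p-value; MULTTEST looks for a variable named raw_p */
data pvals;
    set chisq_all;
    where Statistic = "Chi-Square";
    raw_p = Prob;
    keep level raw_p;
run;

/* Adjust the five raw p-values */
proc multtest inpvalues=pvals holm bonferroni;
run;
```

The five tests share the same data and are not independent, but Holm and Bonferroni adjustments remain valid under dependence, at the cost of being somewhat conservative.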
