12-13-2016 05:34 PM
Hi,
I have a dataset with the following columns:
Member ID  Center  Mother_Race 
1  0  African American 
2  0  White 
3  1  White 
4  0  Hispanic 
5  1  Other 
6  1  Hispanic 
7  0  Not Reported 
Etc. 


I want to test the difference in column proportions between each level of the race variable (5-level categorical) by the center variable (dichotomous). I've run a chi-square test, which gives me the overall p-value:
proc freq data=have;
title "Chi Square Test";
tables mother_race*center / chisq measures
plots=(freqplot(twoway=groupvertical scale=percent));
run;
Now I want to test the difference in column percentages for each level of the race variable in order to produce a table like this:
 Center=0  Center=1 
Mother Race  %  % 
Hispanic  50.0  46.9 
Black  26.3**  39.8 
White  18.6*  8.3 
Other  2.7  2.4 
Not Reported  2.4  2.6 
Total  100  100 
I've tried doing this through proc logistic (below) but I lose one of the race levels to a reference group.
proc logistic data=have descending;
class center / param=ref;
model mother_race = center / link=glogit;
run;
Is there a popular/accepted way to test the levels of a categorical variable across a binary variable? I feel like there must be an easy solution, but I'm new to SAS (9.4) and can't seem to figure it out. If it matters, the "center" variable is not balanced in terms of sample size (1,100 vs. 18,000). Thanks for any guidance.
Daniel
12-13-2016 06:27 PM
A binomial test for each mother race?
proc sort data=have; by mother_race; run;
proc freq data=have;
title "Binomial Tests for each Mother Race";
by mother_race;
tables center / binomial;
run;
12-14-2016 09:58 AM
Thank you, PG. The binomial test seems to be testing the difference in row proportions for each race level. (For example, the race level "White" has 3,354 cases in center=0 and 93 cases in center=1, so the binomial test is testing the difference between 97.3% and 2.7%.)
Instead, I'd like to test the difference in column proportions. Among all center cases, 8.3% are white. Among all non-center cases, 18.6% are white. This is the difference I'd like to test. Thanks for your help.
12-14-2016 12:54 PM  edited 12-15-2016 01:17 AM
I'll put forth this code suggestion. I'm not 100% confident (Edit, OK, 95% confident) in its validity, but others are welcome to critique. (Edit: I just noticed that you provided the actual sample sizes. You can swap them in the code below.)
/* Make up some data */
data have;
  /* Total sample size for center=0 is 200 */
  mrace="H"; center=0; count=round(0.500*200, 1); output;
  mrace="B"; center=0; count=round(0.263*200, 1); output;
  mrace="W"; center=0; count=round(0.186*200, 1); output;
  mrace="O"; center=0; count=round(0.027*200, 1); output;
  mrace="N"; center=0; count=round(0.024*200, 1); output;
  /* Total sample size for center=1 is 240 */
  mrace="H"; center=1; count=round(0.469*240, 1); output;
  mrace="B"; center=1; count=round(0.398*240, 1); output;
  mrace="W"; center=1; count=round(0.083*240, 1); output;
  mrace="O"; center=1; count=round(0.024*240, 1); output;
  mrace="N"; center=1; count=round(0.026*240, 1); output;
run;

/* Check column totals */
proc means data=have sum;
  var count;
  by center;
run;
/* Create offset variable */
data have;
  set have;
  if center=0 then total=200;
  else if center=1 then total=241; /* effect of rounding to integers */
  total_log = log(total);
  grand_total_log = log(200+241);
run;

/* Chi-square test of homogeneity of proportions */
proc freq data=have;
  table mrace*center / chisq;
  weight count;
run;

/* Loglinear model approach */
/* Define proportions on column totals using offset */
proc genmod data=have;
  class center mrace;
  model count = center mrace center*mrace / dist=poisson type3 offset=total_log;
  lsmeans center*mrace / ilink;
  /* Each row is center=0 minus center=1 for one race level; */
  /* coefficients follow the LS-means order (center 0: B H N O W, then center 1: B H N O W) */
  lsmestimate center*mrace
    "B diff" 1 0 0 0 0 -1 0 0 0 0,
    "H diff" 0 1 0 0 0 0 -1 0 0 0,
    "N diff" 0 0 1 0 0 0 0 -1 0 0,
    "O diff" 0 0 0 1 0 0 0 0 -1 0,
    "W diff" 0 0 0 0 1 0 0 0 0 -1
    / adjust=simulate(seed=12345);
run;
An alternative approach, which I think is what @PGStats had in mind, is to create a subset of the data with one level of mother_race versus all others combined. You'd then need to consider some form of Type I error control for the family of 5 tests, which you could do with the MULTTEST procedure. For example, for mrace=B:
/* Create a variable with levels (B=1, notB=0) */
data dsB;
  set have;
  B = (mrace="B");
run;

proc sort data=dsB;
  by center B;
run;

/* Collapse counts to the 2x2 table: center by B */
proc means data=dsB noprint;
  by center B;
  var count;
  output out=testB sum=count;
run;

data testB;
  set testB;
  if center=0 then total=200;
  else if center=1 then total=241;
  total_log = log(total);
  grand_total_log = log(200+241);
run;

/* Chi-square test of homogeneity of proportions */
proc freq data=testB;
  table B*center / chisq;
  weight count;
run;

/* Loglinear model approach */
/* Define proportions on column totals */
proc genmod data=testB;
  class center B;
  model count = center B center*B / dist=poisson type3 offset=total_log;
  lsmeans center*B / ilink;
  /* B=1 at center=0 minus B=1 at center=1 */
  lsmestimate center*B "B diff" 0 1 0 -1; /* p-value here is based on z-test, not chisq */
run;
Ideally in this case, the test for mrace*center (generated by the MODEL statement) would match the test of B proportions equal between center levels (generated by the LSMESTIMATE statement); the two tests deliver the same story (yay!), but the test statistics are not the same, hence the p-values are not the same. I don't know if there's a way to get LSMESTIMATE to produce a chi-square test; I was not successful with what I tried. But the z-test might be good enough, especially if sample sizes are big enough.
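For the "one level versus all others" approach plus MULTTEST, here is a rough, untested sketch of how the five tests could be run in a loop and the p-values adjusted together. It assumes the count-level dataset have (mrace, center, count) from above. The macro name one_vs_rest, the work datasets _one, _chi, _p and pvals, and the variable is_level are all names I made up for illustration. It also assumes that the PROC FREQ ODS table is named ChiSq and that the Pearson row is labeled "Chi-Square"; check your own ODS output before relying on that.

```sas
/* Sketch: one 2x2 chi-square test per race level, then adjust the p-values. */
/* Assumes dataset HAVE with variables mrace, center, count (see above).     */
%macro one_vs_rest(levels=B H N O W);
  %local i lev;
  %do i = 1 %to %sysfunc(countw(&levels));
    %let lev = %scan(&levels, &i);

    /* Indicator: this level (1) versus all other levels combined (0) */
    data _one;
      set have;
      is_level = (mrace = "&lev");
    run;

    /* Capture the chi-square table for this 2x2 comparison */
    ods output ChiSq=_chi;
    proc freq data=_one;
      tables is_level*center / chisq;
      weight count;
    run;

    /* Keep the Pearson p-value; PROC MULTTEST expects it in raw_p */
    data _p;
      length level $1;
      set _chi;
      where Statistic = "Chi-Square";
      level = "&lev";
      raw_p = Prob;
      keep level raw_p;
    run;

    /* Stack the p-values across levels */
    proc append base=pvals data=_p;
    run;
  %end;
%mend one_vs_rest;

/* Start from an empty slate so reruns do not append duplicates */
proc datasets lib=work nolist nowarn;
  delete pvals;
quit;

%one_vs_rest()

/* Adjust the family of raw p-values (Holm and Bonferroni shown) */
proc multtest inpvalues=pvals holm bonferroni;
run;
```

Holm and Bonferroni are just two of the adjustments MULTTEST offers; which one is appropriate is a judgment call for the family of 5 tests.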