Dear statistical enthusiasts, I'm trying to assess the relationship between cancer incidence and various environmental exposures (&ejvar) using a Poisson regression model at the census tract level. Since cancer incidence is highly age-dependent, I have to adjust my Poisson regression model for age.
My question is: how should I adjust my model for the effect of age?
I came across two main approaches that could be used, but I'm not sure whether one of them is more statistically robust than the other.
The first approach is to use the percent of people in each of the 14 age groups for each census tract. In this case, my data has 4918 observations, one for each distinct census tract of NY state (wide format).
The second approach is to use the number of people in each of the 14 age groups, listed vertically, so my data has 4918 × 14 age groups = 68,852 observations (long format).
I hope the attached images help to show what I mean by wide and long formats.
I thought either way would do and lead to comparable results. However, the model fit statistics (AIC, BIC and the full likelihood) indicate that the wide-format data, with the percent in each age group, fit better. The final results also differ depending on which of the two approaches is used for the age adjustment.
I'd like to confirm with you: which approach is statistically more robust, and which has worked better in your experience, if any?
The SAS code for the wide- and long-format data follows.
/* Wide format: one row per census tract, age entered as percent in each age group */
PROC GENMOD DATA=MYDATA;
  MODEL N_TRACT = &ejvar AGE5 AGE59 AGE85 AGE1014 AGE1519 AGE2024 AGE2529
        AGE3034 AGE3539 AGE4044 AGE4549 AGE5054 AGE5559 AGE6064 AGE6569 AGE7074
        AGE7579 AGE8084 POVERTY / DIST=POISSON LINK=LOG OFFSET=LN MAXITER=1000;
RUN;
/* Long format: one row per census tract and age group, age entered as a CLASS variable */
PROC GENMOD DATA=MYDATA;
  CLASS AGECAT(REF='10') / PARAM=REF;
  MODEL N_TRACT = &ejvar AGECAT POVERTY / DIST=POISSON LINK=LOG
        OFFSET=LN MAXITER=1000;
RUN;
So let me see if I am understanding this properly.
In the one case, you are using dummy variables (0s and 1s) and in the other case you are using the actual numeric value of 14 different numeric variables.
Is that right?
@PaigeMiller , I appreciate your picking up on my question. Thanks a lot! I really have to get this right, obviously.
No, I'm not using dummy variables. In the wide format, with _N_ = 4918 (the number of census tracts within NY state), everything I have and use in the model is shown in the PROC PRINT (OBS=5) output below. The percent in each age group and poverty by census tract are based on the American Community Survey 2011-2015. This makes my PROC GENMOD as follows:
PROC GENMOD DATA=mydata;
MODEL N_TRACT= X_CANCER AGE5 AGE59 AGE85 AGE1014 AGE1519 AGE2024 AGE2529 AGE3034
AGE3539 AGE4044 AGE4549 AGE5054 AGE5559 AGE6064 AGE6569 AGE7074 AGE7579 AGE8084
POVERTY/DIST=POISSON LINK=LOG OFFSET=LN MAXITER=1000;
RUN;
QUIT;
Obs | FIPS_TRACT | AGE5 | AGE59 | AGE1014 | AGE1519 | AGE2024 | AGE2529 | AGE3034 | AGE3539 | AGE4044 | AGE4549 | AGE5054 | AGE5559 | AGE6064 | AGE6569 | AGE7074 | AGE7579 | AGE8084 | AGE85 | X_CANCER | POVERTY | LN | TOTAL | N_TRACT |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 36001000100 | 11 | 6 | 5.8 | 6.8 | 4.5 | 5.1 | 8 | 5.4 | 4.8 | 6.8 | 7.7 | 7.5 | 8.1 | 6.5 | 2.5 | 1.4 | 1.4 | 0.7 | 42.5793 | 3.8 | 7.56993 | 1939 | 1 |
2 | 36001000200 | 5.5 | 7.5 | 6.9 | 5.5 | 12.9 | 8.4 | 8.2 | 2.5 | 9.5 | 5.7 | 12 | 6.1 | 3.2 | 2.3 | 1.5 | 0.4 | 1.2 | 0.9 | 53.1528 | 3.7 | 8.46189 | 4731 | 1 |
3 | 36001000300 | 6.6 | 4.4 | 9.3 | 3.1 | 12.5 | 8 | 6.6 | 2.8 | 3.7 | 5 | 5.4 | 9.9 | 3.8 | 4.2 | 5.4 | 3.3 | 1.8 | 4 | 41.8343 | 3.4 | 8.62299 | 5558 | 2 |
4 | 36001000401 | 4.5 | 1.1 | 1.4 | 1.2 | 0 | 7.1 | 8.3 | 1.3 | 5.9 | 3.1 | 7.8 | 4.9 | 4.2 | 6.5 | 3.5 | 4.4 | 8.7 | 25.9 | 32.4226 | 6.0 | 7.80751 | 2459 | 0 |
5 | 36001000403 | 6.6 | 1.8 | 3.5 | 7 | 14.5 | 7.1 | 10.5 | 5.6 | 4.5 | 7.1 | 7.2 | 4.9 | 3.9 | 4.5 | 5.5 | 1.7 | 1.7 | 2.4 | 40.1327 | 1.6 | 8.45425 | 4695 | 1 |
In contrast, the long-format model and the data for a single census tract are shown below.
PROC GENMOD DATA=mydata;
CLASS AGECAT(REF='10')/PARAM=REF;
MODEL N_TRACT= X_CANCER AGECAT POVERTY/DIST=POISSON LINK=LOG OFFSET=LN MAXITER=1000;
RUN;
QUIT;
Obs | FIPS_TRACT | AGECAT | N_TRACT | POVERTY | X_CANCER | LN | TRACT_POP |
---|---|---|---|---|---|---|---|
1 | 36001000100 | 1 | 0 | 38.2 | 42.5793 | 6.75344 | 857 |
2 | 36001000100 | 10 | 0 | 38.2 | 42.5793 | 4.43082 | 84 |
3 | 36001000100 | 11 | 0 | 38.2 | 42.5793 | 3.82864 | 46 |
4 | 36001000100 | 12 | 0 | 38.2 | 42.5793 | 3.40120 | 30 |
5 | 36001000100 | 13 | 0 | 38.2 | 42.5793 | 3.58352 | 36 |
6 | 36001000100 | 14 | 0 | 38.2 | 42.5793 | 3.09104 | 22 |
7 | 36001000100 | 2 | 0 | 38.2 | 42.5793 | 4.98361 | 146 |
8 | 36001000100 | 3 | 0 | 38.2 | 42.5793 | 4.90527 | 135 |
9 | 36001000100 | 4 | 1 | 38.2 | 42.5793 | 4.86753 | 130 |
10 | 36001000100 | 5 | 0 | 38.2 | 42.5793 | 4.95583 | 142 |
11 | 36001000100 | 6 | 0 | 38.2 | 42.5793 | 5.06260 | 158 |
12 | 36001000100 | 7 | 0 | 38.2 | 42.5793 | 5.01064 | 150 |
13 | 36001000100 | 8 | 0 | 38.2 | 42.5793 | 4.84419 | 127 |
14 | 36001000100 | 9 | 0 | 38.2 | 42.5793 | 4.33073 | 76 |
Sorry for the separate posts. I thought my posts were too long because of the screenshots. My question is: which approach adjusts the model for the effect of age appropriately in a population-based study like this? My final results differ between the wide- and long-format approaches. That makes me wonder whether I'm doing the wide format (using percent in each age group) right. Someone who does this kind of modelling frequently suggested the wide-format approach to me. When you mention dummy variables, I feel like I may be missing a piece of the puzzle here. If so, how would I introduce dummy variables in this context? Thanks PaigeMiller!
A model with CLASS variables is never going to give the same results as a model without CLASS variables.
Further, I don't see how the data from FIPS 36001000100 in the wide format was translated into the FIPS 36001000100 in the long format. But it probably doesn't matter.
Which is correct/better? Well, that probably requires a better understanding of the data than I have, and a better understanding of the end goals of this analysis, which I don't have. But I will say I don't usually like the idea of using CLASS variables when your data really isn't class variables, and in order for this to make sense, you need to have a very strong justification for using the CLASS variables.
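To make the difference concrete, the two specifications can be written out side by side. This is only a sketch, and the notation is not from the posts above: $i$ indexes census tracts, $k$ indexes age groups, $x_i$ is the exposure (X_CANCER), $z_i$ is POVERTY, $p_{ik}$ is the percent of tract $i$'s population in age group $k$, and $N_i$ and $N_{ik}$ are the tract and tract-by-age populations whose logs enter as the offset LN. It assumes that N_TRACT in the long data is the case count for that tract and age group.

```latex
% Wide format: one observation per tract, age enters as tract-level composition
\log \mathrm{E}[Y_i] = \log N_i + \beta_0 + \beta_1 x_i + \beta_2 z_i + \sum_{k} \gamma_k \, p_{ik}

% Long format: one observation per tract and age group, age enters as a CLASS effect
\log \mathrm{E}[Y_{ik}] = \log N_{ik} + \beta_0 + \beta_1 x_i + \beta_2 z_i + \alpha_k,
\qquad \alpha_{10} = 0 \quad \text{(reference level, REF='10')}
```

Under these assumptions the $\gamma_k$ describe how the tract-level rate changes with the tract's age composition, while the $\alpha_k$ are age-specific log rate ratios relative to the reference group, so the two sets of estimates are not expected to agree. The two models are also fit to different responses (4918 tract totals versus 68,852 tract-by-age counts), so their AIC, BIC and likelihood values are not directly comparable.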
I agree with everything you said. Makes sense to me. In addition, I dropped one of the age groups from the model so that the percentages don't sum to 100% and cause multicollinearity. As far as cancer is concerned, dropping the earlier age group made sense. Not to my surprise, this improved the model fit.
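For anyone following along, the change can be sketched roughly as below. The choice of AGE5 as the omitted group is an assumption made for illustration; the post only says that an earlier age group was dropped. The data set and variable names are the ones shown in the wide-format PROC PRINT output.

```sas
/* Sketch only: wide-format model with one age-percent column left out     */
/* so that the remaining percentages no longer sum to 100 in every tract.  */
/* Omitting AGE5 is an assumption; the post does not say which was dropped. */
PROC GENMOD DATA=mydata;
  MODEL N_TRACT = X_CANCER
        AGE59 AGE1014 AGE1519 AGE2024 AGE2529 AGE3034 AGE3539 AGE4044
        AGE4549 AGE5054 AGE5559 AGE6064 AGE6569 AGE7074 AGE7579 AGE8084 AGE85
        POVERTY
        / DIST=POISSON LINK=LOG OFFSET=LN MAXITER=1000;
RUN;
```

With one group omitted, each remaining age coefficient is read relative to that group: it describes the change in the log rate when one percentage point of the population shifts from the omitted group into the group in question.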