Cruise
Ammonite | Level 13

Dear statistical enthusiasts, I'm trying to assess the relationship between cancer incidence and various environmental exposures (&ejvar) using a Poisson regression model at the census tract level. Since cancer incidence is highly age-dependent, I have to adjust my Poisson regression model for age.

 

My question is: how do I adjust my model for the effect of age?

 

I came across two main approaches that could be used, but I'm not sure whether either one is statistically more robust than the other.

  1. The first approach is to use the percent of people in each of the 14 age groups for each census tract. In this case, my data has 4918 observations, one for each distinct census tract in NY state (wide format).

  2. The second approach is to use the number of people in each of the 14 age groups, listed vertically, so my data has about 4918 * 14 = 68,852 observations (long format).

I hope the attached images help to show what I mean by wide and long formats.

I thought either way would do and lead to comparable results. However, the model fit statistics (AIC, BIC, and the full log likelihood) indicate that the wide-format data, with the percent in each age group, fits better. Also, the final results vary depending on which of the two approaches is used for the age adjustment.

I'd like to confirm with you: which approach is statistically more robust, and which has worked better in your experience, if any?

The SAS code for the wide- and long-format data follows.

 

/* Wide format: one row per census tract, age entered as percent per age group */
PROC GENMOD DATA=MYDATA;
MODEL N_TRACT= &ejvar AGE5 AGE59 AGE85 AGE1014 AGE1519 AGE2024 AGE2529
AGE3034 AGE3539 AGE4044 AGE4549 AGE5054 AGE5559 AGE6064 AGE6569 AGE7074
AGE7579 AGE8084 POVERTY/DIST=POISSON LINK=LOG OFFSET=LN MAXITER=1000;
RUN;

/* Long format: one row per census tract and age group, age entered as a CLASS variable */
PROC GENMOD DATA=MYDATA;
CLASS AGECAT(REF='10')/PARAM=REF;
MODEL N_TRACT= &ejvar AGECAT POVERTY/DIST=POISSON LINK=LOG
OFFSET=LN MAXITER=1000;
RUN;
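
As a reading aid (not part of the original post), the two PROC GENMOD calls above can be written out as models. The notation is an assumption made for illustration: $Y$ is the case count (N_TRACT), $N$ is the population whose log is the offset (LN), $x_i$ is the exposure (&ejvar), $z_i$ is POVERTY, $p_{ik}$ is the percent of tract $i$ in age group $k$, and $\alpha_k$ is the effect of AGECAT level $k$ relative to the reference level 10.

```latex
\begin{align*}
\text{Wide (one row per tract } i\text{):}\quad
  \log \mathbb{E}[Y_i] &= \log N_i + \beta_0 + \beta_1 x_i + \delta z_i + \sum_{k} \gamma_k\, p_{ik} \\
\text{Long (one row per tract } i \text{ and age group } k\text{):}\quad
  \log \mathbb{E}[Y_{ik}] &= \log N_{ik} + \beta_0 + \beta_1 x_i + \delta z_i + \alpha_k,
  \qquad \alpha_{10} = 0
\end{align*}
```

Written this way, the wide model treats each age share as a continuous covariate with slope $\gamma_k$, while the long model estimates a separate log rate ratio $\alpha_k$ for each age group. The two likelihoods are also computed over different sets of observations (4918 tract counts versus about 68,852 tract-by-age counts), so AIC and BIC values from the two fits are generally not directly comparable.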

[Attached images: AGE_ADJUSTMENT.png; WIDE VS LONG DATA FORMATS.png]

1 ACCEPTED SOLUTION
PaigeMiller
Diamond | Level 26

A model with CLASS variables is never going to give the same results as a model without CLASS variables.

 

Further, I don't see how the data from FIPS 36001000100 in the wide format was translated into the FIPS 36001000100 in the long format. But it probably doesn't matter.

 

Which is correct/better? Well, that probably requires a better understanding of the data than I have, and a better understanding of the end goals of this analysis, which I don't have. But I will say I don't usually like the idea of using CLASS variables when your data really isn't class variables, and in order for this to make sense, you need to have a very strong justification for using the CLASS variables.

--
Paige Miller

6 REPLIES
PaigeMiller
Diamond | Level 26

So let me see if I am understanding this properly.

 

In the one case, you are using dummy variables (0s and 1s) and in the other case you are using the actual numeric value of 14 different numeric variables.

 

Is that right?

--
Paige Miller
Cruise
Ammonite | Level 13

@PaigeMiller, I appreciate you picking up on my question. Thanks a lot! I really have to get this right, obviously.

 

No, I'm not using dummy variables. In the wide format with N=4918 (the number of census tracts in NY state), everything I have and use in the model is shown in the PROC PRINT (OBS=5) output below. The percent in each age group and the poverty value by census tract are based on the American Community Survey 2011-2015. This makes my PROC GENMOD as follows:

 

PROC GENMOD DATA=mydata;
MODEL N_TRACT= X_CANCER AGE5 AGE59 AGE85 AGE1014 AGE1519 AGE2024 AGE2529 AGE3034 
AGE3539 AGE4044 AGE4549 AGE5054 AGE5559 AGE6064 AGE6569 AGE7074 AGE7579 AGE8084 
POVERTY/DIST=POISSON LINK=LOG OFFSET=LN MAXITER=1000;
RUN;
QUIT;
Obs FIPS_TRACT AGE5 AGE59 AGE1014 AGE1519 AGE2024 AGE2529 AGE3034 AGE3539 AGE4044 AGE4549 AGE5054 AGE5559 AGE6064 AGE6569 AGE7074 AGE7579 AGE8084 AGE85 X_CANCER POVERTY LN TOTAL N_TRACT
1 36001000100 11 6 5.8 6.8 4.5 5.1 8 5.4 4.8 6.8 7.7 7.5 8.1 6.5 2.5 1.4 1.4 0.7 42.5793 3.8 7.56993 1939 1
2 36001000200 5.5 7.5 6.9 5.5 12.9 8.4 8.2 2.5 9.5 5.7 12 6.1 3.2 2.3 1.5 0.4 1.2 0.9 53.1528 3.7 8.46189 4731 1
3 36001000300 6.6 4.4 9.3 3.1 12.5 8 6.6 2.8 3.7 5 5.4 9.9 3.8 4.2 5.4 3.3 1.8 4 41.8343 3.4 8.62299 5558 2
4 36001000401 4.5 1.1 1.4 1.2 0 7.1 8.3 1.3 5.9 3.1 7.8 4.9 4.2 6.5 3.5 4.4 8.7 25.9 32.4226 6.0 7.80751 2459 0
5 36001000403 6.6 1.8 3.5 7 14.5 7.1 10.5 5.6 4.5 7.1 7.2 4.9 3.9 4.5 5.5 1.7 1.7 2.4 40.1327 1.6 8.45425 4695 1
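
In the listing above, LN appears to be the natural log of TOTAL (for example, log(1939) is about 7.56993), which is the form OFFSET=LN expects for a rate model with a log link. A minimal sketch of how such an offset could be built, assuming TOTAL holds the tract population; the output data set name mydata_offset is hypothetical:

```sas
/* Sketch only: build the log-population offset used by OFFSET=LN.   */
/* Assumes TOTAL is the tract population; mydata_offset is a         */
/* hypothetical output data set name.                                */
DATA mydata_offset;
  SET mydata;
  /* LOG() is the natural logarithm in SAS.                          */
  /* Tracts with zero population get a missing offset and are        */
  /* excluded from the model fit.                                    */
  IF TOTAL > 0 THEN LN = LOG(TOTAL);
  ELSE LN = .;
RUN;
```

The same pattern would apply to the long format, with the age-group population (TRACT_POP in the long listing) in place of TOTAL.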
Cruise
Ammonite | Level 13

In contrast, the long-format data for a single census tract is shown below, and my model is written accordingly.

 

PROC GENMOD DATA=mydata;
CLASS AGECAT(REF='10')/PARAM=REF;
MODEL N_TRACT= X_CANCER AGECAT POVERTY/DIST=POISSON LINK=LOG OFFSET=LN MAXITER=1000;
RUN;
QUIT;
Obs FIPS_TRACT AGECAT N_TRACT POVERTY X_CANCER LN TRACT_POP
1 36001000100 1 0 38.2 42.5793 6.75344 857
2 36001000100 10 0 38.2 42.5793 4.43082 84
3 36001000100 11 0 38.2 42.5793 3.82864 46
4 36001000100 12 0 38.2 42.5793 3.40120 30
5 36001000100 13 0 38.2 42.5793 3.58352 36
6 36001000100 14 0 38.2 42.5793 3.09104 22
7 36001000100 2 0 38.2 42.5793 4.98361 146
8 36001000100 3 0 38.2 42.5793 4.90527 135
9 36001000100 4 1 38.2 42.5793 4.86753 130
10 36001000100 5 0 38.2 42.5793 4.95583 142
11 36001000100 6 0 38.2 42.5793 5.06260 158
12 36001000100 7 0 38.2 42.5793 5.01064 150
13 36001000100 8 0 38.2 42.5793 4.84419 127
14 36001000100 9 0 38.2 42.5793 4.33073 76
Cruise
Ammonite | Level 13

@PaigeMiller ,

Sorry for the separate posts. I thought my posts were too long because of the screenshots. My question is: which approach appropriately adjusts the model for the effect of age in a population-based study like this? My final results differ between the wide- and long-format approaches. That makes me wonder whether I'm doing the wide format (using the percent in each age group) right. Someone who does this kind of modelling frequently suggested the wide-format approach to me. When you mention dummy variables, I feel like I may be missing a piece of the puzzle here. If so, how would I introduce dummy variables in this context? Thanks, PaigeMiller!

Cruise
Ammonite | Level 13

@PaigeMiller 

 

I agree with everything you said. Makes sense to me. In addition, I dropped one of the age groups from the model so that the percents don't sum to 100% and cause multicollinearity. As far as cancer is concerned, dropping an early age group made sense. Not to my surprise, this improved the model fit.
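
For readers following along, here is a minimal sketch of what the wide-format model might look like with one age share left out as the reference. The post does not say which group was dropped; AGE5 is assumed here purely for illustration, and the other variable names are taken from the listing earlier in the thread.

```sas
/* Sketch only: wide-format Poisson model with one age share omitted. */
/* ASSUMPTION: AGE5 is the omitted (reference) age group; the post    */
/* does not state which group was actually dropped.                   */
PROC GENMOD DATA=mydata;
  MODEL N_TRACT = X_CANCER
        AGE59 AGE1014 AGE1519 AGE2024 AGE2529 AGE3034 AGE3539 AGE4044
        AGE4549 AGE5054 AGE5559 AGE6064 AGE6569 AGE7074 AGE7579 AGE8084
        AGE85
        POVERTY
        / DIST=POISSON LINK=LOG OFFSET=LN MAXITER=1000;
RUN;
```

With one share omitted, each remaining age coefficient is read relative to the omitted group: it describes the change in the log rate when one percentage point of the population shifts from the omitted group into that age group.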
