Solved: Categorical or percent of age groups better to adjust the model based ...

Cruise · Posted 01-27-2019 03:52 PM

Dear statistical enthusiasts, I'm trying to assess the relationship between cancer incidence and various environmental exposures (&ejvar) using a Poisson regression model at the census tract level. Since cancer incidence is highly age-dependent, I have to adjust my Poisson regression model for age.

My question is: how should I adjust my model for the effect of age?

I came across two main approaches that could be used, but I'm not sure whether one is more statistically robust than the other.

The first approach is to use the percent of people in each of the 14 age groups for each census tract. In this case, my data has 4918 rows, one for each distinct census tract in NY State (wide format).
The second approach is to use the number of people in each of the 14 age groups, listed vertically, so my data has around 4918*14 age groups ~ 68,852 rows (long format).

I hope the attached images help show what I mean by wide and long formats.

I thought either way would do and would lead to comparable results. However, the model fit statistics show that the wide-format data with the percent of each age group fits better based on AIC, BIC and the full likelihood statistics. Also, the final results vary depending on which of these two approaches is used for the age adjustment.

I'd like to confirm with you guys: which approach is statistically more robust, and which has worked better in your experience, if any?

The SAS code for the wide- and long-format data follows.

PROC GENMOD DATA=MYDATA;
MODEL N_TRACT= &ejvar AGE5 AGE59 AGE85 AGE1014 AGE1519 AGE2024 AGE2529 
AGE3034 AGE3539 AGE4044 AGE4549 AGE5054 AGE5559 AGE6064 AGE6569 AGE7074 
AGE7579 AGE8084 POVERTY/DIST=POISSON LINK=LOG OFFSET=LN MAXITER=1000;
RUN;

PROC GENMOD DATA=MYDATA;
CLASS AGECAT(REF='10')/PARAM=REF;
MODEL N_TRACT= &ejvar AGECAT POVERTY/DIST=POISSON LINK=LOG 
OFFSET=LN MAXITER=1000;
RUN;
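
For reference, the two specifications above can be written out as follows. This is a sketch under two assumptions not stated in the post: that N_TRACT is the case count and that LN is the log of the population at risk (the tract total in the wide data, the age-group count in the long data). The symbols Y, N, x, p, gamma, alpha and delta are notation introduced here for illustration only.

```latex
% Wide format: one observation per tract i, age shares p_{ik} as continuous covariates
\log \mathbb{E}[Y_i] = \log N_i + \beta_0 + \beta_1 x_i + \sum_{k} \gamma_k \, p_{ik} + \delta \, \mathrm{poverty}_i

% Long format: one observation per tract i and age category a, AGECAT as a CLASS effect
\log \mathbb{E}[Y_{ia}] = \log N_{ia} + \beta_0 + \beta_1 x_i + \alpha_a + \delta \, \mathrm{poverty}_i,
\qquad \alpha_{10} = 0 \ \text{(reference level)}
```

Under this reading, the wide model has one response per tract and treats each age share as a continuous covariate, while the long model has one response per tract and age group and estimates a separate log rate ratio for each age category. Because the two likelihoods are computed on different response vectors (about 4,918 versus 68,852 observations), their AIC and BIC values are not on a common scale, which may be part of why the fit statistics look so different.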

PaigeMiller · Posted 01-28-2019 08:03 AM

So let me see if I am understanding this properly.

In the one case, you are using dummy variables (0s and 1s) and in the other case you are using the actual numeric value of 14 different numeric variables.

Is that right?

--
Paige Miller

Cruise · Posted 01-28-2019 08:23 AM

@PaigeMiller, I appreciate your picking up on my question. Thanks a lot! I really have to get this right, obviously.

No, I'm not using dummy variables. In the wide format, with _N_=4918 (the number of census tracts within NY State), everything I have and use in the model is shown in the PROC PRINT (obs=5) output below. The percent of each age group and poverty by census tract are based on the American Community Survey 2011-2015. This makes my PROC GENMOD step as follows:

PROC GENMOD DATA=mydata;
MODEL N_TRACT= X_CANCER AGE5 AGE59 AGE85 AGE1014 AGE1519 AGE2024 AGE2529 AGE3034 
AGE3539 AGE4044 AGE4549 AGE5054 AGE5559 AGE6064 AGE6569 AGE7074 AGE7579 AGE8084 
POVERTY/DIST=POISSON LINK=LOG OFFSET=LN MAXITER=1000;
RUN;
QUIT;

Obs	FIPS_TRACT	AGE5	AGE59	AGE1014	AGE1519	AGE2024	AGE2529	AGE3034	AGE3539	AGE4044	AGE4549	AGE5054	AGE5559	AGE6064	AGE6569	AGE7074	AGE7579	AGE8084	AGE85	X_CANCER	POVERTY	LN	TOTAL	N_TRACT
1	36001000100	11	6	5.8	6.8	4.5	5.1	8	5.4	4.8	6.8	7.7	7.5	8.1	6.5	2.5	1.4	1.4	0.7	42.5793	3.8	7.56993	1939	1
2	36001000200	5.5	7.5	6.9	5.5	12.9	8.4	8.2	2.5	9.5	5.7	12	6.1	3.2	2.3	1.5	0.4	1.2	0.9	53.1528	3.7	8.46189	4731	1
3	36001000300	6.6	4.4	9.3	3.1	12.5	8	6.6	2.8	3.7	5	5.4	9.9	3.8	4.2	5.4	3.3	1.8	4	41.8343	3.4	8.62299	5558	2
4	36001000401	4.5	1.1	1.4	1.2	0	7.1	8.3	1.3	5.9	3.1	7.8	4.9	4.2	6.5	3.5	4.4	8.7	25.9	32.4226	6.0	7.80751	2459	0
5	36001000403	6.6	1.8	3.5	7	14.5	7.1	10.5	5.6	4.5	7.1	7.2	4.9	3.9	4.5	5.5	1.7	1.7	2.4	40.1327	1.6	8.45425	4695	1

Cruise · Posted 01-28-2019 08:28 AM

In contrast, the long-format data for a single census tract is shown below. My model is adjusted accordingly.

PROC GENMOD DATA=mydata;
CLASS AGECAT(REF='10')/PARAM=REF;
MODEL N_TRACT= X_CANCER AGECAT POVERTY/DIST=POISSON LINK=LOG OFFSET=LN MAXITER=1000;
RUN;
QUIT;

Obs	FIPS_TRACT	AGECAT	N_TRACT	POVERTY	X_CANCER	LN	TRACT_POP
1	36001000100	1	0	38.2	42.5793	6.75344	857
2	36001000100	10	0	38.2	42.5793	4.43082	84
3	36001000100	11	0	38.2	42.5793	3.82864	46
4	36001000100	12	0	38.2	42.5793	3.40120	30
5	36001000100	13	0	38.2	42.5793	3.58352	36
6	36001000100	14	0	38.2	42.5793	3.09104	22
7	36001000100	2	0	38.2	42.5793	4.98361	146
8	36001000100	3	0	38.2	42.5793	4.90527	135
9	36001000100	4	1	38.2	42.5793	4.86753	130
10	36001000100	5	0	38.2	42.5793	4.95583	142
11	36001000100	6	0	38.2	42.5793	5.06260	158
12	36001000100	7	0	38.2	42.5793	5.01064	150
13	36001000100	8	0	38.2	42.5793	4.84419	127
14	36001000100	9	0	38.2	42.5793	4.33073	76

Cruise · Posted 01-28-2019 08:34 AM

@PaigeMiller,

Sorry for the separate posts. I thought my posts were too long because of the screenshots. My question is: which approach appropriately adjusts the model for the effect of age in a population-based study like this? My final results differ between the wide-format and long-format approaches. That makes me wonder whether I'm doing the wide format (using the percent of age groups) right. Someone who does this kind of modelling frequently suggested the wide-format approach to me. When you mention dummy variables, I feel like, hey, am I missing that piece of the puzzle here? If so, how would I introduce dummy variables in this context? Thanks PaigeMiller!

PaigeMiller · Posted 01-28-2019 08:42 AM

A model with CLASS variables is never going to give the same results as a model without CLASS variables.

Further, I don't see how the data from FIPS 36001000100 in the wide format was translated into the FIPS 36001000100 in the long format. But it probably doesn't matter.

Which is correct/better? Well, that probably requires a better understanding of the data than I have, and a better understanding of the end goals of this analysis, which I don't have. But I will say I don't usually like the idea of using CLASS variables when your data really isn't class variables, and in order for this to make sense, you need to have a very strong justification for using the CLASS variables.

--
Paige Miller
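
One way to look into the mismatch noted above is to aggregate the long data back to tract level and compare it with the wide data. The following is a minimal sketch, not code from the thread: the dataset names WIDE and LONG are hypothetical (both appear as MYDATA in the posts), and it assumes FIPS_TRACT has the same type in both tables.

```sas
/* Hypothetical dataset names:                                   */
/*   WIDE = one row per tract (TOTAL, N_TRACT)                   */
/*   LONG = one row per tract and age group (TRACT_POP, N_TRACT) */
proc sql;
  create table tract_check as
  select w.FIPS_TRACT,
         w.TOTAL,
         w.N_TRACT as cases_wide,
         l.pop_long,
         l.cases_long
  from wide as w
       inner join
       /* Collapse the long data to one row per tract */
       (select FIPS_TRACT,
               sum(TRACT_POP) as pop_long,
               sum(N_TRACT)   as cases_long
        from long
        group by FIPS_TRACT) as l
       on w.FIPS_TRACT = l.FIPS_TRACT
  /* Keep only tracts where population or case totals disagree */
  where w.TOTAL ne l.pop_long
     or w.N_TRACT ne l.cases_long;
quit;

/* Inspect the first few inconsistent tracts, if any */
proc print data=tract_check(obs=20);
run;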

Cruise · Posted 01-28-2019 04:12 PM

@PaigeMiller

I agree with everything you said. Makes sense to me. In addition, I dropped one of the age groups from the model so that the percents no longer sum to 100% and cause multicollinearity. As far as cancer is concerned, dropping the youngest age group made sense. Not to my surprise, this improved the model fit.
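
The collinearity described here arises because the age shares in each tract add up to 100, so together they are an exact linear combination of the intercept. A minimal sketch of the check and the refit is given below. Which group was actually dropped is not stated in the thread, so omitting AGE5 is an assumption, as is the dataset name PCT_CHECK.

```sas
/* Step 1: confirm that the age percentages add up to about 100 per tract */
data pct_check;
  set mydata;
  pct_sum = sum(of AGE5 AGE59 AGE1014 AGE1519 AGE2024 AGE2529 AGE3034
                   AGE3539 AGE4044 AGE4549 AGE5054 AGE5559 AGE6064
                   AGE6569 AGE7074 AGE7579 AGE8084 AGE85);
  /* Keep only tracts whose shares are off by more than rounding error */
  if abs(pct_sum - 100) > 0.5;
run;

/* Step 2: refit with one age share omitted as the reference group. */
/* AGE5 is an assumed choice; the remaining coefficients are then   */
/* interpreted relative to the omitted group.                       */
proc genmod data=mydata;
  model N_TRACT = X_CANCER AGE59 AGE1014 AGE1519 AGE2024 AGE2529 AGE3034
                  AGE3539 AGE4044 AGE4549 AGE5054 AGE5559 AGE6064
                  AGE6569 AGE7074 AGE7579 AGE8084 AGE85 POVERTY
        / dist=poisson link=log offset=LN;
run;
```

If PCT_CHECK comes back empty, the shares sum to 100 everywhere and one of them has to be left out (or the intercept removed) for all remaining age coefficients to be estimable.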
