yubaraj
Fluorite | Level 6

Hi

I am trying to compare flu vaccination (coded 1 or 0) between immigrant and non-immigrant populations.

The covariates I have are current age category, income quintile, urban/rural residence, gender, and years since migration.

Years since migration has 5 categories:

1. before 20 years

2. 10-20 years

3. 3-10 years

4. last 2 years 

5. non-migrants 

I ran a logistic regression model using the following code:

 

proc logistic data=surv1 descending;
  class agecat (param=ref ref='3') GENDER (param=ref ref='M') rural (param=ref ref='0')
        ses (param=ref ref='4') imm_cat (param=ref ref='5');
  /* LACKFIT requests the Hosmer-Lemeshow goodness-of-fit test */
  model doc_visit(ref='0') = agecat GENDER rural ses imm_cat
        / risklimits lackfit selection=stepwise slentry=0.1 slstay=0.05 details;
run;

 

It gives me nice output with all the variables showing an association, but the Hosmer and Lemeshow goodness-of-fit test shows p<0.0001.

When I included an interaction term for age category and imm_cat, the fit improved slightly (the interaction is highly significant), but the goodness-of-fit test still shows p<0.0001.

When I look at the cross-tabulation of age category and doctor visit, the relationship among immigrants is an inverted U shape: you are more likely to get the vaccine in middle age and less likely if you are younger (12-18 years) or 65+.

However, age and vaccination show a 'J'-shaped relationship in non-immigrants: you have the highest chance of vaccination if you are in the oldest group and the lowest chance in middle age.

I was wondering whether this reversed relationship between age category and outcome among migrants and non-migrants is the reason the logistic model does not fit.

Someone suggested that I include a spline effect for age (age as a continuous variable), so I included "agesp" in the model, but the model still does not fit.

 effect agesp = spline(ageyrs / naturalcubic basis=tpf(noint) knotmethod=percentiles(5));
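
For context, here is a minimal sketch of how that EFFECT statement might sit in the full model. It assumes the variable names from the first model (doc_visit coded 1/0, ageyrs continuous) and, as an illustration only, adds an agesp*imm_cat interaction so the age curve is allowed to differ by migration category. Stepwise selection is left out to keep the sketch short; this is not necessarily the exact model that was run.

```sas
/* Sketch only: variable names are taken from the first model in this post. */
proc logistic data=surv1;
  class GENDER (param=ref ref='M') rural (param=ref ref='0')
        ses (param=ref ref='4') imm_cat (param=ref ref='5');
  /* natural cubic spline for continuous age, 5 knots placed at percentiles */
  effect agesp = spline(ageyrs / naturalcubic basis=tpf(noint)
                        knotmethod=percentiles(5));
  /* agesp*imm_cat (an assumption) lets the age shape vary by migration group */
  model doc_visit(event='1') = agesp GENDER rural ses imm_cat agesp*imm_cat
        / lackfit;
run;
```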

I also tried fitting general linear models, but had the same problem. I am using SAS 9.4. Apologies, I cannot share the data.

It would be nice to hear your suggestions.

Thanks

Yuba 

PaigeMiller
Diamond | Level 26

What question is this model supposed to answer?

 

You can't fit a spline to agecat because it is not continuous. You also can't say that "non-migrants" are on the same scale as the <2 years, 3-10 yrs, etc.

--
Paige Miller
yubaraj
Fluorite | Level 6
The questions I am trying to answer are:
1. Do immigrants have a different vaccination probability compared to non-migrants?
2. Does years since migration have an effect on vaccination?
In the second attempt I fitted a spline to age as a continuous variable and did not include age as a categorical variable.
Do you suggest using just two categories (migrants and non-migrants) for question 1, and fitting the model to the immigrant sample only for question 2?
PaigeMiller
Diamond | Level 26

Given the design, as I understand it, where you have age groups for migrants but not for non-migrants (am I understanding that properly?), then I think the best thing you can do is to consider the migration variable as having five categories as you described in your original message

 

1. before 20 years

2. 10-20 years

3. 3-10 years

4. last 2 years

5. non-migrants

 

and then treating this as a CLASS variable in PROC LOGISTIC. You could then determine if the linear effect of age using just the first four groups was statistically significant, but you could not come to any conclusion about the effect of age for non-migrants.

 

Or maybe you could analyze the migrants only, and leave the non-migrants out, in which case you can easily determine the effect of age on migrants. Again, you could not come to any conclusion about the effect of age for non-migrants.

 

Probably too late, but a better design would have been to collect ages for non-migrants as well.

--
Paige Miller
yubaraj
Fluorite | Level 6

Hi

Probably my message was not clear.

I have age information (age category) for both migrants and non migrants. 

 12-17 yrs, 18-29 yrs, 30-49 yrs, 50-64 yrs and 65 yrs and above. 

The variable I mentioned before was time since migration, which was categorized into 5 categories based on how many years had passed since first migration.

The hypothesis was that recent migrants face more barriers.

Time since migration has 4 categories, and since non-migrants do not have this information, they were placed in a fifth category.

1. before 20 years (time since migration is >20 yrs)

2. 10-20 years (time since migration is 10-20 yrs)

3. 3-10 years  (time since migration is  3-10 yrs)

4. last 2 years  (migrated in last 2 years from the study date)

5. non-migrants

 

I also tried merging categories 1-4 (migrants) and using non-migrants as the comparison group.

There is a significant interaction between migration status and current age category. The Hosmer-Lemeshow goodness-of-fit statistic shows poor model fit (p<0.0001).

When I look at the bivariate relationship between age category and vaccination by migration status, the relationship is reversed: the oldest age category (65 yrs and above) has the highest vaccination rate among non-migrants, whereas among migrants it has the lowest.

In fact, the relationship between age category and vaccination (the outcome) is an inverted U shape among migrants and a 'J' shape among non-migrants.

I have current age as both a continuous and a categorical variable for both migrants and non-migrants.

Thank you

 

 

PaigeMiller
Diamond | Level 26

Probably my message was not clear.

I have age information (age category) for both migrants and non migrants. 

12-17 yrs, 18-29 yrs, 30-49 yrs, 50-64 yrs and 65 yrs and above. 

The variable I mentioned before was time since migration, which was categorized into 5 categories based on how many years had passed since first migration.

 

So in light of this new information/clarification, what is your question?

--
Paige Miller
SteveDenham
Jade | Level 19

I almost always prefer a modeling approach to questions like this, but this looks to me to be a classic Cochran-Mantel-Haenszel situation.  Stratifying by age category, you get five 2x2 tables (immigrant status by injection status).  Have you looked at this approach?

 

proc freq data=yourdata;
tables agecat*imm_cat*doc_visit/all commonriskdiff;
run;

I realize this doesn't address gender, income, time since immigration, etc., but this high level analysis will point out whether you need to include other factors.  Since you are employing some sort of stepwise variable selection, it might be good to see first if there is a common risk difference across the age categories.  Then a deeper dive could explain why it differs.
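
To get the five 2x2 tables described above, the immigrant variable has to be binary. A minimal sketch, assuming a 0/1 migrant indicator (called migrant here, a hypothetical name not in the posted code) has been created alongside agecat and doc_visit:

```sas
/* Sketch only: "migrant" is a hypothetical 0/1 flag (1 = any migration category). */
proc freq data=yourdata;
  /* strata = agecat; rows = migrant; columns = doc_visit */
  tables agecat*migrant*doc_visit / cmh;
run;
```

With 2x2 tables in each stratum, the CMH output includes the Breslow-Day test for homogeneity of the odds ratios. A small p-value there suggests the migrant/non-migrant odds ratio differs across age categories, which would argue for an interaction term rather than a single common effect.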

 

SteveDenham

 

yubaraj
Fluorite | Level 6

So, this is the model I am using:

proc logistic data=surv descending;
  class age_cat (param=ref ref='4') GENDER (param=ref ref='M') rural (param=ref ref='1')
        INC_QUINT (param=ref ref='Q5') imm_ref_status (param=ref ref='0');
  /* LACKFIT requests the Hosmer-Lemeshow goodness-of-fit test */
  model full_vaccination1 = age_cat GENDER rural INC_QUINT imm_ref_status
        / risklimits lackfit selection=stepwise slentry=0.1 slstay=0.05 details;
run;

The result shows everything in the model is significant. However, the model fit is poor: the Hosmer-Lemeshow goodness-of-fit test gives p<0.0001.
Adding the interaction term (age_cat*imm_ref_status) slightly improves the model fit, but the test still gives p<0.0001.
Adding higher-order terms for age as a continuous variable (i.e., age*age and age*age*age) or a spline for age does not work either. A log transformation of continuous age also does not produce an adequate fit.
So my question is: what are the alternatives when adequate model fit is not achieved? Should I change my modelling strategy? Any other suggestions would be welcome.
Thank you

SteveDenham
Jade | Level 19

Well, perhaps Hosmer-Lemeshow isn't the best choice for goodness of fit.  Per the documentation:

 

"Hosmer and Lemeshow (2000) proposed a statistic that they show, through simulation, is distributed as chi-square when there is no replication in any of the subpopulations"

I suspect that your categorical variables lead to substantial replication within some of the subpopulations.

 

When you do have replication and a fair-sized sample, check on using the AGGREGATE and SCALE= options on the MODEL statement so you can get the Pearson and Deviance goodness-of-fit tests.  If the data get sparse with all of this, the Details section outlines several other methods.

 

SteveDenham

yubaraj
Fluorite | Level 6

Thanks, SteveDenham, for your suggestions.

Your approach showed that the odds ratios for vaccination by migration status are heterogeneous across age categories.

That means the chance of vaccination by migration status depends on a third variable, i.e., which age category you are in.

That means including an interaction term for migration status (imm_ref_status) and age category (age_cat) was a reasonable approach.

But this still does not answer my question:

What are the alternatives when adequate model fit is not achieved? Should I change my modelling strategy? Any other suggestions would be welcome.


Thank you

yubaraj
Fluorite | Level 6
proc logistic data=surv1 descending;
  class age_cat (param=ref ref='4') GENDER (param=ref ref='M') rural (param=ref ref='1')
        INC_QUINT (param=ref ref='Q5') imm_ref_status (param=ref ref='0');
  /* AGGREGATE= defines the subpopulations; SCALE=NONE requests Pearson and deviance tests without a dispersion correction */
  model full_vaccination1 = age_cat GENDER rural INC_QUINT age_cat*imm_ref_status
        / risklimits selection=stepwise slentry=0.1 slstay=0.05 details
          aggregate=(age_cat imm_ref_status INC_QUINT rural GENDER) scale=none;
run;

I ran the model with the AGGREGATE= option and SCALE=NONE.

The Deviance and Pearson goodness-of-fit tests both give p<0.0001.

 
