Hello everyone,
we are trying to fit a GLM with proc genmod.
The dependent variable is health care cost, and the independent variables are treatment group, age, sex, observation time, several comorbidities, and several medications.
To model the costs, we assumed a gamma distribution and a log link (in the meantime we have also tried other links and distributions).
Now we would like to check the goodness of fit of the model.
For this we examined a plot of the estimated versus the observed costs and a plot of the residuals versus the observed costs.
But in our opinion both plots argue against a good model fit (see attached file). The estimated costs scatter randomly against the observed costs, whereas the residuals show a strong relationship to the observed costs.
Our question is: Are these plots a correct, plausible way to check the model fit for a GLM?
If yes, is there any way to improve the model fit?
We have already tried the different link functions and distributions, as well as transformations of the cost data itself.
The cost data are heavily skewed and include zero costs as well as very high costs, but those low and high costs are of interest as well.
Our program:
proc genmod data=input_data PLOTS=(PREDICTED RESCHI);
class group sex
comorbidity1 comorbidity2 ... /* all 1/0 - Variables */
medication1 medication2 ... /* all 1/0 - Variables */
;
model cost = group
age observation_time
sex CCI_Score
comorbidity1 comorbidity2 ... /* all 1/0 - Variables */
medication1 medication2 ... /* all 1/0 - Variables */
/ dist = gamma link = log ;
output out = Residuals
pred = Pred
resraw = Resraw
reschi = Reschi
;
run;
title 'Proc genmod: Plot of estimated and residuals';
proc gplot data=residuals;
plot pred*cost Reschi*cost;
label cost='cost';
run;
Thanks in advance for any answers
sasstats
I agree that these two plots do not indicate a good fit. However, when you have multiple variables, you need to be a little careful when you create plots like this. You are projecting the predicted responses onto one dimension (cost), whereas a better approach is to slice the predicted response surface. You can use the EFFECTPLOT statement (with the FIT or SLICEFIT options) to create a more effective v... Personally, I don't think it will matter in terms of assessing fit, but the EFFECTPLOT statement is a powerful diagnostic tool that is worth learning about. It should be helpful as you refine your model.
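As a rough illustration of the EFFECTPLOT suggestion, here is a minimal sketch. It assumes ODS Graphics is available, reuses the data set and variable names from the original program, and leaves out the elided comorbidity and medication indicators for brevity. The choice of age as the X variable and group as the slicing variable is only an example, not a recommendation.

```sas
ods graphics on;

proc genmod data=input_data;
   class group sex;
   /* Reduced model for illustration only: the comorbidity and       */
   /* medication indicators from the original program are omitted.   */
   model cost = group age observation_time sex CCI_Score
         / dist=gamma link=log;

   /* Fitted cost versus age, with confidence limits for the mean.   */
   effectplot fit(x=age) / clm;

   /* One fitted curve per treatment group, sliced from the          */
   /* predicted response surface instead of projected onto cost.     */
   effectplot slicefit(x=age sliceby=group);
run;
```

In plots like these, the covariates that are not displayed are held at fixed reference values, so each curve is a slice through the fitted surface rather than a projection of all observations onto one axis.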
> is there any way to improve the model fit?
We don't really have enough information to answer that question. Two possible approaches are:
1. You can adopt a model-building approach in which you incrementally build up the model based on domain-specific knowledge and looking at the fit statistics. You might be missing interaction terms or nonlinear terms in the model.
2. You can adopt a "shotgun" approach and use PROC GLMSELECT or PROC HPGENSELECT to select the model effects that best fit the data. If you choose to use variable selection, you should consider using cross-validation to avoid overfitting the data. If you aren't familiar with the model selection procedures, here are two references:
"Statistical Model Building for Large, Complex Data: Five New Directions in SAS/STAT® Software"
"Introducing the HPGENSELECT Procedure: Model Selection for Generalized Linear Models and More"
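To make the second approach more concrete, here is a hedged sketch of what a selection run with PROC HPGENSELECT might look like. The candidate effects are taken from the original program, the group*age interaction is a purely hypothetical example of a missing term, and the choice of stepwise selection with the SBC criterion is an assumption for illustration rather than advice for these data.

```sas
proc hpgenselect data=input_data;
   class group sex;
   /* Candidate effects: main effects from the original program plus */
   /* one hypothetical interaction (group*age). The elided           */
   /* comorbidity and medication indicators would be listed here too.*/
   model cost = group age observation_time sex CCI_Score group*age
         / dist=gamma link=log;

   /* Stepwise selection; the final model is the step with the best  */
   /* SBC. Other criteria and methods are described in the           */
   /* procedure documentation.                                       */
   selection method=stepwise(choose=sbc);
run;
```

An information criterion such as SBC penalizes model size but is not a substitute for the cross-validation mentioned above. PROC GLMSELECT offers cross-validation through its CVMETHOD= option, although it fits ordinary linear models, so it would have to be applied to a transformed cost variable.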
Hello Rick_SAS
thanks for your answer and the helpful hints.
It might be that there are interactions between our independent variables. We have to check this.
We do have one request:
Could you please give the correct link for your first reference, if possible?
At the moment it leads to the same paper as your second reference.
Thank you very much
sasstats
DONE