Hi,
I am modeling cost (the dependent variable) for three (3) procedure groups (a categorical independent variable) and plan to present the mean predicted cost for each of the three groups. The independent variables are procedure group, comorbidity, and several sociodemographic variables. I fit a generalized linear model (proc genmod) with a gamma distribution and log link. I am new to SAS code for GLMs and am not sure whether I wrote the code below correctly. Please correct me if anything is wrong. Thanks so much!
(1) I put all categorical variables in the CLASS statement.
(2) I added type3 to the MODEL statement to test whether cost differs among the 3 procedure groups.
The output includes the table "LR Statistics For Type 3 Analysis". I plan to check the p-value for procedure_cat to see whether cost differs significantly among the 3 procedure groups. Is that correct?
(3) I used the LSMEANS statement to get the mean predicted payment for each procedure group, which is why procedure_cat follows LSMEANS.
At the bottom of the output is the table "procedure_cat Least Squares Means". I plan to report the mean for each procedure group from that table as the mean predicted cost. Is that correct?
(4) I added / ilink to the LSMEANS statement because someone suggested it when link=log is specified, but I am still not clear on why / ilink is needed.
Proc genmod data=abc.mydata;
Class procedure_cat female race_cat income_cat comorbidity_cat;
Model payment = procedure_cat age female race_cat college income_cat comorbidity_cat / dist=gamma link=log type3;
lsmeans procedure_cat / ilink;
Run;
This should all work. Note that the estimates from your LSMEANS statement will be marginal means, averaged over the levels of all the other categorical variables. You may have interactions to think about. For instance, if one level of procedure_cat were mammogram, there is a high likelihood of an interaction with the categorical variable female (assuming it is a Y/N variable). Keep in mind that marginal means computed over all categories give equal weight to each level of each of the other categorical variables, regardless of how often each level occurs in your data.
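A minimal sketch of both points, assuming the same dataset and variable names as in the question. The procedure_cat*female interaction is purely illustrative (whether it belongs in the model depends on the data), and the OM (OBSMARGINS) option on LSMEANS is one way to weight the margins by the observed frequencies of the other CLASS variables instead of weighting every level equally.

```sas
proc genmod data=abc.mydata;
  class procedure_cat female race_cat income_cat comorbidity_cat;
  /* Hypothetical interaction term, added for illustration only */
  model payment = procedure_cat female procedure_cat*female age race_cat
                  college income_cat comorbidity_cat
                  / dist=gamma link=log type3;
  /* Default margins: each level of the other CLASS variables weighted equally */
  lsmeans procedure_cat / ilink cl;
  /* OM: levels of the other CLASS variables weighted by observed frequency */
  lsmeans procedure_cat / ilink cl om;
  /* Means for each procedure-by-sex cell */
  lsmeans procedure_cat*female / ilink cl;
run;
```

If the two procedure_cat tables differ noticeably, that is a sign the equal-weight assumption matters for these data.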
Steve Denham
Thanks Steve! Your comments are very useful!
Are marginal means the usual way to present the mean predicted value (y hat), or would you suggest another way to present the results? Could I instead use the OUTPUT statement with prob= to get y hat and then calculate the mean of y hat for each procedure group, as in the following code?
Proc genmod data=abc.mydata;
Class procedure_cat female race_cat income_cat comorbidity_cat;
Model payment = procedure_cat age female race_cat college income_cat comorbidity_cat / dist=gamma link=log type3;
lsmeans procedure_cat / ilink;
output out=result prob=p;
Run;
proc means n mean stddev CLM data=result;
class procedure_cat;
var p;
run;
I also have another question about the results. In the "Analysis Of Maximum Likelihood Parameter Estimates" table, the Wald chi-square and Pr > ChiSq columns are missing for the last level of each categorical variable (and the DF for each of those last levels is 0), as in the table below. Why is that?
Parameter | Level | DF | Estimate | Standard Error | Wald 95% Lower | Wald 95% Upper | Wald Chi-Square | Pr > ChiSq
---|---|---|---|---|---|---|---|---
Intercept | | 1 | -1.3168 | 0.0903 | -1.4937 | -1.1398 | 212.73 | <.0001
car | large | 1 | -1.7643 | 0.2724 | -2.2981 | -1.2304 | 41.96 | <.0001
car | medium | 1 | -0.6928 | 0.1282 | -0.9441 | -0.4414 | 29.18 | <.0001
car | small | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | .
age | 1 | 1 | -1.3199 | 0.1359 | -1.5863 | -1.0536 | 94.34 | <.0001
age | 2 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | . | .
Scale | | 0 | 1.0000 | 0.0000 | 1.0000 | 1.0000 | |
Thanks so much for your answer!
Last question first. The estimates are set to zero because these are the reference categories (the default is the last level). An overparameterized model is fit, so the reference level of each categorical variable is fixed at zero and has no degrees of freedom or test. To get a prediction, you combine the estimates into a linear form. For instance, the estimate for a large car for age group=1 would be intercept + estimate (car large) + estimate (age 1) = -1.3168 - 1.7643 - 1.3199 = -4.401 on the linearized scale (for your main project, that would be the log scale).
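The arithmetic above can be reproduced in a short DATA step. This is only a sketch: the estimates are copied from the example table, the variable names are hypothetical, and the exponentiation step assumes a log link, as in the gamma model from the question.

```sas
data _null_;
  /* Estimates copied from the example parameter table */
  b_intercept = -1.3168;
  b_car_large = -1.7643;
  b_age_1     = -1.3199;

  /* Reference levels (car=small, age=2) would contribute 0 */
  eta = b_intercept + b_car_large + b_age_1;  /* linear predictor (link scale) */
  mu  = exp(eta);                             /* inverse of the log link */

  put eta= 8.4 mu= 8.4;                       /* written to the SAS log */
run;
```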
Now for the first question: there really shouldn't be a need to recalculate the mean prediction. With the ilink option in the lsmeans statement, you get the estimate for each level of procedure_cat back-transformed to the original scale (with a gamma distribution and log link, that is the mean payment rather than a probability). This may well differ from the value you obtain from the output dataset followed by proc means, because the model assumes the effects are additive on the log scale: the least squares mean is calculated on the log scale and then put back onto the original scale with the inverse link function. To get something close using the output dataset, I think you will have to apply a log transformation to p, get the mean, and then back-transform to the original scale.
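A rough sketch of that last suggestion, assuming the output dataset is named result and holds the predicted mean in p, as in the code from the question. The intermediate dataset and variable names are hypothetical. Even so, the numbers need not match the LSMEANS table exactly, because LSMEANS weights the levels of the other categorical variables equally while this calculation uses the observations as they occur in the data.

```sas
/* Put the predicted means on the log (link) scale */
data result_log;
  set result;
  log_p = log(p);
run;

/* Average on the log scale within each procedure group */
proc means data=result_log noprint nway;
  class procedure_cat;
  var log_p;
  output out=log_means mean=mean_log_p;
run;

/* Back-transform the log-scale means to the payment scale */
data geo_means;
  set log_means;
  mean_p_backtransformed = exp(mean_log_p);
run;

proc print data=geo_means;
  var procedure_cat mean_p_backtransformed;
run;
```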
Steve Denham
Thanks Steve! I have one more question, about the scale shown at the bottom of the model results. The output says the scale parameter was estimated by maximum likelihood, but I am not sure how to interpret the scale. Could you give me some hints? Thanks so much!
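The gamma distribution has two parameters: mu (which may be a function of covariates and treatments, and associated parameters) and a scale parameter. The scale parameter is like a standard deviation or variance (but not quite); at least it serves that purpose (a measure of variability, similar to sigma with normal data). But with the gamma distribution, the variance is a function of the mean: var(Y) = phi*(mu^2), where phi is the dispersion. Note that for the gamma distribution PROC GENMOD reports Scale as 1/phi, so in terms of the displayed value, var(Y) = (mu^2)/scale, and a larger Scale means less variability relative to the mean.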
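A small sketch of the mean-variance relationship, using made-up values for the Scale and the mean payments. It assumes var(Y) = phi*(mu^2) with phi the dispersion, and that the Scale displayed by PROC GENMOD for the gamma distribution equals 1/phi, which is how I read the GENMOD documentation; check which quantity your output reports before plugging it in.

```sas
data _null_;
  scale = 2.5;           /* hypothetical value of the displayed Scale */
  phi   = 1 / scale;     /* dispersion, assuming Scale = 1/phi        */
  cv    = sqrt(phi);     /* coefficient of variation, same for all mu */

  do mu = 1000, 5000, 20000;   /* hypothetical mean payments */
    var_y = phi * mu**2;       /* variance grows with the square of the mean */
    sd_y  = sqrt(var_y);
    put mu= var_y= sd_y= cv=;
  end;
run;
```

The standard deviation rises in proportion to the mean while the coefficient of variation stays fixed, which is the sense in which the scale acts as a measure of relative, not absolute, variability.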