Hello,
I am working on variable selection using a purposeful modeling strategy (rather than stepwise) and could use some guidance on which procedure would best fit my dataset to produce accurate estimates. In addition, I am including an effect modifier and am adding covariates to the model one at a time to see whether each addition improves the model fit (using the AIC or -2 log likelihood value). Here is a little information about my data.
Predictor (y): count data (non-negative integers only; cannot take negative values)
Main exposure: binary
Effect modifier: Categorical
8 potential covariates (all categorical)
1) Is Proc HPGENSELECT the appropriate procedure if I only have around 350 observations?
If not, should I use the Proc GENMOD procedure for variable selection instead?
2) If I can use the HPGENSELECT procedure, do I need to specify dist= poisson and link= identity to produce more accurate estimates?
I ran 2 different models to see how the AIC score would change, and they were drastically different when I specified that the distribution is poisson.
Model 1: AIC = 2778.45
Proc hpgenselect data=work.example ;
class X_variable Effect_modifier ;
model Y_variable = X_variable Effect_modifier X_Variable*Effect_modifier / cl ;
run;
Model 2: AIC = 4650.43
Proc hpgenselect data=work.example ;
class X_variable Effect_modifier ;
model Y_variable = X_variable Effect_modifier X_Variable*Effect_modifier / cl dist= poisson link= identity ;
run;
I would like to note that the predictor is not normally distributed (skewed right), but homoscedasticity and linearity are not violated.
3) Lastly, originally I specified the X_variable with a reference option (ref = XXX), but the estimates did not seem correct. Would it be more appropriate to leave the default option for class parameterization as GLM?
Thank you
I appreciate the feedback! I have a follow-up question. My outcome is based on a Likert scale from a questionnaire, and the values from each question were added together (what we call the sum likert score). Would it make sense in this situation to use Poisson regression within the Proc Genmod procedure since the outcome is technically discrete (cannot take on any negative integers and the highest value possible is established)?
Any feedback would be appreciated!
Thank you! My predictor is right skewed and the sum value doesn't get much larger than 70 across the 20 questions (scaled 1 to 5 on likert) so it seems reasonable to do a Poisson model.
Quick update:
I modeled the data using poisson regression which seemed appropriate, but there was one major concern. I had severe overdispersion because my mean for the predictor (approx. 10) was not even close to the variance (approx. 110).
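For anyone following along, a minimal sketch of how the overdispersion described above could be checked. The data set and variable names (work.source, y_variable, x_variable) are taken from the code posted later in this thread. Comparing the raw mean and variance is only a rough first look; the Value/DF ratio from the fitted Poisson model is the more direct check.

```sas
/* Rough marginal check: sample mean and variance of the outcome. */
proc means data=work.source n mean var;
   var y_variable;
run;

/* Conditional check: fit the Poisson model and read the Pearson Chi-Square
   Value/DF in the "Criteria For Assessing Goodness Of Fit" table.
   Values well above 1 suggest overdispersion. */
proc genmod data=work.source;
   class x_variable;
   model y_variable = x_variable / dist=poisson link=log;
run;
```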
Because of this, I decided to use a negative binomial distribution as suggested. Additionally, I used the Pearson chi-squared value to test whether this model was a good fit. The model was an excellent fit (p=0.988), but I am now puzzled by the interpretation of the data.
Question: In my Proc Genmod procedure, I used the link= identity function instead of the default of link=log as my predictor is continuous (but based upon count data). Would it be appropriate to interpret this normally as I would because I used the link=identity function or would the interpretation still need to be in a log scale because I used a negative binomial distribution? This may be a difficult question to answer with the limited information
This was similar to the code I used for anyone else that is interested:
proc genmod data=work.source;
class x_variable (ref= 'reference_group') ;
model y_variable = x_variable / cl dist=negbin link= identity type3 ;
run;
data test;
pval = 1 - probchi(361, 425);
run;
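As a sketch of the distinction the question above is getting at, here is how the two links differ for a single covariate $x$ (the symbols $\beta_0$, $\beta_1$, $\mu$, and $k$ are notation introduced here, not taken from the output). The link determines the scale on which the coefficients are read, while the distribution determines the variance function:

```latex
\begin{align*}
\text{identity link:}\quad & \mu = E[Y \mid x] = \beta_0 + \beta_1 x
  && \beta_1 \text{ is a difference in means} \\
\text{log link:}\quad & \log \mu = \beta_0 + \beta_1 x
  && e^{\beta_1} \text{ is a ratio of means} \\
\text{negative binomial variance:}\quad & \operatorname{Var}(Y \mid x) = \mu + k\mu^2
  && \text{under either link}
\end{align*}
```

Under this reading, estimates from link=identity would be interpreted directly on the scale of the outcome, even though the negative binomial variance is still what drives the standard errors.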
You are correct! I made the mistake of referring to the response (dependent variables) as the predictor. Thank you for catching that.
This helps tremendously. This is a difficult concept to grasp, but it makes more sense now. So, technically it would not be appropriate to say a negative binomial distribution was used if the link= identity function was used? A more accurate prediction would use a link=log function, and I am assuming estimate statements could be used to exponentiate comparison groups of the predictor (x_variable) to allow for easier interpretations. I added the estimate statement as:
estimate 'X_variable' x_variable 1 0 / exp;
But, I must have encountered another concern as the results are nonestimable.
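A likely cause, offered as a sketch rather than a diagnosis: under the default GLM parameterization, coefficients of 1 0 on x_variable alone pick out a single level without the intercept, which is not an estimable function. A comparison between levels, with coefficients that sum to zero, is estimable. The example below assumes x_variable has exactly two levels (the main exposure is described as binary) and reuses the names from the earlier code. The order of the coefficients follows the Class Level Information table.

```sas
proc genmod data=work.source;
   class x_variable (ref='reference_group');
   model y_variable = x_variable / dist=negbin link=log type3 cl;
   /* Contrast of the two levels; the coefficients sum to zero, so it is estimable. */
   estimate 'Level 1 vs level 2' x_variable 1 -1 / exp;
   /* Alternative: let LSMEANS build the pairwise difference and exponentiate it. */
   lsmeans x_variable / diff exp cl;
run;
```

With the log link, the exponentiated difference can be read as a ratio of means between the two groups.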
You deserve a raise sir. Thank you!
This will be very useful for the future. Additionally, I found an article that argues for using parametric tests even for count data that are not normally distributed, which bears on how these results could be interpreted (Proc GENMOD with a normal distribution vs. Proc GENMOD with a negative binomial distribution). It is always difficult to determine which model fits best.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3886444/
I have one final question that may be difficult to answer.
I would like to display the results from the DISTPLOT in the LSMeans statement in the exponentiated form rather than the log scale after running the results with a negative binomial distribution. Is there a way to do this within SAS with ODS Graphics?
Code:
Proc genmod data=work.example;
class x_variable Interacting_variable;
model sumlikert = x_variable interacting_variable x_variable*interacting_variable / cl dist=negbin link= log type3;
LSMEANS X_variable*interacting_variable / ilink cl plot= distplot;
run;
You just need to save the LSMEANS table by adding an ODS OUTPUT statement like:
ods output lsmeans=lsm;
Print the LSM data set to see the variable names for the columns in the table, and construct a variable (COMBO) that combines the levels of your two predictors. Then you can construct the plot as you want it using PROC SGPLOT. For example
data lsm; set lsm;
combo=catx("_",x_variable,interacting_variable);
run;
proc sgplot data=lsm noautolegend;
highlow high=uppermu low=lowermu x=combo / highcap=serif lowcap=serif;
scatter y=mu x=combo;
yaxis grid label="LS-Means (mean scale)";
title "LS-Means on mean scale";
title2 "With 95% Confidence Limits";
run;