James18
Obsidian | Level 7

Hello,

 

I am working on variable selection using a purposeful modeling strategy (rather than stepwise) and could use some guidance on which procedure would best fit my dataset and produce accurate estimates. In addition, I am using an effect modifier in my dataset and am adding covariates to the model one at a time to see whether the addition of a new covariate improves the model fit (using the AIC or -2 log likelihood value). Here is a little information about my data.

 

Predictor (y): count data (cannot take on the value of non-negative integers)

Main exposure: binary 

Effect modifier: Categorical

8 potential covariates (all categorical)
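
To make the one-at-a-time comparison concrete, here is a rough sketch of two successive fits (cov1 and cov2 are only placeholder names for two of the candidate covariates, and I have shown PROC GENMOD, though the same idea applies to HPGENSELECT; the distribution and link are what I am asking about below):

proc genmod data=work.example;
   class X_variable Effect_modifier cov1;
   model Y_variable = X_variable Effect_modifier X_variable*Effect_modifier cov1
         / dist=poisson link=log;
run;

proc genmod data=work.example;
   class X_variable Effect_modifier cov1 cov2;
   model Y_variable = X_variable Effect_modifier X_variable*Effect_modifier cov1 cov2
         / dist=poisson link=log;
run;

Each step prints the AIC and log likelihood in the fit-criteria table, and I compare those values across the two fits.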

 

1) Is PROC HPGENSELECT the appropriate procedure if I only have around 350 observations?

If not, should I use the PROC GENMOD procedure for variable selection instead?

 

2) If I can use the HPGENSELECT procedure, do I need to specify dist= poisson and link= identity to produce more accurate estimates?

I ran two different models to see how the AIC would change, and the values were drastically different when I specified that the distribution is Poisson.

 

Model 1: AIC = 2778.45

proc hpgenselect data=work.example;
   class X_variable Effect_modifier;
   model Y_variable = X_variable Effect_modifier X_variable*Effect_modifier / cl;
run;

 

Model 2: AIC = 4650.43

proc hpgenselect data=work.example;
   class X_variable Effect_modifier;
   model Y_variable = X_variable Effect_modifier X_variable*Effect_modifier / cl dist=poisson link=identity;
run;

I would like to note that the predictor is non-normally distributed (skewed right), but homoscedasticity and linearity are not violated.

 

3) Lastly, I originally specified X_variable with a reference option (ref = XXX), but the estimates did not seem correct. Would it be more appropriate to leave the CLASS parameterization at its default (GLM)?

 

 

Thank you

12 REPLIES
StatDave
SAS Super FREQ
I assume that the y variable you describe as a count "predictor" is actually the response (dependent) variable and is the Y_variable variable in your PROC steps. If it is a count variable, I assume that its values *are* non-negative integers, rather than not as you stated. Given that, an appropriate response distribution is a discrete distribution like the Poisson or negative binomial. The normal distribution is continuous, symmetric, and can take negative values, so it is not strictly correct, though it might be a reasonable approximation in some cases.

HPGENSELECT, GENMOD, GLIMMIX, and others can fit models using the Poisson and negative binomial distributions, so any of them can be used.

The reference value you use for a CLASS variable is a matter of convenience and does not affect the overall model fit. The parameters for non-reference levels are the difference in effect between each non-reference level and the reference level.
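
For example, a negative binomial fit in GENMOD with an explicit reference level might look like the following sketch (the REF= value here is just a placeholder; any level can be chosen without changing the overall fit):

proc genmod data=work.example;
   class X_variable(ref='0') Effect_modifier;
   model Y_variable = X_variable Effect_modifier X_variable*Effect_modifier
         / dist=negbin link=log type3;
run;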
James18
Obsidian | Level 7

I appreciate the feedback! I have a follow-up question. My outcome is based on a Likert scale from a questionnaire, and the values from each question were added together (what we call the sum Likert score). Would it make sense in this situation to use Poisson regression within the PROC GENMOD procedure, since the outcome is technically discrete (cannot take on negative values, and the highest possible value is fixed)?

 

Any feedback would be appreciated!

StatDave
SAS Super FREQ
Poisson, generalized Poisson, or negative binomial might be reasonable since your response is positive integers. If your sum values are relatively large and tend to be distributed symmetrically in a given population, it might be fine to just use the normal distribution.
James18
Obsidian | Level 7

Thank you! My predictor is right-skewed, and the sum value doesn't get much larger than 70 across the 20 questions (each scaled 1 to 5 on the Likert scale), so it seems reasonable to fit a Poisson model.

James18
Obsidian | Level 7

Quick update: 

I modeled the data using Poisson regression, which seemed appropriate, but there was one major concern: severe overdispersion, because the mean of the predictor (approx. 10) was not even close to the variance (approx. 110).

 

Because of this, I decided to use a negative binomial distribution as suggested. Additionally, I used the Pearson chi-square value to test whether this model was a good fit. The model was an excellent fit (p = 0.988), but I am now puzzled about the interpretation of the results.
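
For anyone reading along, this is roughly how I checked the overdispersion (y_variable is my response variable):

proc means data=work.source n mean var;
   var y_variable;
run;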

 

Question: In my PROC GENMOD procedure, I used link=identity instead of the default link=log, as my predictor is continuous (but based upon count data). Would it be appropriate to interpret the estimates on the original scale, as I normally would, because I used link=identity, or would the interpretation still need to be on a log scale because I used a negative binomial distribution? This may be a difficult question to answer with the limited information.

 

This is similar to the code I used, for anyone else who is interested:

proc genmod data=work.source;
   class x_variable (ref='reference_group');
   model y_variable = x_variable / cl dist=negbin link=identity type3;
run;

/* p-value for the Pearson chi-square goodness-of-fit statistic (value 361 on 425 df) */
data test;
   pval = 1 - probchi(361, 425);
run;

 

StatDave
SAS Super FREQ
The choice of the link function should not have anything to do with the nature of your predictor. But again, I wonder if you are using the word "predictor" to actually refer to your response (dependent variable), y_variable. Predictors in a model are the independent variables - x_variable in your case. Since x_variable is in the CLASS statement, you have declared your predictor variable as categorical. If your response, y_variable, is a count, then it is discrete, not continuous.

For a discrete, count response, it is most typical to use the log link, which ensures that the predicted mean values from the model are positive values. Use of the identity link could result (depending on the estimated parameters) in negative predicted mean values, which would not be valid. But that is not to say that the identity link wouldn't work adequately for some data.

If you use the log link, then your model is log(mean(y))=a+b_i*x_i. If you use the identity link, then the model is mean(y)=a+b_i*x_i, essentially like a model fit by PROC GLM for a normally-distributed response - the log scale is not involved.
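
As a sketch of the log-link version using your variable names (shown only to illustrate the interpretation):

proc genmod data=work.source;
   class x_variable (ref='reference_group');
   /* log link: log(mean(y)) = a + b_i, so exp(b_i) compares each level */
   /* to the reference level as a ratio of means                        */
   model y_variable = x_variable / cl dist=negbin link=log type3;
run;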
James18
Obsidian | Level 7

You are correct! I made the mistake of referring to the response (dependent variable) as the predictor. Thank you for catching that.

 

This helps tremendously. This is a difficult concept to grasp, but it makes more sense now. So, technically, it would not be appropriate to say a negative binomial distribution was used if the link=identity function was used? A more accurate model would use the link=log function, and I am assuming ESTIMATE statements could be used to exponentiate comparisons between groups of the predictor (x_variable) to allow for easier interpretation. I added the ESTIMATE statement as:

 

estimate 'X_variable' x_variable 1 0 / exp;

But I must have encountered another problem, as the results are nonestimable.

[Screenshot: ESTIMATE statement output flagged as nonestimable]

 

StatDave
SAS Super FREQ
The distribution used is determined by the DIST= option setting, not by the LINK= setting. Various links can be used with any given distribution, but the default is generally used because it avoids most problems when fitting the model. But there is no guarantee that a particular link will be more accurate than a different link. Regarding comparisons, don't use the ESTIMATE statement when easier statements, like the LSMEANS statement, can be used. In your case:
lsmeans x_variable / ilink cl;
The ILINK option adds the Mean column in the resulting table and this contains the estimate of the mean of the negative binomial for each level of x_variable. CL gives confidence limits.
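For example, added to your earlier step (a sketch using the variable names from your code):

proc genmod data=work.source;
   class x_variable (ref='reference_group');
   model y_variable = x_variable / cl dist=negbin link=log type3;
   lsmeans x_variable / ilink cl;   /* Mean column = estimated negative binomial mean for each level */
run;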
James18
Obsidian | Level 7

You deserve a raise, sir. Thank you!

 

This will be very useful in the future. Additionally, I found an article that makes an argument for using parametric tests even for count data that is not normally distributed, which bears on how these results could be interpreted (PROC GENMOD with a normal distribution vs. PROC GENMOD with a negative binomial distribution). It is always difficult to determine what the best-fitting model should be.

 

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3886444/

 

James18
Obsidian | Level 7

I have one final question that may be difficult to answer. 

I would like to display the results from the DISTPLOT in the LSMEANS statement on the exponentiated (mean) scale rather than the log scale, after fitting the model with a negative binomial distribution. Is there a way to do this within SAS with ODS Graphics?

 

Code:

proc genmod data=work.example;
   class x_variable Interacting_variable;
   model sumlikert = x_variable Interacting_variable x_variable*Interacting_variable / cl dist=negbin link=log type3;
   lsmeans x_variable*Interacting_variable / ilink cl plot=distplot;
run;


 

 

StatDave
SAS Super FREQ

You just need to save the LSMEANS table by adding an ODS OUTPUT statement like:

   ods output lsmeans=lsm;

Print the LSM data set to see the variable names for the columns in the table, and construct a variable (COMBO) that combines the levels of your two predictors. Then you can construct the plot as you want it using PROC SGPLOT. For example 

data lsm;
   set lsm;
   /* combine the levels of the two CLASS variables into one axis variable */
   combo = catx("_", x_variable, interacting_variable);
run;

proc sgplot data=lsm noautolegend;
   /* vertical bars from LowerMu to UpperMu (the ILINK-scale confidence limits) */
   highlow x=combo high=uppermu low=lowermu / highcap=serif lowcap=serif;
   /* points at Mu, the LS-mean on the mean (data) scale */
   scatter y=mu x=combo;
   yaxis grid label="LS-Means (mean scale)";
   title "LS-Means on mean scale";
   title2 "With 95% Confidence Limits";
run;
StatDave
SAS Super FREQ
Actually, the easier solution is to use the MEANPLOT plot type, which applies more directly to plotting means and LS-means. So, if you specify PLOT=MEANPLOT(ILINK) in the LSMEANS statement, you should get what you want, essentially the same as the SGPLOT approach.
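
In your step above, that would look something like this (a sketch):

lsmeans X_variable*Interacting_variable / ilink cl plot=meanplot(ilink);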

