James18
Obsidian | Level 7

Hello,

 

I am working on variable selection using a purposeful modeling strategy (rather than stepwise) and could use some guidance on which procedure would best fit my dataset and produce accurate estimates. In addition, I am using an effect modifier in my dataset and am adding covariates to the model one at a time to see whether the addition of a new covariate improves the model fit (using the AIC or -2 log likelihood value). Here is a little information about my data.

 

Predictor (y): count data (cannot take on the value of non-negative integers)

Main exposure: binary 

Effect modifier: Categorical

8 potential covariates (all categorical)
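
To make the one-at-a-time comparison concrete, here is a rough sketch of two successive fits (cov1 and cov2 are only placeholder names for two of the candidate covariates, and I have shown PROC GENMOD, though the same idea applies to HPGENSELECT; the distribution and link are what I am asking about below):

proc genmod data=work.example;
   class X_variable Effect_modifier cov1;
   model Y_variable = X_variable Effect_modifier X_variable*Effect_modifier cov1
         / dist=poisson link=log;
run;

proc genmod data=work.example;
   class X_variable Effect_modifier cov1 cov2;
   model Y_variable = X_variable Effect_modifier X_variable*Effect_modifier cov1 cov2
         / dist=poisson link=log;
run;

Each step prints the AIC and log likelihood in the fit-criteria table, and I compare those values across the two fits.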

 

1) Is PROC HPGENSELECT the appropriate procedure if I only have around 350 observations?

If not, should I use the PROC GENMOD procedure for variable selection instead?

 

2) If I can use the HPGENSELECT procedure, do I need to specify dist= poisson and link= identity to produce more accurate estimates?

I ran two different models to see how the AIC would change, and the values were drastically different when I specified that the distribution is Poisson.

 

Model 1: AIC = 2778.45

proc hpgenselect data=work.example;
   class X_variable Effect_modifier;
   model Y_variable = X_variable Effect_modifier X_variable*Effect_modifier / cl;
run;

 

Model 2: AIC = 4650.43

proc hpgenselect data=work.example;
   class X_variable Effect_modifier;
   model Y_variable = X_variable Effect_modifier X_variable*Effect_modifier / cl dist=poisson link=identity;
run;

I would like to note that the predictor is non-normally distributed (skewed right), but homoscedasticity and linearity are not violated.

 

3) Lastly, I originally specified X_variable with a reference option (ref = XXX), but the estimates did not seem correct. Would it be more appropriate to leave the CLASS parameterization at its default (GLM)?

 

 

Thank you

12 REPLIES
StatDave
SAS Super FREQ
I assume that the y variable you describe as a count "predictor" is actually the response (dependent) variable and is the Y_variable variable in your PROC steps. If it is a count variable, I assume that its values *are* non-negative integers, rather than not as you stated. Given that, an appropriate response distribution is a discrete distribution like the Poisson or negative binomial. The normal distribution is continuous, symmetric, and can take negative values, so it is not strictly correct, though it might be a reasonable approximation in some cases.

HPGENSELECT, GENMOD, GLIMMIX, and others can fit models using the Poisson and negative binomial distributions, so any of them can be used.

The reference value you use for a CLASS variable is a matter of convenience and does not affect the overall model fit. The parameters for non-reference levels are the difference in effect between each non-reference level and the reference level.
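
For example, a negative binomial fit in GENMOD with an explicit reference level might look like the following sketch (the REF= value here is just a placeholder; any level can be chosen without changing the overall fit):

proc genmod data=work.example;
   class X_variable(ref='0') Effect_modifier;
   model Y_variable = X_variable Effect_modifier X_variable*Effect_modifier
         / dist=negbin link=log type3;
run;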
James18
Obsidian | Level 7

I appreciate the feedback! I have a follow-up question. My outcome is based on a Likert scale from a questionnaire, and the values from each question were added together (what we call the sum Likert score). Would it make sense in this situation to use Poisson regression within the PROC GENMOD procedure, since the outcome is technically discrete (cannot take on negative values, and the highest possible value is fixed)?

 

Any feedback would be appreciated!

StatDave
SAS Super FREQ
Poisson, generalized Poisson, or negative binomial might be reasonable since your response is positive integers. If your sum values are relatively large and tend to be distributed symmetrically in a given population, it might be fine to just use the normal distribution.
James18
Obsidian | Level 7

Thank you! My predictor is right-skewed, and the sum value doesn't get much larger than 70 across the 20 questions (each scaled 1 to 5 on the Likert scale), so it seems reasonable to fit a Poisson model.

James18
Obsidian | Level 7

Quick update: 

I modeled the data using Poisson regression, which seemed appropriate, but there was one major concern: severe overdispersion, because the mean of the predictor (approx. 10) was not even close to the variance (approx. 110).

 

Because of this, I decided to use a negative binomial distribution as suggested. Additionally, I used the Pearson chi-square value to test whether this model was a good fit. The model was an excellent fit (p = 0.988), but I am now puzzled about the interpretation of the results.
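
For anyone reading along, this is roughly how I checked the overdispersion (y_variable is my response variable):

proc means data=work.source n mean var;
   var y_variable;
run;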

 

Question: In my PROC GENMOD procedure, I used link=identity instead of the default link=log, as my predictor is continuous (but based upon count data). Would it be appropriate to interpret the estimates on the original scale, as I normally would, because I used link=identity, or would the interpretation still need to be on a log scale because I used a negative binomial distribution? This may be a difficult question to answer with the limited information.

 

This is similar to the code I used, for anyone else who is interested:

proc genmod data=work.source;
   class x_variable (ref='reference_group');
   model y_variable = x_variable / cl dist=negbin link=identity type3;
run;

/* p-value for the Pearson chi-square goodness-of-fit statistic (value 361 on 425 df) */
data test;
   pval = 1 - probchi(361, 425);
run;

 

StatDave
SAS Super FREQ
The choice of the link function should not have anything to do with the nature of your predictor. But again, I wonder if you are using the word "predictor" to actually refer to your response (dependent variable), y_variable. Predictors in a model are the independent variables - x_variable in your case. Since x_variable is in the CLASS statement, you have declared your predictor variable as categorical. If your response, y_variable, is a count, then it is discrete, not continuous.

For a discrete, count response, it is most typical to use the log link, which ensures that the predicted mean values from the model are positive values. Use of the identity link could result (depending on the estimated parameters) in negative predicted mean values, which would not be valid. But that is not to say that the identity link wouldn't work adequately for some data.

If you use the log link, then your model is log(mean(y))=a+b_i*x_i. If you use the identity link, then the model is mean(y)=a+b_i*x_i, essentially like a model fit by PROC GLM for a normally-distributed response - the log scale is not involved.
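
As a sketch of the log-link version using your variable names (shown only to illustrate the interpretation):

proc genmod data=work.source;
   class x_variable (ref='reference_group');
   /* log link: log(mean(y)) = a + b_i, so exp(b_i) compares each level */
   /* to the reference level as a ratio of means                        */
   model y_variable = x_variable / cl dist=negbin link=log type3;
run;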
James18
Obsidian | Level 7

You are correct! I made the mistake of referring to the response (dependent variable) as the predictor. Thank you for catching that.

 

This helps tremendously. This is a difficult concept to grasp, but it makes more sense now. So, technically, it would not be appropriate to say a negative binomial distribution was used if the link=identity function was used? A more accurate model would use the link=log function, and I am assuming ESTIMATE statements could be used to exponentiate comparisons between groups of the predictor (x_variable) to allow for easier interpretation. I added the ESTIMATE statement as:

 

estimate 'X_variable' x_variable 1 0 / exp;

But I must have encountered another problem, as the results are nonestimable.

[Screenshot: ESTIMATE statement output flagged as nonestimable]

 

StatDave
SAS Super FREQ
The distribution used is determined by the DIST= option setting, not by the LINK= setting. Various links can be used with any given distribution, but the default is generally used because it avoids most problems when fitting the model. But there is no guarantee that a particular link will be more accurate than a different link. Regarding comparisons, don't use the ESTIMATE statement when easier statements, like the LSMEANS statement, can be used. In your case:
lsmeans x_variable / ilink cl;
The ILINK option adds the Mean column in the resulting table and this contains the estimate of the mean of the negative binomial for each level of x_variable. CL gives confidence limits.
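For example, added to your earlier step (a sketch using the variable names from your code):

proc genmod data=work.source;
   class x_variable (ref='reference_group');
   model y_variable = x_variable / cl dist=negbin link=log type3;
   lsmeans x_variable / ilink cl;   /* Mean column = estimated negative binomial mean for each level */
run;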
James18
Obsidian | Level 7

You deserve a raise, sir. Thank you!

 

This will be very useful in the future. Additionally, I found an article that makes an argument for using parametric tests even for count data that is not normally distributed, which bears on how these results could be interpreted (PROC GENMOD with a normal distribution vs. PROC GENMOD with a negative binomial distribution). It is always difficult to determine what the best-fitting model should be.

 

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3886444/

 

James18
Obsidian | Level 7

I have one final question that may be difficult to answer. 

I would like to display the results from the DISTPLOT in the LSMEANS statement on the exponentiated (mean) scale rather than the log scale, after fitting the model with a negative binomial distribution. Is there a way to do this within SAS with ODS Graphics?

 

Code:

proc genmod data=work.example;
   class x_variable Interacting_variable;
   model sumlikert = x_variable Interacting_variable x_variable*Interacting_variable / cl dist=negbin link=log type3;
   lsmeans x_variable*Interacting_variable / ilink cl plot=distplot;
run;


 

 

StatDave
SAS Super FREQ

You just need to save the LSMEANS table by adding an ODS OUTPUT statement like:

   ods output lsmeans=lsm;

Print the LSM data set to see the variable names for the columns in the table, and construct a variable (COMBO) that combines the levels of your two predictors. Then you can construct the plot as you want it using PROC SGPLOT. For example 

data lsm;
   set lsm;
   /* combine the levels of the two CLASS variables into one axis variable */
   combo = catx("_", x_variable, interacting_variable);
run;

proc sgplot data=lsm noautolegend;
   /* vertical bars from LowerMu to UpperMu (the ILINK-scale confidence limits) */
   highlow x=combo high=uppermu low=lowermu / highcap=serif lowcap=serif;
   /* points at Mu, the LS-mean on the mean (data) scale */
   scatter y=mu x=combo;
   yaxis grid label="LS-Means (mean scale)";
   title "LS-Means on mean scale";
   title2 "With 95% Confidence Limits";
run;
StatDave
SAS Super FREQ
Actually, the easier solution is to use the MEANPLOT plot type, which applies more directly to plotting means and LS-means. So, if you specify PLOT=MEANPLOT(ILINK) in the LSMEANS statement, you should get what you want, essentially the same as the SGPLOT approach.
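
In your step above, that would look something like this (a sketch):

lsmeans X_variable*Interacting_variable / ilink cl plot=meanplot(ilink);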

