Dear community,
I am converting from hypothesis testing to predictive modeling, and wonder whether I can use SAS to assess the contribution of a series of fairly strongly correlated, continuous variables (IV1-IV10) to predict a dichotomous variable (DICH).
Is there a way SAS can automatically output the R-square for each model, as each variable is added?
model 1 DICH = IV1
model 2 DICH = IV1+IV2
model 3 DICH = IV1+IV2+IV3
I understand that the ridge function might be used in SAS for that purpose, but I am so new to this that I am having trouble finding where to start. Any suggestion would be greatly appreciated.
Eman
You would have to run PROC LOGISTIC with the RSQUARE option in the MODEL statement. Note: this outputs a generalized R-squared value, not an actual R-squared value, which doesn't really exist in the case of logistic regression.
In order to do this, you would have to run PROC LOGISTIC 10 times, once for each model, save the results and compare them. This seems like something that could be done via a macro, or via brute force.
This doesn't take into account the strong correlations you say exist between the X variables. To predict in that case, I would recommend logistic Partial Least Squares modeling, which is quite robust to strong correlations between the X variables. Sadly, there is no SAS PROC to handle this, although the algorithm is available at https://cedric.cnam.fr/fichiers/RC906.pdf and I believe it has been programmed in R.
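A minimal sketch of the macro approach described above, under some assumptions: the data set name mydata, the macro name nested_rsq, and the work data sets _rsq1, _rsq2, ... and rsq_all are hypothetical, and DICH is assumed to be coded 0/1 with 1 as the event. It fits the nested models in the order the variables are listed and stacks the generalized R-square tables. It does nothing about the correlations among the predictors.

```sas
/* Hypothetical macro: fit DICH = IV1, DICH = IV1 IV2, ... and collect */
/* the generalized R-square table from each fit.                       */
%macro nested_rsq(data=, y=, vars=);
  %local i n cur;
  %let n = %sysfunc(countw(&vars, %str( )));
  %let cur = ;

  %do i = 1 %to &n;
    /* Add the next variable to the running predictor list */
    %let cur = &cur %scan(&vars, &i, %str( ));

    ods exclude all;                      /* suppress printed output   */
    proc logistic data=&data;
      /* event='1' assumes a 0/1 coding of the response */
      model &y(event='1') = &cur / rsquare;
      ods output RSquare=_rsq&i;          /* save the R-square table   */
    run;
    ods exclude none;

    /* Tag each saved table with the step and the predictors used */
    data _rsq&i;
      set _rsq&i;
      length model_terms $200;
      step = &i;
      model_terms = "&cur";
    run;
  %end;

  /* Stack the per-model tables into one data set */
  data rsq_all;
    set _rsq1-_rsq&n;
  run;

  proc print data=rsq_all noobs;
  run;
%mend nested_rsq;

%nested_rsq(data=mydata, y=DICH,
            vars=IV1 IV2 IV3 IV4 IV5 IV6 IV7 IV8 IV9 IV10);
```

Keep in mind that with strongly correlated predictors, the increase in R-square credited to each variable depends heavily on the order in which the variables enter.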
It sounds like your basic goal is to assess the relative importance of the ten candidate predictors. You can do this with several statistics as described and illustrated in this note. One easy way is with a single run of PROC ADAPTIVEREG which can fit the logistic model and provide a variable importance table based on a generalized cross-validation statistic. Or you can use the PCORR option in PROC LOGISTIC which provides partial correlations for each model parameter. Or, use the RsquareV macro to obtain squared partial correlations. Since you are using the fitted full model to obtain these statistics, the effect of each variable is adjusted for the effects of the other variables.
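As a rough illustration of the single-run PROC ADAPTIVEREG approach mentioned above, something like the following could be a starting point. The data set name mydata and the 0/1 coding of DICH are assumptions, so check the options against the PROC ADAPTIVEREG documentation for your release.

```sas
/* Sketch only: mydata is a hypothetical data set name, and DICH is */
/* assumed to be coded 0/1 with 1 as the event of interest.         */
proc adaptivereg data=mydata;
  /* dist=binomial requests a logistic-type model for the binary    */
  /* response; the variable importance table is part of the output  */
  model DICH(event='1') = IV1-IV10 / dist=binomial;
run;
```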
How does any of this address strong correlations between the independent variables?
Thank you all for your responses.
PaigeMiller: I gather from your first reply that SAS does not offer a correction for multicollinearity; if so, I may have to switch to R for this analysis.
@emaneman wrote:
PaigeMiller: I gather from your first reply that SAS does not offer a correction for multicollinearity; if so, I may have to switch to R for this analysis.
I guess it depends on what you mean by "correction for multicollinearity". In the logistic case, SAS does offer stepwise regression, which I think is a very dangerous procedure that misleads more than it informs, but you could in some sense call it a "correction for multicollinearity". I can't recommend stepwise regression; in fact, I recommend you avoid it like the plague, but not everyone agrees.
It's possible that people have written their own macros to handle multicollinearity in the logistic case, so you may want to search for them. I have written such a macro, following the paper I linked to, but the macro is not publicly available, and I don't think my employer would want me to share it.
In the non-logistic case of continuous Y variables, SAS does offer the lasso in PROC GLMSELECT, VIFs in PROC REG, and partial least squares in PROC PLS, which are indeed ways of diagnosing or handling multicollinearity.
I suppose in the logistic case, you could still have PROC REG compute the VIFs, which do not depend on the fact that in the logistic case the Y values are 0 or 1, and will give you some idea of the impact of multicollinearity in the logistic case.
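For example, a minimal sketch of the VIF check, assuming a hypothetical data set named mydata. The regression fit itself is not of interest here, only the collinearity diagnostics, which depend on the X variables alone.

```sas
/* Sketch only: mydata is a hypothetical data set name.              */
/* DICH is used as the response just so PROC REG has one; the VIF    */
/* and tolerance values are computed from the predictors only.       */
proc reg data=mydata;
  model DICH = IV1-IV10 / vif tol;
run;
quit;
```

A common rule of thumb treats VIF values above about 10 as a sign of serious collinearity, though that cutoff is a convention rather than a formal test.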
Thank you again. I have read up on these procs, and they seem to do what I need.
All the best,
eman
If instead of variable importance, you specifically want to evaluate the question of collinearity, then you can use the method illustrated in this note to get statistics to assess it.
Thank you!