@mmaccora wrote:
Hi,
In SAS Enterprise Miner, I trained a logistic regression with forward selection and the AIC criterion.
I grouped rare levels for categorical variables. One of these variables was selected by the algorithm, but none of its category coefficients was statistically significant (significantly different from 0).
Why would the algorithm select such a variable if none of its categories is significant? Does someone know a scientific explanation?
Thank you for your help,
Marco
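The stepwise (or forward selection) algorithm selects the variable (or variables) that improves the model the most. This is not the same as, and is completely unrelated to, the coefficients of all categories being significant. The selection criterion judges the categorical variable as a whole, across all of its levels at once, whereas each coefficient's p-value tests only a single level.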
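To make the distinction concrete, here is a minimal sketch in Python (not SAS Enterprise Miner output). It assumes a single three-level categorical predictor, reference (dummy) coding with level "A" as the baseline, and made-up counts; all names are hypothetical. For this simple case the logistic regression fit has a closed form, so no fitting library is needed.

```python
import math

# Hypothetical data: (events, non-events) per level. "A" is the reference level.
counts = {"A": (50, 50), "B": (40, 60), "C": (60, 40)}

def binom_loglik(events, nonevents):
    """Log-likelihood of binary outcomes evaluated at the fitted rate events / n."""
    p = events / (events + nonevents)
    return events * math.log(p) + nonevents * math.log(1 - p)

# Intercept-only model: one common event rate, 1 parameter.
total_events = sum(e for e, _ in counts.values())
total_nonevents = sum(n for _, n in counts.values())
loglik_null = binom_loglik(total_events, total_nonevents)
aic_null = 2 * 1 - 2 * loglik_null

# Model with the categorical variable: one fitted rate per level, 3 parameters.
loglik_full = sum(binom_loglik(e, n) for e, n in counts.values())
aic_full = 2 * len(counts) - 2 * loglik_full

# Wald test of each non-reference level against the reference level.
# With dummy coding, the coefficient is the log odds ratio versus "A".
ref_e, ref_n = counts["A"]
wald_p = {}
for level, (e, n) in counts.items():
    if level == "A":
        continue
    beta = math.log((e / n) / (ref_e / ref_n))
    se = math.sqrt(1 / e + 1 / n + 1 / ref_e + 1 / ref_n)
    z = beta / se
    wald_p[level] = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value

# Joint likelihood-ratio test of the whole variable (2 degrees of freedom).
lr_stat = 2 * (loglik_full - loglik_null)
lr_p = math.exp(-lr_stat / 2)  # chi-square survival function for exactly 2 df

print(f"AIC without variable: {aic_null:.2f}")
print(f"AIC with variable:    {aic_full:.2f}")
print("Wald p-values vs reference:", {k: round(v, 3) for k, v in wald_p.items()})
print(f"Joint LR test p-value: {lr_p:.3f}")
```

With these made-up counts, AIC drops from roughly 417.9 to 413.8 when the variable is added, and the joint test gives p of about 0.018, yet each level's Wald p-value against the reference is about 0.16. Levels "B" and "C" differ clearly from each other, but neither differs clearly from "A", so the table of coefficients looks unimpressive even though the variable as a whole improves the model. Under a different coding scheme (for example deviation coding) the individual p-values would change, while the AIC would not.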
But as long as I'm explaining, I will also explain that forward selection (and all stepwise algorithms) are highly discredited methods; they work poorly and don't have as good results as other methods. If you go to your favorite search engine and search for "problems with stepwise regression", you can read so much material on this subject that you won't be done until 2018.
Advice: just because you can use forward selection doesn't mean you should use forward selection. A technique that produces a better model (better meaning smaller root mean square error of the predicted values, and smaller root mean square error of the model coefficients) is Partial Least Squares regression. Reference: http://asq.org/qic/display-item/index.html?item=13552
What were the p-values?
Then I suspect the p-values are less than the p-value cutoff that was specified in the restrictions.
You said earlier it was 0.05?
If you can post the parameter estimates table that may help.
But it may be that this was the model with the best AIC, so the significance of the individual variables isn't considered. Remember that the cutoff of 0.05 is an arbitrary measure, but I'm surprised that a categorical variable with p-values of 0.93 would be included.
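For reference, the relationship between AIC and a significance cutoff can be written out as follows. This is a sketch assuming the standard definition of AIC and a likelihood-ratio comparison of nested models; whether Enterprise Miner applies exactly this rule at each step is an assumption, and the symbols $k$, $d$, and $G$ are introduced here only for illustration.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
\qquad\Longrightarrow\qquad
\Delta\mathrm{AIC} = 2d - G,
\quad G = 2\left(\ln\hat{L}_{\text{with}} - \ln\hat{L}_{\text{without}}\right)
```

Here $k$ is the number of estimated parameters and $d$ is the number of parameters the candidate variable adds (one fewer than its number of levels for a categorical variable). Adding the variable lowers AIC exactly when $G > 2d$. For $d = 1$ this corresponds to a likelihood-ratio p-value of about 0.157, so AIC is a more lenient rule than a 0.05 cutoff, and it is a statement about the variable as a whole, not about any one level's coefficient.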