mmaccora
Obsidian | Level 7
Hi,

In SAS Enterprise Miner, I trained a logistic regression with forward selection and the AIC criterion.

I grouped rare levels of the categorical variables. One of these variables was selected by the algorithm, but none of its category coefficients was statistically significant (different from 0).

Why would the algorithm select such a variable if none of its categories is significant? Does anyone know a scientific explanation?

Thank you for your help,
Marco
1 ACCEPTED SOLUTION

Accepted Solutions
PaigeMiller
Diamond | Level 26

@mmaccora wrote:
Hi,

In SAS Enterprise Miner, I trained a logistic regression with forward selection and the AIC criterion.

I grouped rare levels of the categorical variables. One of these variables was selected by the algorithm, but none of its category coefficients was statistically significant (different from 0).

Why would the algorithm select such a variable if none of its categories is significant? Does anyone know a scientific explanation?

Thank you for your help,
Marco

The stepwise (or forward selection) algorithm selects the variable (or variables) that improve the model the most. That is not the same as, and is completely unrelated to, the coefficients of all categories being significant.
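To make this concrete: with the AIC criterion, a candidate variable enters when the AIC drops, i.e. when twice the gain in log-likelihood exceeds 2k, where k is the number of extra parameters (dummy coefficients) the variable contributes. Under the null hypothesis, that likelihood-ratio statistic is chi-square distributed with k degrees of freedom, so AIC's entry rule corresponds to an implied p-value threshold far looser than 0.05. A minimal stdlib-only Python sketch (my own illustration, not part of the original replies) computes that implied threshold:

```python
import math

def chisq_sf(x, df):
    """Chi-square survival function P(X > x), for df = 1 or even df (stdlib only)."""
    if df == 1:
        # 1-df tail probability via the complementary error function
        return math.erfc(math.sqrt(x / 2.0))
    if df % 2 == 0:
        # closed form for even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        half = x / 2.0
        return math.exp(-half) * sum(half ** i / math.factorial(i)
                                     for i in range(df // 2))
    raise ValueError("this sketch handles df = 1 or even df only")

# AIC admits a variable with k extra parameters when its
# likelihood-ratio chi-square exceeds 2*k; the implied p-value is:
for k in (1, 2, 4):
    print(f"k = {k}: implied entry p-value ~ {chisq_sf(2.0 * k, k):.3f}")
# k = 1 gives about 0.157, k = 2 about 0.135, k = 4 about 0.092,
# all far looser than a 0.05 significance test.
```

Note also that the 0.93 p-values reported in this thread are Wald tests of each dummy coefficient against the reference level; a variable can improve the joint fit (and hence the AIC) even when every individual contrast looks weak.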

 

But as long as I'm explaining, I will also explain that forward selection (and all stepwise algorithms) are highly discredited methods; they work poorly and produce worse results than other methods. If you go to your favorite search engine and search for "problems with stepwise regression", you will find more material on this subject than you can possibly read.

 

Advice: just because you can use forward selection doesn't mean you should. A technique that produces a better model ("better" meaning smaller root mean square error of the predicted values and of the estimated coefficients) is Partial Least Squares regression. Reference: http://asq.org/qic/display-item/index.html?item=13552

--
Paige Miller


7 REPLIES
Reeza
Super User

What were the p-values?

Reeza
Super User

Then I suspect the p-values are less than the p-value cutoff that was specified in the restrictions.


mmaccora
Obsidian | Level 7
The p-values for the categories were around 0.93
Reeza
Super User

You said earlier it was 0.05?

 

If you can post the parameter estimates table, that may help.

But it may be that this was the model with the best AIC, so the significance of the individual variables isn't considered. Remember, the 0.05 cutoff is an arbitrary measure, but I'm surprised that a categorical variable with p-values of 0.93 would be included.

mmaccora
Obsidian | Level 7
To sum up, the test level is 0.05, but the p-values for the categories are around 0.93.

