Re: Predictive Modeling Using Logistic Regression
Would it be possible to clarify why the presence of redundant inputs may increase the risk of overfitting (see page 3-34 of the course text)?
Accepted Solutions
Hi @pvareschi
Redundant variables make the model more complex than it needs to be, since they increase the number of predictors without adding new information. More complex models are more prone to overfitting, because the extra inputs give the model more freedom to "learn" noise in the training sample rather than the real-world signal.
Best,
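A quick way to see this effect is to compare the train/test accuracy gap of a logistic regression fit on one informative input versus the same input plus many noisy near-copies of it. This is a minimal sketch in Python with NumPy and scikit-learn; the data, seed, and noise level are simulated purely for illustration, and the exact gap will vary from run to run.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 1))                          # one informative input
y = (x[:, 0] + rng.normal(size=n) > 0).astype(int)   # noisy binary target

# Redundant inputs: 20 noisy near-copies of the same informative input
X_redundant = np.hstack([x, x + rng.normal(scale=0.1, size=(n, 20))])

def train_test_gap(X, y):
    """Train-set accuracy minus test-set accuracy (larger gap = more overfit)."""
    tr, te = slice(0, n // 2), slice(n // 2, n)
    model = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    return model.score(X[tr], y[tr]) - model.score(X[te], y[te])

gap_simple = train_test_gap(x, y)
gap_redundant = train_test_gap(X_redundant, y)
```

With enough redundant copies, the model can exploit the small differences among them to fit noise in the training half, which tends to widen the gap between training and test accuracy.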
👍 Thank you!
Redundant variables can also cause the regression coefficients to swing wildly, to the point where some wind up with the wrong sign. This leads to unstable models and coefficients that are not interpretable.
In more statistical terms, high correlation among the predictor variables inflates the variance of the coefficient estimates, meaning the estimates can fall far from the true values.
The above holds true for most modeling techniques. It does not hold for Partial Least Squares, which can be used in the presence of redundant variables and is much less susceptible to these issues.
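Here's a small simulation of that variance inflation (a sketch in Python/NumPy; the true coefficients, correlation level, and sample size are made-up values for illustration): the same two-predictor model is fit on many simulated datasets, once with independent predictors and once with nearly redundant ones, and the spread of the first coefficient estimate is compared.

```python
import numpy as np

rng = np.random.default_rng(1)

def coef_spread(rho, reps=200, n=50):
    """Std. dev. of the first OLS coefficient across simulated datasets."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    coefs = []
    for _ in range(reps):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = X[:, 0] + X[:, 1] + rng.normal(size=n)   # true coefficients: 1, 1
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        coefs.append(beta[0])
    return float(np.std(coefs))

sd_indep = coef_spread(rho=0.0)    # independent predictors
sd_corr = coef_spread(rho=0.99)    # nearly redundant predictors
```

With correlation 0.99 the coefficient estimates scatter far more widely around the true value of 1, which is exactly why individual coefficients can flip sign from sample to sample.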
Paige Miller
I highly recommend that you reduce redundancy among your predictor variables before dealing with predictors that are irrelevant to the target variable. Including redundant variables increases the risk of overfitting because the model becomes overly complex: it may be too sensitive to peculiarities of the training sample and therefore will not generalize well to new data. Variable selection methods such as stepwise and backward selection will also perform poorly if there is a high degree of multicollinearity among your predictor variables.
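One common way to quantify redundancy before running stepwise or backward selection is the variance inflation factor (VIF): regress each predictor on all the others and compute 1/(1 - R²). Below is a hand-rolled sketch in Python/NumPy; the helper name `vif`, the simulated data, and the "VIF > 5" rule of thumb are illustrative assumptions, not part of the course text.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n x p)."""
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])    # intercept + other predictors
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / max(1e-12, 1.0 - r2))       # VIF_j = 1 / (1 - R^2_j)
    return np.array(out)

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly redundant with x1
x3 = rng.normal(size=n)                   # independent predictor
vifs = vif(np.column_stack([x1, x2, x3]))
```

A common rule of thumb flags predictors with VIF above 5 or 10 as candidates for removal (or combination) before any variable selection is run.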