BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
pvareschi
Quartz | Level 8

Re: Predictive Modeling Using Logistic Regression

Would it be possible to clarify why the presence of redundant inputs may increase the risk of overfitting (see page 3-34 of the course text)?

1 ACCEPTED SOLUTION

Accepted Solutions
ed_sas_member
Meteorite | Level 14

Hi @pvareschi 

The presence of redundant variables results in a more complex model than needed, as it increase the number of predictors. Complex models typically suffer from overfitting as the risk to "learn" errors increase (redundant information, which can be noise and far from a real-world setting)

Best,

View solution in original post

4 REPLIES 4
ed_sas_member
Meteorite | Level 14

Hi @pvareschi 

The presence of redundant variables results in a more complex model than needed, as it increase the number of predictors. Complex models typically suffer from overfitting as the risk to "learn" errors increase (redundant information, which can be noise and far from a real-world setting)

Best,

pvareschi
Quartz | Level 8

👍 Thank you!

PaigeMiller
Diamond | Level 26

Redundant variables also cause the regression coefficients to swing wildly in some cases, to the extent that they can wind up with the wrong sign. And this leads to unstable models, and coefficients that are not interpretable.

 

Or in somewhat more statistical terms, high correlation between the predictor variables inflates the variance of the coefficients, meaning the coefficients can vary widely from the true value.

 

The above holds true for most modeling techniques. It does not hold true for Partial Least Squares, which can be used in the presence of redundant variables and is much less susceptible to the above issues.

--
Paige Miller
sasmlp
SAS Employee

I highly recommend that you reduce redundancy among your predictor variables first before you deal with irrelevancy of the predictor variables to the target variable. Including redundant variables increases the risk of over-fitting because your model has become overly complex and might be too sensitive to the peculiarities in the sample and therefore will not generalize well to new data. The performance of the variable selection methods such as stepwise and backward will be compromised if you have a high degree of multicollinearity among your predictor variables.   

 

This is a knowledge-sharing community for learners in the Academy. Find answers to your questions or post here for a reply.
To ensure your success, use these getting-started resources:

Estimating Your Study Time
Reserving Software Lab Time
Most Commonly Asked Questions
Troubleshooting Your SAS-Hadoop Training Environment

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 3477 views
  • 1 like
  • 4 in conversation