Re: Predictive Modeling Using Logistic Regression
Would it be possible to clarify why the presence of redundant inputs may increase the risk of overfitting (see page 3-34 of the course text)?
Accepted Solutions
Hi @pvareschi
Redundant variables make the model more complex than it needs to be, since they increase the number of predictors without adding new information. More complex models are more prone to overfitting, because the extra inputs give the model more freedom to "learn" noise in the training sample rather than the real-world signal.
Best,
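A quick way to see this effect is to compare the train/test accuracy gap of a logistic regression fit on one informative input versus the same input plus many noisy near-copies of it. This is a minimal sketch in Python with NumPy and scikit-learn; the data, seed, and noise level are simulated purely for illustration, and the exact gap will vary from run to run.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 1))                          # one informative input
y = (x[:, 0] + rng.normal(size=n) > 0).astype(int)   # noisy binary target

# Redundant inputs: 20 noisy near-copies of the same informative input
X_redundant = np.hstack([x, x + rng.normal(scale=0.1, size=(n, 20))])

def train_test_gap(X, y):
    """Train-set accuracy minus test-set accuracy (larger gap = more overfit)."""
    tr, te = slice(0, n // 2), slice(n // 2, n)
    model = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    return model.score(X[tr], y[tr]) - model.score(X[te], y[te])

gap_simple = train_test_gap(x, y)
gap_redundant = train_test_gap(X_redundant, y)
```

With enough redundant copies, the model can exploit the small differences among them to fit noise in the training half, which tends to widen the gap between training and test accuracy.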
👍 Thank you!
Redundant variables can also cause the regression coefficients to swing wildly, to the point where some wind up with the wrong sign. This leads to unstable models and coefficients that are not interpretable.
In more statistical terms, high correlation among the predictor variables inflates the variance of the coefficient estimates, meaning the estimates can fall far from the true values.
The above holds true for most modeling techniques. It does not hold for Partial Least Squares, which can be used in the presence of redundant variables and is much less susceptible to these issues.
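Here's a small simulation of that variance inflation (a sketch in Python/NumPy; the true coefficients, correlation level, and sample size are made-up values for illustration): the same two-predictor model is fit on many simulated datasets, once with independent predictors and once with nearly redundant ones, and the spread of the first coefficient estimate is compared.

```python
import numpy as np

rng = np.random.default_rng(1)

def coef_spread(rho, reps=200, n=50):
    """Std. dev. of the first OLS coefficient across simulated datasets."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    coefs = []
    for _ in range(reps):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = X[:, 0] + X[:, 1] + rng.normal(size=n)   # true coefficients: 1, 1
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        coefs.append(beta[0])
    return float(np.std(coefs))

sd_indep = coef_spread(rho=0.0)    # independent predictors
sd_corr = coef_spread(rho=0.99)    # nearly redundant predictors
```

With correlation 0.99 the coefficient estimates scatter far more widely around the true value of 1, which is exactly why individual coefficients can flip sign from sample to sample.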
Paige Miller
I highly recommend that you reduce redundancy among your predictor variables before dealing with predictors that are irrelevant to the target variable. Including redundant variables increases the risk of overfitting because the model becomes overly complex: it may be too sensitive to peculiarities of the training sample and therefore will not generalize well to new data. Variable selection methods such as stepwise and backward selection will also perform poorly if there is a high degree of multicollinearity among your predictor variables.
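One common way to quantify redundancy before running stepwise or backward selection is the variance inflation factor (VIF): regress each predictor on all the others and compute 1/(1 - R²). Below is a hand-rolled sketch in Python/NumPy; the helper name `vif`, the simulated data, and the "VIF > 5" rule of thumb are illustrative assumptions, not part of the course text.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n x p)."""
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])    # intercept + other predictors
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / max(1e-12, 1.0 - r2))       # VIF_j = 1 / (1 - R^2_j)
    return np.array(out)

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly redundant with x1
x3 = rng.normal(size=n)                   # independent predictor
vifs = vif(np.column_stack([x1, x2, x3]))
```

A common rule of thumb flags predictors with VIF above 5 or 10 as candidates for removal (or combination) before any variable selection is run.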