doudou66
Calcite | Level 5

I know that when we build a predictive model, we don't want to use too many predictor variables because of the potential for overfitting.

But why do "too many" predictor variables result in overfitting?

The "curse of dimensionality" is usually mentioned in data mining; how should it be understood in statistics? Does it have anything to do with overfitting?

Thank you very much.

SteveDenham
Jade | Level 19

There are at least two curses to watch out for, in my opinion.  Overfitting is one--the more variables you have, the more likely you are to fit even the results of a purely random process.  Remember, if I have ten data points and nine variables, there is at most one solution satisfying all constraints, and no degrees of freedom are left over to judge how well it fits.  To gain any information on the variables that are in the model, we need substantially more data points.  The other is, loosely speaking, collinearity.  Some variables are just too closely related in some n-space to provide independent information.
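
The ten-points, nine-variables case can be written out as a short worked sketch. It assumes an ordinary least-squares model with an intercept and a full-rank design matrix; the symbols n, p, SSE and SST are introduced here for illustration and are not from the post above.

```latex
% Linear model with an intercept and p predictors, fitted to n observations:
%   y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i , \qquad i = 1,\dots,n
\begin{align*}
  \text{residual degrees of freedom} &= n - p - 1 \\
  n = 10,\ p = 9 \;&\Rightarrow\; n - p - 1 = 0 \\
  &\Rightarrow\; \mathrm{SSE} = 0, \qquad
     R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} = 1
     \quad \text{whatever the values of } y .
\end{align*}
% More generally, when y is unrelated to the predictors
% (i.i.d. responses, independent of the x's):
\[
  \mathbb{E}\!\left[R^2\right] = \frac{p}{n-1},
\]
% so the apparent fit rises toward 1 as p approaches n - 1,
% even though there is nothing to explain.
```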

And then there comes the problem of prediction.  Simply fitting the existing data becomes easier and easier as we add variables.  However, that does not guarantee the ability to predict.  A validation data set is needed, or at least a good cross-validation plan.  For example, minimizing the mean squared prediction error is a different task than minimizing the residual variability in a training set.
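
One way to see the gap between fitting and predicting is to hold out part of a pure-noise data set. The following is only a sketch, assuming SAS/STAT is available; the data set name noise, the 200 observations, both seeds and the 50/50 split are arbitrary choices made for illustration.

```sas
/* Pure noise: x1 is unrelated to x2-x20 by construction */
data noise;
  array x{20};
  call streaminit(2468);          /* arbitrary seed for reproducibility */
  do i = 1 to 200;
    do j = 1 to 20;
      x{j} = rand("UNIFORM");
    end;
    output;
  end;
  drop i j;
run;

/* Forward selection that keeps adding predictors (stop=none), then
   chooses the step with the smallest validation error (choose=validate) */
proc glmselect data=noise seed=1357;
  partition fraction(validate=0.5);   /* hold out about half the rows */
  model x1 = x2-x20 / selection=forward(stop=none choose=validate);
run;
```

The selection summary should list the average squared error on the training rows next to the same quantity on the validation rows. The training error can only go down as predictors are added; with noise like this, the validation error would not be expected to follow it.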

Steve Denham

PGStats
Opal | Level 21

To get a feel for what overfitting can do, try the following little experiment:

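data test;
  array x{20};                    /* creates x1-x20 */
  call streaminit(98767);         /* fix the seed so the run is reproducible */
  do rep = 1 to 10;               /* 10 independent repetitions */
    do i = 1 to 15;               /* 15 observations per repetition */
      do j = 1 to 20;
        x{j} = rand("UNIFORM");   /* 20 independent uniform variables */
      end;
      output;
    end;
  end;
run;

/* For each repetition, find the 10-predictor model for x1
   with the best adjusted R-square */
proc reg data=test outest=outest;
  by rep;
  model x1 = x2-x20 / selection=adjrsq best=1 start=10 stop=10;
run;

/* Print the selected coefficients and the R-square of each "best" model */
proc print data=outest;
  format x2-x20 4.1;
  id rep;
  var x2-x20 _rsq_;
run;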

There are 15 observations of 20 totally unrelated (independent) random variables. In each of 10 repetitions, the "best" linear model involving 10 variables "explains" the first variable with near perfection. This is an extreme example of overfitting, but in real life the effect can be subtle enough to go unnoticed.
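
A rough back-of-the-envelope calculation suggests why the selected models look so good. This is a sketch under the assumption that the regression includes an intercept (the PROC REG default); the symbols n, p and R^2_adj are introduced here for illustration.

```latex
% Experiment above: n = 15 observations, p = 10 predictors chosen from 19 candidates.
\begin{align*}
  \text{number of candidate 10-variable models} &= \binom{19}{10} = 92378 \\[4pt]
  \text{expected } R^2 \text{ of one fixed 10-variable model on pure noise}
     &= \frac{p}{n-1} = \frac{10}{14} \approx 0.71 \\[4pt]
  R^2_{\mathrm{adj}} &= 1 - \left(1 - R^2\right)\frac{n-1}{n-p-1}
\end{align*}
% With p held fixed at 10, ranking models by adjusted R^2 is the same as
% ranking them by R^2. The reported model is the maximum over 92378 fits,
% each of which already starts from an expected R^2 of about 0.71, so the
% winner lands far above that baseline without any real relationship.
```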

PG
doudou66
Calcite | Level 5

I truly apologize for the late response, and thank you all very much for your help.
