Programming the statistical procedures from SAS

how to understand "Curse of dimensionality" in statistics?

Contributor
Posts: 24

I know that when we build a predictive model, we don't want to use too many predictor variables because of the potential for overfitting.

But why do "too many" predictor variables result in overfitting?

The "curse of dimensionality" is usually mentioned in data mining; how should it be understood in statistics? Does it have anything to do with overfitting?

Thank you very much.

Respected Advisor
Posts: 2,655

Re: how to understand "Curse of dimensionality" in statistics?

There are at least two curses to watch out for, in my opinion.  Overfitting is one: the more variables you have, the more likely you can fit even the most random process results.  Remember, if I have ten data points and nine variables (plus an intercept), there is generally exactly one solution satisfying all constraints, and it passes through every point no matter what the data are.  To gain any information on the variables that are in the model, we need substantially more data points.  The other is, loosely speaking, collinearity.  Some variables are just too closely related in some n-space to provide independent information.
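The ten-points-and-nine-variables case can be written out as a short worked equation. This is only a sketch: the notation (n observations, p predictors, design matrix X with an intercept column) is assumed for illustration and does not come from the post itself.

```latex
% Assumed notation: n observations, p predictors plus an intercept column.
y = X\beta + \varepsilon, \qquad X \in \mathbb{R}^{n \times (p+1)}

% With n = 10 and p = 9, X is a square 10 x 10 matrix. If X is invertible:
\hat{\beta} = X^{-1} y, \qquad
\hat{y} = X\hat{\beta} = y, \qquad
\text{residual df} = n - (p + 1) = 0, \qquad
R^2 = 1
```

The fit is perfect for any response y, including pure noise, and no degrees of freedom remain to estimate the error variance.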

And then there comes the problem of prediction.  Simply fitting the existing data becomes easier and easier as we add variables.  However, that does not guarantee the ability to predict.  A validation data set is needed, or at least a good cross-validation plan.  For example, minimizing the mean squared prediction error is a different task than minimizing the residual variability in a training set.
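The gap between training fit and prediction error can be illustrated with a simple holdout check. The following is a minimal sketch, not a full validation plan: the data set and variable names (noise, y_fit, holdout, pred, sq_err) are hypothetical, and the data are pure noise, so any apparent fit is spurious. It relies on the fact that PROC REG leaves rows with a missing response out of the fit but still writes predicted values for them.

```sas
/* Hypothetical holdout check on pure noise; all names are illustrative */
data noise;
  array x{11};                   /* creates x1-x11 */
  call streaminit(98767);
  do i = 1 to 30;
    do j = 1 to 11;
      x{j} = rand("UNIFORM");
    end;
    holdout = (i > 15);          /* last 15 rows are kept out of the fit */
    if holdout then y_fit = .;   /* missing response: row is skipped when fitting */
    else y_fit = x1;
    output;
  end;
  drop j;
run;

/* Fit x1 on 10 unrelated predictors using the 15 training rows only */
proc reg data=noise noprint;
  model y_fit = x2-x11;
  output out=pred p=yhat;        /* predictions are written for holdout rows too */
run;
quit;

/* Squared error against the true x1, available in every row */
data pred;
  set pred;
  sq_err = (x1 - yhat)**2;
run;

/* Mean squared error: training rows (holdout=0) vs. holdout rows (holdout=1) */
proc means data=pred mean;
  class holdout;
  var sq_err;
run;
```

With 15 training rows and 11 estimated coefficients, the training mean squared error should usually come out much smaller than the holdout value, although the exact numbers depend on the seed.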

Steve Denham

Respected Advisor
Posts: 4,750

Re: how to understand "Curse of dimensionality" in statistics?

To get a feel for what overfitting can do, try the following little experiment:

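/* Simulate 10 repetitions of 15 observations on 20 independent uniform variables */
data test;
  array x{20};                   /* creates x1-x20 */
  call streaminit(98767);        /* fixed seed so the results are reproducible */
  do rep = 1 to 10;
    do i = 1 to 15;
      do j = 1 to 20;
        x{j} = rand("UNIFORM");
      end;
      output;                    /* one row per observation */
    end;
  end;
run;

/* For each repetition, keep the best 10-variable model for x1 by adjusted R-square */
proc reg data=test outest=outest;
  by rep;
  model x1 = x2-x20 / selection=adjrsq best=1 start=10 stop=10;
run;

/* Show the selected coefficients (missing = not selected) and the R-square */
proc print data=outest;
  format x2-x20 4.1;
  id rep;
  var x2-x20 _rsq_;
run;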

Each repetition contains 15 observations of 20 totally unrelated (independent) random variables. In each of the 10 repetitions, the "best" linear model involving 10 variables "explains" the first variable with near perfection, even though there is nothing to explain. This is an extreme example of overfitting, but in real life the effect can be subtle enough to go unnoticed.
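A rough calculation suggests why the R-square values in this experiment come out so high. This is a sketch under an assumption the experiment does not meet exactly: the Beta result below holds for normally distributed errors, while the variables here are uniform, so the numbers should be read as approximate. The symbols n (observations) and p (predictors in one fixed model) are introduced here for illustration.

```latex
% For one fixed set of p pure-noise predictors and n observations (normal errors assumed):
R^2 \sim \mathrm{Beta}\!\left(\tfrac{p}{2},\ \tfrac{n-p-1}{2}\right), \qquad
\mathbb{E}\!\left[R^2\right] = \frac{p}{n-1} = \frac{10}{14} \approx 0.71

% Number of 10-variable subsets searched among the 19 candidate predictors:
\binom{19}{10} = 92{,}378
```

So even a single arbitrary 10-variable model would be expected to "explain" roughly 70% of the variance by chance, and picking the best of 92,378 candidate models pushes the reported value well above that.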

PG

Contributor
Posts: 24

Re: how to understand "Curse of dimensionality" in statistics?

I truly apologize for the late response, and thank you all very much for your help.
