05-18-2012 11:42 AM

I know that when we build a predictive model, we don't want to use too many predictor variables because of the potential for overfitting.

But why do "too many" predictor variables result in overfitting?

The "curse of dimensionality" usually comes up in data mining; how should one understand the "curse of dimensionality" in statistics? Does it have anything to do with overfitting?

Thank you very much.


Posted in reply to doudou66

05-18-2012 03:07 PM

There are at least two curses to watch out for, in my opinion. Overfitting is one: the more variables you have, the more easily you can fit even purely random results. Remember, if I have ten data points and nine predictor variables (plus an intercept), the model can fit the data exactly, leaving no residual degrees of freedom. To gain any information about the variables in the model, we need substantially more data points than parameters. The other curse is, loosely speaking, collinearity: some variables are just too closely related in some n-space to provide independent information.
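To make that first point concrete, here is a minimal sketch (the data set name and seed are arbitrary, not from anyone's real analysis) that fits a saturated model to pure noise:

/* Ten observations, nine unrelated uniform predictors */
data saturated;
   call streaminit(1234);
   array x{9};
   do i = 1 to 10;
      do j = 1 to 9;
         x{j} = rand("UNIFORM");
      end;
      y = rand("UNIFORM");   /* response is also pure noise */
      output;
   end;
run;

proc reg data=saturated;
   model y = x1-x9;   /* 10 parameters for 10 points: R-square = 1 */
run;

PROC REG will report an R-square of 1 and no error degrees of freedom, even though y is unrelated to every x.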

And then there is the problem of prediction. Simply fitting the existing data becomes easier and easier as we add variables, but that does not guarantee the ability to predict. A validation data set is needed, or at least a good cross-validation plan. For example, minimizing the mean squared prediction error on new data is a different task from minimizing the residual variability in a training set.
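One possible way to put that into practice is a sketch like the following (it assumes PROC GLMSELECT is available; the data set and variable names are made up):

/* Hold out a random 30% validation set and let it choose the model */
proc glmselect data=mydata seed=20120518 plots=none;
   partition fraction(validate=0.3);
   model y = x1-x20 / selection=stepwise(choose=validate select=sl);
run;

Here the stepwise sequence is driven by the training observations, but the final model is the one with the smallest error on the held-out validation observations, not the one with the best training-set fit.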

Steve Denham


Posted in reply to doudou66

05-18-2012 03:47 PM

To get a feel for what overfitting can do, try the following little experiment:

/* 10 repetitions of 15 observations on 20 independent uniform variables */
data test;
   array x{20};
   call streaminit(98767);
   do rep = 1 to 10;
      do i = 1 to 15;
         do j = 1 to 20;
            x{j} = rand("UNIFORM");
         end;
         output;
      end;
   end;
run;

/* For each repetition, pick the 10-variable model with the best adjusted R-square */
proc reg data=test outest=outest;
   by rep;
   model x1 = x2-x20 / selection=adjrsq best=1 start=10 stop=10;
run;

proc print data=outest;
   format x2-x20 4.1;
   id rep;
   var x2-x20 _rsq_;
run;

Each repetition has 15 observations of 20 totally unrelated (independent) random variables. In each of the 10 repetitions, the "best" 10-variable linear model "explains" the first variable almost perfectly. This is an extreme example of overfitting, but in real life the effect can be subtle enough to go unnoticed.
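To see the flip side, one could score fresh data with the selected coefficients and watch the apparent fit collapse. A rough sketch (not part of the experiment above; it assumes the default MODEL1 label in the OUTEST= data set from PROC REG):

/* Fresh data drawn the same way, with a new, arbitrary seed */
data fresh;
   array x{20};
   call streaminit(13579);
   do rep = 1 to 10;
      do i = 1 to 15;
         do j = 1 to 20;
            x{j} = rand("UNIFORM");
         end;
         output;
      end;
   end;
run;

/* Apply each repetition's fitted coefficients to the new data */
proc score data=fresh score=outest out=scored type=parms predict;
   by rep;
   var x2-x20;
run;

/* Correlate the predictions (MODEL1) with the actual x1 values */
proc corr data=scored;
   by rep;
   var x1 MODEL1;
run;

On the training data the selected models look nearly perfect; on the fresh data the correlation between prediction and x1 should hover around zero.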

PG



Posted in reply to doudou66

05-21-2012 06:21 PM

I truly apologize for the late response, and thank you all very much for the help.