The strange Mallows' Cp selection result (Proc REG...

03-28-2013 04:14 AM

Hello everyone,

Here is the case.

I have to select the best regression model that fits given data.

I use the selection method based on Mallows' Cp statistic, and it worked fine for every case and piece of data until I caught something strange.

Here is the result (produced by proc REG).

Number In Model | Cp | R-Square | Adjusted R-Square | AIC | BIC | Variables in model |
---|---|---|---|---|---|---|
5 | . | 1.0000 | . | . | . | r00 r01 r02 r03 r04 |
4 | . | 0.9917 | 0.9583 | 26.4694 | 16.4694 | r00 r02 r03 r04 |
4 | . | 0.9615 | 0.8076 | 35.6473 | 25.6473 | r00 r01 r02 r03 |
4 | . | 0.9549 | 0.7745 | 36.6003 | 26.6003 | r01 r02 r03 r04 |
4 | . | 0.9487 | 0.7437 | 37.3680 | 27.3680 | r00 r01 r03 r04 |
4 | . | 0.9450 | 0.7248 | 37.7931 | 27.7931 | r00 r01 r02 r04 |
3 | . | 0.9438 | 0.8596 | 35.9159 | 27.9159 | r01 r03 r04 |
3 | . | 0.9436 | 0.8590 | 35.9415 | 27.9415 | r00 r01 r04 |
3 | . | 0.9177 | 0.7942 | 38.2105 | 30.2105 | r00 r01 r03 |
3 | . | 0.8965 | 0.7413 | 39.5815 | 31.5815 | r01 r02 r03 |
2 | . | 0.8670 | 0.7783 | 39.0883 | 33.0883 | r01 r03 |
3 | . | 0.8636 | 0.6591 | 41.2374 | 33.2374 | r02 r03 r04 |
3 | . | 0.8574 | 0.6436 | 41.5041 | 33.5041 | r00 r01 r02 |
3 | . | 0.8573 | 0.6433 | 41.5095 | 33.5095 | r00 r02 r03 |
2 | . | 0.8521 | 0.7536 | 39.7233 | 33.7233 | r00 r01 |
3 | . | 0.8470 | 0.6174 | 41.9294 | 33.9294 | r01 r02 r04 |
2 | . | 0.8468 | 0.7447 | 39.9358 | 33.9358 | r01 r04 |
2 | . | 0.8467 | 0.7446 | 39.9384 | 33.9384 | r01 r02 |
1 | . | 0.8467 | 0.8084 | 37.9385 | 33.9385 | r01 |
2 | . | 0.8454 | 0.7424 | 39.9892 | 33.9892 | r02 r03 |
3 | . | 0.8433 | 0.6082 | 42.0732 | 34.0732 | r00 r02 r04 |
2 | . | 0.8413 | 0.7354 | 40.1489 | 34.1489 | r00 r02 |
2 | . | 0.8212 | 0.7019 | 40.8646 | 34.8646 | r02 r04 |
3 | . | 0.8204 | 0.5509 | 42.8909 | 34.8909 | r00 r03 r04 |
2 | . | 0.8204 | 0.7006 | 40.8911 | 34.8911 | r00 r03 |
2 | . | 0.8204 | 0.7006 | 40.8914 | 34.8914 | r00 r04 |
1 | . | 0.8202 | 0.7752 | 38.8980 | 34.8980 | r00 |
2 | . | 0.8196 | 0.6993 | 40.9183 | 34.9183 | r03 r04 |
1 | . | 0.8101 | 0.7627 | 39.2235 | 35.2235 | r04 |
1 | . | 0.7923 | 0.7404 | 39.7619 | 35.7619 | r03 |
1 | . | 0.5200 | 0.4000 | 44.7881 | 40.7881 | r02 |

**Does anyone know what happened to Cp? And why?**

**SAS prints no warnings or notes about this.**

Here's the underlying data and problem design:

I have prices (it doesn't really matter of what) by state (region).

I know the values of these prices one step ahead.

In each state I have a number of participants (traders) who have to buy from one or more regions.

Task: I need to find a price which describes the average price for traders.

Skipping the data analysis step: I found that a trader's price is strongly correlated with the state price (which is natural).

Most of the traders can buy in only one state, so there we simply use a predefined linear model.

But some of the traders buy from two or more regions, and I generally can't use the information about which states exactly.

So I decided to use a regression selection algorithm based on the Cp statistic.

**And it works great for every trader except one.**

**Could it be data-specific? (There are no missing values in the input dataset.)**

Thanks in advance!

Accepted Solutions

03-28-2013 09:22 AM

The estimate of sigma squared (the error variance) used in the denominator of the first term of Mallows' Cp statistic is the mean squared error from the full model. Since the five variables in your data form a full model with an R-squared of 1.00 (implying a perfect fit), this mean squared error equals 0. Dividing by 0 would make the first term of the Cp statistic infinite, so SAS does not print the statistic.

Solutions to this problem would be to get more data, use fewer independent variables, or apply a different model/functional form to these data. Such a perfect model fit implies either that more data would "break" your model or that the independent variables you selected yield a linear combination that perfectly mimics your dependent variable. For example, you could generate a dependent variable that simply sums various combinations of your five independent variables. That wouldn't be a very informative model.
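
For reference, here is a minimal Python sketch of the calculation described above. It assumes the usual definition Cp = SSE_p / s^2 - (n - 2p), where s^2 is the full-model mean squared error and p counts every estimated parameter (intercept included, if fitted). The function name `mallows_cp`, its arguments, and the numbers in the usage lines are illustrative assumptions, not SAS internals. Returning `None` stands in for the `.` that PROC REG prints.

```python
from typing import Optional


def mallows_cp(sse_p: float, p: int, n: int,
               sse_full: float, p_full: int) -> Optional[float]:
    """Mallows' Cp for a candidate model with p parameters.

    sse_p    -- error sum of squares of the candidate model
    sse_full -- error sum of squares of the full model
    p_full   -- number of parameters in the full model
    Returns None when the full-model MSE cannot serve as a denominator.
    """
    df_error = n - p_full          # error degrees of freedom of the full model
    if df_error <= 0:
        return None                # saturated full model: s^2 is undefined
    s2 = sse_full / df_error       # full-model mean squared error
    if s2 <= 0:
        return None                # perfect fit: would divide by zero
    return sse_p / s2 - (n - 2 * p)


# Situation like the one in this thread: the full model fits perfectly
# (hypothetical SSE for the candidate model).
print(mallows_cp(sse_p=1.5, p=2, n=6, sse_full=0.0, p_full=6))

# Hypothetical case with more observations, where Cp is well defined.
print(mallows_cp(sse_p=12.0, p=3, n=20, sse_full=8.0, p_full=6))
```

A quick sanity check on the formula: for the full model itself, Cp equals the number of parameters p_full, which is why candidate models with Cp close to p are usually preferred.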

All Replies

03-28-2013 07:20 AM

**Update!**

Somehow, this problem goes away if I exclude the intercept from model selection.

Cp is calculated well and model selection works great!

**Should I send all this to support?**

03-28-2013 02:59 PM

Exactly! Somehow I missed that...

Well, actually the assumption that my regressors are independent would be wrong, since (as follows from the underlying data) all prices are highly correlated and dependent by design. But getting the perfect combination is just a coincidence.

As I mentioned before, I solved the problem by removing the intercept, which I should have done from the very beginning...

Thank you very much!

**+**

**And yes... this case was the only one with a short data series (length 6).**
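
A short Python sketch of why the series length matters, assuming the intercept counts as one estimated parameter. With 6 observations, five regressors plus an intercept leave 0 error degrees of freedom, while dropping the intercept leaves 1. The helper names are hypothetical. The adjusted R-square formula is the standard one for models with an intercept, and with n = 6 it reproduces the values in the table from the original post (up to rounding of the printed R-square).

```python
def error_df(n_obs: int, n_regressors: int, intercept: bool = True) -> int:
    """Error degrees of freedom: observations minus estimated parameters."""
    return n_obs - n_regressors - (1 if intercept else 0)


def adjusted_r2(r2: float, n_obs: int, n_regressors: int) -> float:
    """Adjusted R-square for a model that includes an intercept."""
    p = n_regressors + 1  # parameters, counting the intercept
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - p)


# Full model from the thread: 6 observations, 5 regressors.
print(error_df(6, 5, intercept=True))   # 0
print(error_df(6, 5, intercept=False))  # 1

# Consistency check against two one-variable rows of the table (n = 6).
print(round(adjusted_r2(0.8467, 6, 1), 4))  # table lists 0.8084 for r01
print(round(adjusted_r2(0.5200, 6, 1), 4))  # table lists 0.4000 for r02
```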