Re: The strange Mallows' Cp selection result (Proc REG)

Posted 03-28-2013 04:14 AM
(2664 views)

Hello everyone,

Here is the case.

I have to select the regression model that best fits the given data.

I use a selection method based on Mallows' Cp statistic, and it worked fine for every case and piece of data until I caught something strange.

Here is the result (produced by PROC REG).

Number In Model | Cp | R-Square | Adjusted R-Square | AIC | BIC | Variables in model |
---|---|---|---|---|---|---|
5 | . | 1.0000 | . | . | . | r00 r01 r02 r03 r04 |
4 | . | 0.9917 | 0.9583 | 26.4694 | 16.4694 | r00 r02 r03 r04 |
4 | . | 0.9615 | 0.8076 | 35.6473 | 25.6473 | r00 r01 r02 r03 |
4 | . | 0.9549 | 0.7745 | 36.6003 | 26.6003 | r01 r02 r03 r04 |
4 | . | 0.9487 | 0.7437 | 37.3680 | 27.3680 | r00 r01 r03 r04 |
4 | . | 0.9450 | 0.7248 | 37.7931 | 27.7931 | r00 r01 r02 r04 |
3 | . | 0.9438 | 0.8596 | 35.9159 | 27.9159 | r01 r03 r04 |
3 | . | 0.9436 | 0.8590 | 35.9415 | 27.9415 | r00 r01 r04 |
3 | . | 0.9177 | 0.7942 | 38.2105 | 30.2105 | r00 r01 r03 |
3 | . | 0.8965 | 0.7413 | 39.5815 | 31.5815 | r01 r02 r03 |
2 | . | 0.8670 | 0.7783 | 39.0883 | 33.0883 | r01 r03 |
3 | . | 0.8636 | 0.6591 | 41.2374 | 33.2374 | r02 r03 r04 |
3 | . | 0.8574 | 0.6436 | 41.5041 | 33.5041 | r00 r01 r02 |
3 | . | 0.8573 | 0.6433 | 41.5095 | 33.5095 | r00 r02 r03 |
2 | . | 0.8521 | 0.7536 | 39.7233 | 33.7233 | r00 r01 |
3 | . | 0.8470 | 0.6174 | 41.9294 | 33.9294 | r01 r02 r04 |
2 | . | 0.8468 | 0.7447 | 39.9358 | 33.9358 | r01 r04 |
2 | . | 0.8467 | 0.7446 | 39.9384 | 33.9384 | r01 r02 |
1 | . | 0.8467 | 0.8084 | 37.9385 | 33.9385 | r01 |
2 | . | 0.8454 | 0.7424 | 39.9892 | 33.9892 | r02 r03 |
3 | . | 0.8433 | 0.6082 | 42.0732 | 34.0732 | r00 r02 r04 |
2 | . | 0.8413 | 0.7354 | 40.1489 | 34.1489 | r00 r02 |
2 | . | 0.8212 | 0.7019 | 40.8646 | 34.8646 | r02 r04 |
3 | . | 0.8204 | 0.5509 | 42.8909 | 34.8909 | r00 r03 r04 |
2 | . | 0.8204 | 0.7006 | 40.8911 | 34.8911 | r00 r03 |
2 | . | 0.8204 | 0.7006 | 40.8914 | 34.8914 | r00 r04 |
1 | . | 0.8202 | 0.7752 | 38.8980 | 34.8980 | r00 |
2 | . | 0.8196 | 0.6993 | 40.9183 | 34.9183 | r03 r04 |
1 | . | 0.8101 | 0.7627 | 39.2235 | 35.2235 | r04 |
1 | . | 0.7923 | 0.7404 | 39.7619 | 35.7619 | r03 |
1 | . | 0.5200 | 0.4000 | 44.7881 | 40.7881 | r02 |

**Does anyone know what happened to Cp? And why?**

**SAS prints no warnings, no notification about that.**

Here's the underlying data and problem design:

I have prices (of what doesn't really matter) by state (region).

I know the values of these prices one step ahead.

In each state I have a bunch of participants (traders), who have to buy from one or more regions.

Task: I need to find a price that describes the average price for traders.

Skipping the data analysis step: I found that a trader's price is strongly correlated with the state price (which is natural).

Most of the traders can buy in only one state, so there we simply use a predefined linear model.

But some of the traders buy from two or more regions, and I generally can't use the information about exactly which states.

So I decided to use a regression selection algorithm based on the Cp statistic.

**And it works great for every trader except one.**

**Could it be data-specific? (There are no missing values in the input dataset.)**

Thanks in advance!

1 ACCEPTED SOLUTION


The estimate of sigma squared, the error variance, used in the denominator of the first term of Mallows' Cp statistic, is the mean squared error from the full model. Since the five variables in your data form a full model with an R-squared statistic equal to 1.00 (implying a perfect model fit), this mean squared error equals 0. Since division by a denominator of 0 yields an infinite estimate for the first term of the Cp statistic, SAS does not print this statistic.

Solutions to this problem would be to get more data, use fewer independent variables, or apply a different model/functional form to these data. Such a perfect model fit implies either that more data would "break" your model or that the independent variables you selected yield a linear combination that perfectly mimics your dependent variable. For example, you could generate a dependent variable that simply sums various combinations of your five independent variables. This wouldn't be a very informative model.
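For reference, the statistic described above is usually written as shown below. The notation is my own shorthand rather than anything printed by PROC REG: $n$ is the number of observations, $p$ the number of parameters in the candidate model (counting the intercept when there is one), $\mathrm{SSE}_p$ its error sum of squares, and $s^2$ the mean squared error of the full model with $p_{\text{full}}$ parameters.

```latex
C_p = \frac{\mathrm{SSE}_p}{s^2} - (n - 2p),
\qquad
s^2 = \frac{\mathrm{SSE}_{\text{full}}}{n - p_{\text{full}}}
```

When the full model fits perfectly, $\mathrm{SSE}_{\text{full}} = 0$, so $s^2$ is zero (or, if no error degrees of freedom are left because $n = p_{\text{full}}$, the undefined ratio $0/0$). Either way the first term cannot be evaluated, which is consistent with the missing Cp values in the table.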

3 REPLIES 3


**upd!**

Somehow, this problem goes away if I exclude the intercept from model selection.

Cp is calculated properly and model selection works great!

**Should I send all this to support?**
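One plausible reading of why dropping the intercept helps is that it frees up one error degree of freedom in the full model. The sketch below is not SAS code and does not use the original data: it assumes 6 observations and 5 regressors, filled with random numbers as stand-ins for r00..r04, and the helper names `fit_sse` and `mallows_cp` are made up for illustration.

```python
import numpy as np

def fit_sse(X, y):
    """Least-squares fit; return (error sum of squares, number of parameters)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid), X.shape[1]

def mallows_cp(X_sub, X_full, y):
    """Mallows' Cp of a submodel, or None when the full-model MSE is unusable."""
    n = len(y)
    sse_full, p_full = fit_sse(X_full, y)
    df_error = n - p_full
    if df_error <= 0:
        return None  # no error degrees of freedom: MSE is 0/0
    mse_full = sse_full / df_error
    if mse_full == 0.0:
        return None  # perfect fit: division by zero
    sse_sub, p_sub = fit_sse(X_sub, y)
    return sse_sub / mse_full - (n - 2 * p_sub)

# Synthetic data only (assumption): 6 observations, 5 regressors.
rng = np.random.default_rng(0)
n = 6
R = rng.normal(size=(n, 5))
y = rng.normal(size=n)
ones = np.ones((n, 1))

# With an intercept the full model has 6 parameters for 6 observations.
cp_with_intercept = mallows_cp(np.hstack([ones, R[:, :2]]), np.hstack([ones, R]), y)

# Without an intercept the full model has 5 parameters, leaving 1 error df.
cp_without_intercept = mallows_cp(R[:, :2], R, y)

print("Cp with intercept:   ", cp_with_intercept)
print("Cp without intercept:", cp_without_intercept)
```

With the intercept the function returns `None`, mirroring the missing Cp column; without it a finite value comes back. Note that a single error degree of freedom still gives a very rough variance estimate, so the resulting Cp values should be treated with caution.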


Exactly! Somehow I missed that...

Well, actually the assumption that my regressors are independent would be wrong, since (as follows from the underlying problem) all prices are highly correlated and dependent by design. But getting a perfect combination is just a coincidence.

As I mentioned before, I solved the problem by excluding the intercept, which I should have done from the very beginning...

Thank you very much!

**+**

**And yes... this case was the only one with a short data series (length 6).**
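The short series may explain the perfect fit by itself: with 6 observations, an intercept plus 5 regressors gives 6 parameters, and a model with as many parameters as observations reproduces any response exactly as long as the design matrix has full rank. A minimal sketch on purely random data (an assumption for illustration, not the original prices; `r_squared` is a made-up helper):

```python
import numpy as np

def r_squared(X, y):
    """R-squared of a least-squares fit of y on X plus an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    sst = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - float(resid @ resid) / sst

rng = np.random.default_rng(1)

# Five unrelated random datasets, each with 6 observations and 5 regressors.
r2_values = [
    r_squared(rng.normal(size=(6, 5)), rng.normal(size=6))
    for _ in range(5)
]
print(r2_values)
```

Every R-squared here is 1 up to rounding error even though the regressors and response are unrelated noise, so a perfect fit on a length-6 series says little about the regressors themselves.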
