Solved: The strange Mallows' Cp selection result (Proc REG)

ghastly_kitten · Posted 03-28-2013 04:14 AM

Hello everyone,

Here is the case.

I have to select the best regression model that fits given data.

I use a selection method based on Mallows' Cp statistic, and it worked fine for every case and piece of data until I caught something strange.

Here is the result (produced by PROC REG).

Number In Model	Cp	R-Square	Adjusted R-Square	AIC	BIC	Variables in model
5	.	1.0000	.	.	.	r00 r01 r02 r03 r04
4	.	0.9917	0.9583	26.4694	16.4694	r00 r02 r03 r04
4	.	0.9615	0.8076	35.6473	25.6473	r00 r01 r02 r03
4	.	0.9549	0.7745	36.6003	26.6003	r01 r02 r03 r04
4	.	0.9487	0.7437	37.3680	27.3680	r00 r01 r03 r04
4	.	0.9450	0.7248	37.7931	27.7931	r00 r01 r02 r04
3	.	0.9438	0.8596	35.9159	27.9159	r01 r03 r04
3	.	0.9436	0.8590	35.9415	27.9415	r00 r01 r04
3	.	0.9177	0.7942	38.2105	30.2105	r00 r01 r03
3	.	0.8965	0.7413	39.5815	31.5815	r01 r02 r03
2	.	0.8670	0.7783	39.0883	33.0883	r01 r03
3	.	0.8636	0.6591	41.2374	33.2374	r02 r03 r04
3	.	0.8574	0.6436	41.5041	33.5041	r00 r01 r02
3	.	0.8573	0.6433	41.5095	33.5095	r00 r02 r03
2	.	0.8521	0.7536	39.7233	33.7233	r00 r01
3	.	0.8470	0.6174	41.9294	33.9294	r01 r02 r04
2	.	0.8468	0.7447	39.9358	33.9358	r01 r04
2	.	0.8467	0.7446	39.9384	33.9384	r01 r02
1	.	0.8467	0.8084	37.9385	33.9385	r01
2	.	0.8454	0.7424	39.9892	33.9892	r02 r03
3	.	0.8433	0.6082	42.0732	34.0732	r00 r02 r04
2	.	0.8413	0.7354	40.1489	34.1489	r00 r02
2	.	0.8212	0.7019	40.8646	34.8646	r02 r04
3	.	0.8204	0.5509	42.8909	34.8909	r00 r03 r04
2	.	0.8204	0.7006	40.8911	34.8911	r00 r03
2	.	0.8204	0.7006	40.8914	34.8914	r00 r04
1	.	0.8202	0.7752	38.8980	34.8980	r00
2	.	0.8196	0.6993	40.9183	34.9183	r03 r04
1	.	0.8101	0.7627	39.2235	35.2235	r04
1	.	0.7923	0.7404	39.7619	35.7619	r03
1	.	0.5200	0.4000	44.7881	40.7881	r02

Does anyone know what happened to Cp? And why?

SAS prints no warnings or notes about this.

Here's the underlying data and problem design:

I have prices (it doesn't really matter of what) by state (region).

I know the values of these prices one step ahead.

In each state I have a bunch of participants (traders), who have to buy from one or more regions.

Task: I need to find a price which describes the average price for traders.

Skipping the data analysis step: I found that a trader's price is strongly correlated with the state price (which is natural).

Most of the traders can buy in only one state, so there we simply use a predefined linear model.

But some of the traders buy from two or more regions, and I generally can't use the information about which states exactly.

So I decided to use a regression selection algo based on Cp statistic.

And it works great for every trader except one.

Could it be data-specific? (There are no missing values in the input dataset.)

Thanks in advance!

1zmm · Posted 03-28-2013 09:22 AM

The estimate of sigma squared, the error variance, used in the denominator of the first term of Mallows' Cp statistic, is the mean squared error from the full model. Since the five variables in your data form a full model with an R-squared statistic equal to 1.00 (implying a perfect model fit), this mean squared error equals 0. Since division by a denominator equal to 0 does not yield a finite value for the first term of the Cp statistic, SAS does not print this statistic.
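
For reference, this is how the Cp statistic is usually written. As far as I recall it matches the form given in the PROC REG documentation, but treat the notation below as an assumption rather than a quotation:

```latex
C_p = \frac{SSE_p}{s^2} - (N - 2p)
```

Here SSE_p is the error sum of squares of a candidate model with p parameters (including the intercept), s^2 is the mean squared error of the full model, and N is the number of observations. With s^2 = 0 the first term is a division by zero for every candidate model, which is consistent with the whole Cp column being missing.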

Solutions to this problem would be to get more data, use fewer independent variables, or apply a different model/functional form to these data. Such a perfect model fit implies that more data would "break" your model or that the independent variables you selected yield a linear combination that perfectly mimics your dependent variable. For example, you could generate a dependent variable that simply sums various combinations of your five independent variables. This wouldn't be such an informative model.

ghastly_kitten · Posted 03-28-2013 07:20 AM

upd!

Somehow, this problem goes away if I exclude the intercept from model selection.

Cp is then calculated properly and model selection works great!

Should I send all this to support?

ghastly_kitten · Posted 03-28-2013 02:59 PM

Exactly! Somehow I missed that...

Well, actually, an assumption that my regressors are independent would be wrong, since (as follows from the underlying problem) all prices are highly correlated and dependent by design. But getting a perfect combination is just a coincidence.

As I mentioned before, I solved the problem by dropping the intercept, which I should have done from the very beginning...

Thank you very much!

+

and yes... this case was the only one with a short data series (length 6).
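
A rough way to see why the series of length 6 is the one that fails: an intercept plus five regressors uses up all 6 observations, so no error degrees of freedom are left. The sketch below is plain Python rather than SAS. The function names are made up for illustration, the Cp formula is the commonly cited textbook form SSE_p / s^2 - (N - 2p), and the `sse_p` value in the last call is a placeholder, not a number from the data.

```python
from typing import Optional


def error_df(n_obs: int, n_regressors: int, intercept: bool = True) -> int:
    """Residual degrees of freedom: observations minus estimated parameters."""
    return n_obs - n_regressors - (1 if intercept else 0)


def adjusted_r2(r2: float, n_obs: int, n_regressors: int) -> Optional[float]:
    """Adjusted R-square for a model WITH an intercept; None if no error df remain."""
    df = error_df(n_obs, n_regressors, intercept=True)
    if df <= 0:
        return None
    return 1.0 - (1.0 - r2) * (n_obs - 1) / df


def mallows_cp(sse_p: float, n_params: int, n_obs: int,
               mse_full: Optional[float]) -> Optional[float]:
    """Textbook Cp = SSE_p / s^2 - (N - 2p); None when s^2 is missing or zero."""
    if mse_full is None or mse_full == 0:
        return None
    return sse_p / mse_full - (n_obs - 2 * n_params)


N_OBS = 6  # series length reported in this thread

print(error_df(N_OBS, 5, intercept=True))    # full model with intercept
print(error_df(N_OBS, 5, intercept=False))   # full model without intercept
print(adjusted_r2(0.8467, N_OBS, 1))         # r01 alone; the table lists 0.8084
print(adjusted_r2(1.0, N_OBS, 5))            # full model; the table lists "."
# Placeholder SSE with a zero full-model MSE: Cp cannot be computed.
print(mallows_cp(sse_p=1.0, n_params=2, n_obs=N_OBS, mse_full=0.0))
```

The first two calls give 0 and 1 error degrees of freedom. With the intercept, the full model fits the six points perfectly by construction, so there is nothing left to estimate s^2 from. Without the intercept one degree of freedom remains, which would explain why Cp reappears, although a single degree of freedom is still a thin basis for model selection.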
