Hello everyone,
Here is the case.
I have to select the best regression model that fits given data.
I use the selection method based on Mallows' Cp statistic, and it worked fine for every case and piece of data until I caught something strange.
Here is the result (produced by PROC REG).
Number In Model | Cp | R-Square | Adjusted R-Square | AIC | BIC | Variables in model |
---|---|---|---|---|---|---|
5 | . | 1.0000 | . | . | . | r00 r01 r02 r03 r04 |
4 | . | 0.9917 | 0.9583 | 26.4694 | 16.4694 | r00 r02 r03 r04 |
4 | . | 0.9615 | 0.8076 | 35.6473 | 25.6473 | r00 r01 r02 r03 |
4 | . | 0.9549 | 0.7745 | 36.6003 | 26.6003 | r01 r02 r03 r04 |
4 | . | 0.9487 | 0.7437 | 37.3680 | 27.3680 | r00 r01 r03 r04 |
4 | . | 0.9450 | 0.7248 | 37.7931 | 27.7931 | r00 r01 r02 r04 |
3 | . | 0.9438 | 0.8596 | 35.9159 | 27.9159 | r01 r03 r04 |
3 | . | 0.9436 | 0.8590 | 35.9415 | 27.9415 | r00 r01 r04 |
3 | . | 0.9177 | 0.7942 | 38.2105 | 30.2105 | r00 r01 r03 |
3 | . | 0.8965 | 0.7413 | 39.5815 | 31.5815 | r01 r02 r03 |
2 | . | 0.8670 | 0.7783 | 39.0883 | 33.0883 | r01 r03 |
3 | . | 0.8636 | 0.6591 | 41.2374 | 33.2374 | r02 r03 r04 |
3 | . | 0.8574 | 0.6436 | 41.5041 | 33.5041 | r00 r01 r02 |
3 | . | 0.8573 | 0.6433 | 41.5095 | 33.5095 | r00 r02 r03 |
2 | . | 0.8521 | 0.7536 | 39.7233 | 33.7233 | r00 r01 |
3 | . | 0.8470 | 0.6174 | 41.9294 | 33.9294 | r01 r02 r04 |
2 | . | 0.8468 | 0.7447 | 39.9358 | 33.9358 | r01 r04 |
2 | . | 0.8467 | 0.7446 | 39.9384 | 33.9384 | r01 r02 |
1 | . | 0.8467 | 0.8084 | 37.9385 | 33.9385 | r01 |
2 | . | 0.8454 | 0.7424 | 39.9892 | 33.9892 | r02 r03 |
3 | . | 0.8433 | 0.6082 | 42.0732 | 34.0732 | r00 r02 r04 |
2 | . | 0.8413 | 0.7354 | 40.1489 | 34.1489 | r00 r02 |
2 | . | 0.8212 | 0.7019 | 40.8646 | 34.8646 | r02 r04 |
3 | . | 0.8204 | 0.5509 | 42.8909 | 34.8909 | r00 r03 r04 |
2 | . | 0.8204 | 0.7006 | 40.8911 | 34.8911 | r00 r03 |
2 | . | 0.8204 | 0.7006 | 40.8914 | 34.8914 | r00 r04 |
1 | . | 0.8202 | 0.7752 | 38.8980 | 34.8980 | r00 |
2 | . | 0.8196 | 0.6993 | 40.9183 | 34.9183 | r03 r04 |
1 | . | 0.8101 | 0.7627 | 39.2235 | 35.2235 | r04 |
1 | . | 0.7923 | 0.7404 | 39.7619 | 35.7619 | r03 |
1 | . | 0.5200 | 0.4000 | 44.7881 | 40.7881 | r02 |
Does anyone know what happened to Cp, and why?
SAS prints no warnings or notes about it.
Here's the underlying data and problem design:
I have prices (of what doesn't really matter) by state (region).
I know the values of these prices one step ahead.
In each state I have a number of participants (traders), who have to buy from one or more regions.
Task: I need to find a price which describes the average price for traders.
Skipping the data analysis step: I found that a trader's price correlates strongly with the state price (which is natural).
Most of the traders can buy in only one state, so there we simply use a predefined linear model.
But some of the traders buy from two or more regions, and I generally can't use the information about which states exactly.
So I decided to use a regression selection algorithm based on the Cp statistic.
And it works great for every trader except one.
Could it be data specific? (There are no missing values in the input dataset.)
Thanks in advance!
The estimate of sigma squared, the error variance, used in the denominator of the first term of Mallows' Cp statistic, is the mean squared error from the full model. Since the five variables in your data form a full model with an R-squared of 1.00 (implying a perfect fit), the error sum of squares is 0, so this mean squared error is 0 (or undefined, if the full model leaves no error degrees of freedom). Dividing by it does not give a finite value for the first term of Cp, so SAS prints a missing value for the statistic.
Solutions to this problem would be to get more data, use fewer independent variables, or apply a different model or functional form to these data. Such a perfect fit implies either that more data would "break" your model or that the independent variables you selected yield a linear combination that perfectly mimics your dependent variable. For example, you could generate a dependent variable that simply sums various combinations of your five independent variables. That wouldn't be a very informative model.
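To make the mechanism concrete, here is a minimal Python sketch of the textbook formula Cp = SSE_p / s^2 - n + 2p. It is not how PROC REG is implemented internally; the helper names `sse` and `mallows_cp` are hypothetical, and the data are synthetic random numbers standing in for r00 to r04, with six observations assumed purely for illustration.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def mallows_cp(X_sub, X_full, y):
    """Cp = SSE_p / s^2 - n + 2p, with s^2 the full-model MSE.

    Returns None when the full model has no error degrees of freedom,
    because s^2 = SSE_full / (n - p_full) is then undefined.
    """
    n, p_full = X_full.shape
    df_error = n - p_full
    if df_error <= 0:
        return None
    s2 = sse(X_full, y) / df_error
    p = X_sub.shape[1]
    return sse(X_sub, y) / s2 - n + 2 * p

# Synthetic data (assumption): 6 observations, 5 regressors.
rng = np.random.default_rng(0)
n = 6
R = rng.normal(size=(n, 5))   # stand-ins for r00..r04
y = rng.normal(size=n)        # stand-in response

ones = np.ones((n, 1))
full_with_int = np.hstack([ones, R])            # 6 parameters, 6 observations
sub_with_int = np.hstack([ones, R[:, [1, 3]]])  # intercept + r01 + r03

cp_with_intercept = mallows_cp(sub_with_int, full_with_int, y)  # undefined
cp_no_intercept = mallows_cp(R[:, [1, 3]], R, y)                # computable
cp_full_no_intercept = mallows_cp(R, R, y)                      # full model

print(cp_with_intercept, cp_no_intercept, cp_full_no_intercept)
```

With the intercept included, the full model uses as many parameters as there are observations, so the sketch returns `None` for every subset, which mirrors the missing Cp column. Without the intercept one error degree of freedom remains and Cp can be computed; for the full model it equals the number of parameters by construction.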
upd!
Somehow, the problem goes away if I exclude the intercept from model selection.
Cp is then calculated and model selection works great!
Should I send all this to support?
Exactly! Somehow I missed that...
Well, actually the assumption that my regressors are independent would be wrong, since (as follows from the underlying problem) all prices are highly correlated and dependent by design. But hitting a perfect linear combination is just a coincidence.
As I mentioned before, I solved the problem by excluding the intercept, which I should have done from the very beginning...
Thank you very much!
And yes... this case was the only one with a short data series (length 6).
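The series length probably also explains why excluding the intercept helped. A short worked calculation, assuming the standard definition of Cp and assuming that length 6 means six usable observations (here p counts all estimated parameters, including the intercept when present):

```latex
C_p = \frac{SSE_p}{s^2} - n + 2p,
\qquad
s^2 = \frac{SSE_{\mathrm{full}}}{n - p_{\mathrm{full}}}

% With intercept: 5 regressors + 1 intercept = 6 parameters
n - p_{\mathrm{full}} = 6 - 6 = 0
\;\Rightarrow\; s^2 = \tfrac{0}{0} \text{ (undefined)}

% Without intercept: 5 parameters
n - p_{\mathrm{full}} = 6 - 5 = 1
\;\Rightarrow\; s^2 = SSE_{\mathrm{full}} \text{ (defined whenever } SSE_{\mathrm{full}} > 0)
```

So with six observations the full model with an intercept interpolates the data exactly, whatever the regressors are, while the no-intercept model keeps one degree of freedom for the error estimate.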