ghastly_kitten
Fluorite | Level 6

Hello everyone,

Here is the case.

I have to select the best regression model that fits given data.

I use a selection method based on Mallows' Cp statistic, and it worked fine for every case and every piece of data until I caught something strange.

Here is the result (produced by PROC REG).

         

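| Number in Model | Cp | R-Square | Adjusted R-Square | AIC | BIC | Variables in model |
|---|---|---|---|---|---|---|
| 5 | . | 1.0000 | . | . | . | r00 r01 r02 r03 r04 |
| 4 | . | 0.9917 | 0.9583 | 26.4694 | 16.4694 | r00 r02 r03 r04 |
| 4 | . | 0.9615 | 0.8076 | 35.6473 | 25.6473 | r00 r01 r02 r03 |
| 4 | . | 0.9549 | 0.7745 | 36.6003 | 26.6003 | r01 r02 r03 r04 |
| 4 | . | 0.9487 | 0.7437 | 37.3680 | 27.3680 | r00 r01 r03 r04 |
| 4 | . | 0.9450 | 0.7248 | 37.7931 | 27.7931 | r00 r01 r02 r04 |
| 3 | . | 0.9438 | 0.8596 | 35.9159 | 27.9159 | r01 r03 r04 |
| 3 | . | 0.9436 | 0.8590 | 35.9415 | 27.9415 | r00 r01 r04 |
| 3 | . | 0.9177 | 0.7942 | 38.2105 | 30.2105 | r00 r01 r03 |
| 3 | . | 0.8965 | 0.7413 | 39.5815 | 31.5815 | r01 r02 r03 |
| 2 | . | 0.8670 | 0.7783 | 39.0883 | 33.0883 | r01 r03 |
| 3 | . | 0.8636 | 0.6591 | 41.2374 | 33.2374 | r02 r03 r04 |
| 3 | . | 0.8574 | 0.6436 | 41.5041 | 33.5041 | r00 r01 r02 |
| 3 | . | 0.8573 | 0.6433 | 41.5095 | 33.5095 | r00 r02 r03 |
| 2 | . | 0.8521 | 0.7536 | 39.7233 | 33.7233 | r00 r01 |
| 3 | . | 0.8470 | 0.6174 | 41.9294 | 33.9294 | r01 r02 r04 |
| 2 | . | 0.8468 | 0.7447 | 39.9358 | 33.9358 | r01 r04 |
| 2 | . | 0.8467 | 0.7446 | 39.9384 | 33.9384 | r01 r02 |
| 1 | . | 0.8467 | 0.8084 | 37.9385 | 33.9385 | r01 |
| 2 | . | 0.8454 | 0.7424 | 39.9892 | 33.9892 | r02 r03 |
| 3 | . | 0.8433 | 0.6082 | 42.0732 | 34.0732 | r00 r02 r04 |
| 2 | . | 0.8413 | 0.7354 | 40.1489 | 34.1489 | r00 r02 |
| 2 | . | 0.8212 | 0.7019 | 40.8646 | 34.8646 | r02 r04 |
| 3 | . | 0.8204 | 0.5509 | 42.8909 | 34.8909 | r00 r03 r04 |
| 2 | . | 0.8204 | 0.7006 | 40.8911 | 34.8911 | r00 r03 |
| 2 | . | 0.8204 | 0.7006 | 40.8914 | 34.8914 | r00 r04 |
| 1 | . | 0.8202 | 0.7752 | 38.8980 | 34.8980 | r00 |
| 2 | . | 0.8196 | 0.6993 | 40.9183 | 34.9183 | r03 r04 |
| 1 | . | 0.8101 | 0.7627 | 39.2235 | 35.2235 | r04 |
| 1 | . | 0.7923 | 0.7404 | 39.7619 | 35.7619 | r03 |
| 1 | . | 0.5200 | 0.4000 | 44.7881 | 40.7881 | r02 |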

Does anyone know what happened to Cp? And why?

SAS prints no warnings or notes about it.

Here's the underlying data and problem design:

I have prices (it doesn't really matter of what) by states (regions).

I know the values of these prices one step ahead.

In each state I have a bunch of participants (traders), each of whom has to buy from one or more regions.

Task: I need to find a price which describes the average price for traders.

Skipping the data analysis step: I found that trader's price has a strong correlation with a state price (which is natural).

Most of the traders can buy only in one state - so here we simply use predefined linear model.

But some of the traders buy from two or more regions, and I generally can't use the information about exactly which states.

So I decided to use a regression selection algo based on Cp statistic.

And it works great for every trader except one.

Could it be data-specific? (There are no missing values in the input dataset.)

Thanks in advance!

Accepted Solutions
1zmm
Quartz | Level 8

The estimate of sigma squared, the error variance, used in the denominator of the first term of Mallows' Cp statistic, is the mean squared error from the full model. Since the five variables in your data form a full model with an R-squared statistic equal to 1.00 (implying a perfect model fit), this mean squared error equals 0. Since division by a denominator of 0 yields an infinite estimate for the first term of the Cp statistic, SAS does not print this statistic.
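For reference, the statistic described above can be written out using the usual textbook definition. The symbols here are notation introduced for this note, not taken from the post: n is the number of observations, p the number of parameters in the candidate model (including the intercept), and k the number of parameters in the full model (including the intercept).

```latex
C_p = \frac{\mathrm{SSE}_p}{\hat{\sigma}^2} - (n - 2p),
\qquad
\hat{\sigma}^2 = \mathrm{MSE}_{\text{full}} = \frac{\mathrm{SSE}_{\text{full}}}{n - k}
```

When the full model fits perfectly, SSE_full = 0, so the estimate of sigma squared is 0 (or 0/0 if there are also no error degrees of freedom, n = k), and the first term cannot be evaluated for any candidate model.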

Solutions to this problem would be to get more data, use fewer independent variables, or apply a different model/functional form to these data. Such a perfect model fit implies that more data would "break" your model or that the independent variables you selected yield a linear combination that perfectly mimics your dependent variable. For example, you could generate a dependent variable that simply sums various combinations of your five independent variables. This wouldn't be a very informative model.
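The point about a perfect fit can be illustrated with a small sketch in Python (not SAS, and not how PROC REG works internally). When the number of fitted parameters equals the number of observations and the design matrix is invertible, least squares reduces to solving a square system, so the residuals are zero whatever the responses are. The data below are made up, and `solve` is a hypothetical helper written only for this example.

```python
from fractions import Fraction


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination (exact with Fractions)."""
    n = len(b)
    # Build the augmented matrix [a | b] with exact rational entries.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a non-zero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        # Scale the pivot row so the pivot becomes 1.
        m[col] = [v / m[col][col] for v in m[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n] for row in m]


# Made-up data: 3 observations, 2 regressors plus an intercept (3 parameters).
x = [[1, 2, 5], [1, 3, 1], [1, 7, 4]]   # first column is the intercept
y = [10, -4, 8]                          # arbitrary responses

beta = solve(x, y)
residuals = [yi - sum(b * v for b, v in zip(beta, row)) for row, yi in zip(x, y)]
sse = sum(r * r for r in residuals)
print(sse)  # 0
```

Changing the values in `y` leaves the error sum of squares at zero, which is why such a fit says little about the data.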

3 REPLIES
ghastly_kitten
Fluorite | Level 6

upd!


Somehow, this problem goes away if I exclude the intercept from model selection.

Cp is calculated well and model selection works great!


Should I send all this to support?


ghastly_kitten
Fluorite | Level 6

Exactly! Somehow I missed that...

Well, actually the assumption that my regressors are independent would be wrong, since (as follows from the underlying problem) all prices are highly correlated and dependent by design. But getting the perfect combination is just a coincidence.

As I mentioned before, I solved the problem by excluding the intercept, which I should have done from the very beginning...

Thank you very much!

+

and yes... this case was the only one with a short data series (length 6).
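A minimal Python sketch (not SAS) of the degrees-of-freedom arithmetic behind this thread: with a series of length 6, an intercept plus five regressors uses 6 parameters and leaves 0 error degrees of freedom, while dropping the intercept leaves 1. The function names and the SSE values in the last two calls are hypothetical and serve only to show both branches.

```python
from typing import Optional


def error_df(n_obs: int, n_regressors: int, intercept: bool = True) -> int:
    """Error degrees of freedom: observations minus fitted parameters."""
    return n_obs - (n_regressors + (1 if intercept else 0))


def mallows_cp(sse_sub: float, p_sub: int, sse_full: float,
               df_full: int, n_obs: int) -> Optional[float]:
    """Cp = SSE_p / MSE_full - (n - 2p); None when MSE_full is unusable."""
    if df_full <= 0 or sse_full <= 0.0:
        return None  # zero error df or a perfect fit: the ratio is undefined
    mse_full = sse_full / df_full
    return sse_sub / mse_full - (n_obs - 2 * p_sub)


# Series of length 6 with five regressors, as in this thread.
print(error_df(6, 5, intercept=True))    # 0
print(error_df(6, 5, intercept=False))   # 1

# Hypothetical SSE values, only to exercise both branches.
print(mallows_cp(sse_sub=2.0, p_sub=3, sse_full=0.0, df_full=0, n_obs=6))   # None
print(mallows_cp(sse_sub=2.0, p_sub=3, sse_full=0.5, df_full=1, n_obs=6))   # 4.0
```

Returning `None` mirrors the missing value (`.`) in the Cp column of the output above; it is a design choice for this sketch, not a description of SAS internals.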
