Calculating Mallows' Cp correctly

06-16-2011 06:19 PM

Hi all,

To compare several candidate subsets for a multiple regression, I want to use, among other criteria, Mallows' Cp. However, something strange is happening: when I run the following two commands on exactly the same data set, I get different Cp values for the same model. The first command:

proc reg data=model2;

model lny = X3-X8 X12-X22/selection=rsquare adjrsq cp press mse sse;

run;

quit;

This produced, as intended, a list of all possible subsets of the X variables together with the requested selection criteria, including Cp. I then ran a second command focusing on one particular model:

proc reg data=model2 outest=temp;

model lny = X3 X5 X12 X14 X15 X16 X17 X19 X20 X22/cp;

run;

quit;

The Cp value for this specific model from the first command differs from the one from the second command. The only cause I can think of is that the two commands use different definitions of Cp, perhaps because of the SELECTION= option?

Can anyone explain the possible cause of this?



06-17-2011 07:31 AM

Just a guess. PROC REG handles missing values such that if any variable needed for any regression is missing, the observation is excluded from all estimates. If you had missing values for some of the independent variables that were NOT included in the final model, then the sample size, and hence Cp, would be different.

Steve Denham



06-17-2011 09:06 AM

Dear Steve,

Thanks for your fast reply.

The data set I was using to run both commands is quite "clean" in the sense that no missing values are present: for each case all variables have a specified value, so I don't think the problem is related to that aspect...

Kind regards,

Peter



06-22-2011 04:25 PM

From the SAS online documentation, here is the definition of Cp:

Cp = SSEp / s2 - (N - 2p)

where s2 (= s**2) is the MSE for the full model, and SSEp is the sum-of-squares error for a model with p parameters.

Since s2 is the MSE for the full model (which includes all candidate variables), changing the set of candidate variables changes the value of s2. Hence, you can expect a different value of Mallows' Cp for the same model if you change the set of candidate variables.
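One consequence worth spelling out is what happens when the model being scored is itself the full model. The short derivation below follows from the definition above; P, the number of parameters in the full model, is my own notation and does not appear in the documentation excerpt. In that case s2 = SSE_P / (N - P), and Cp collapses to P whatever the data are:

```latex
\begin{align*}
C_P &= \frac{SSE_P}{s^2} - (N - 2P), \qquad s^2 = \frac{SSE_P}{N - P} \\
    &= (N - P) - (N - 2P) \\
    &= P
\end{align*}
```

In the second PROC REG call the 10 listed regressors are the only candidates, so that model is the full model. If the definition is applied as written, with the intercept counted in p, its Cp should come out as 11 (10 slopes plus the intercept).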

Now, if your restricted set of candidate variables includes all of the important variables, then the expectation of s2 should be the same for the restricted and the complete variable sets. So you might not see much difference in Mallows' Cp if the restricted variable list contains all of the important predictors. But if the restricted set loses important predictors, then E(s2) for the restricted variable set will be larger than E(s2) for the full variable set. In that case, because s2 is the denominator of the first term, Mallows' Cp for a given model should go down when it is computed against the restricted variable set.
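To make the effect concrete, here is a minimal Python sketch of the arithmetic. It is not SAS code and not what PROC REG does internally. The function name `mallows_cp` and every number in it (N, the SSE values) are made up for illustration; only the formula and the variable counts (17 candidates in the first command, 10 in the second) come from the thread.

```python
def mallows_cp(sse_p, s2_full, n, p):
    """Cp = SSE_p / s2 - (N - 2p), using the definition quoted above."""
    return sse_p / s2_full - (n - 2 * p)


# Hypothetical values, chosen only to illustrate the arithmetic.
n = 100          # number of observations
p = 11           # 10 regressors + intercept in the model of interest
sse_p = 53.4     # SSE of the 10-regressor model

# First command: the full model holds all 17 candidates (X3-X8, X12-X22)
# plus the intercept, i.e. 18 parameters. Assume its SSE is 41.0.
s2_first = 41.0 / (n - 18)

# Second command: the 10-regressor model is itself the full model.
s2_second = sse_p / (n - p)

cp_first = mallows_cp(sse_p, s2_first, n, p)
cp_second = mallows_cp(sse_p, s2_second, n, p)

print(f"s2 from 17-candidate full model: {s2_first:.3f}, Cp = {cp_first:.2f}")
print(f"s2 from 10-candidate full model: {s2_second:.3f}, Cp = {cp_second:.2f}")
```

With these invented numbers the same model, with the same SSE, gets a Cp of roughly 28.8 against the larger candidate set and roughly 11 (that is, p) against the smaller one. Only s2 differs between the two calls.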
