Calculating Mallows' Cp correctly

Occasional Contributor
Posts: 5

Hi all,

To compare several candidate subsets for a multiple regression, I wanted to use Mallows' Cp, among other criteria. However, something strange is happening: when I run the following two commands on exactly the same data set, I get different Cp values for the same model. The first command:

proc reg data=model2 ;
model lny = X3-X8 X12-X22/selection=rsquare adjrsq cp press mse sse;
run;
quit;

This produced, as intended, a list of all possible subsets of the X variables, together with the requested selection criteria, including Cp. I then ran a second command focusing on one particular model:

proc reg data=model2 outest=temp ;
model lny = X3 X5 X12 X14 X15 X16 X17 X19 X20 X22/cp;
run;
quit;

The Cp value reported for this specific model by the first command differs from the one reported by the second. The only cause I can think of is that the two commands use different definitions of Cp, perhaps because of the SELECTION= option?

Can anyone explain what might cause this?
Posts: 2,655

Re: Calculating Mallows' Cp correctly

Just a guess. PROC REG handles missing values such that if any variable needed for any regression is missing, the observation is excluded from all estimates. If you had missing values for some of the independent variables that were NOT included in the final model, then the sample size, and hence Cp, would be different.
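
To illustrate the idea, here is a minimal sketch in Python rather than SAS. The data and the helper `complete_cases` are invented for the example, and it only mimics the "drop any observation with a missing value in a listed variable" rule described above; it is not how PROC REG is implemented.

```python
# Hypothetical illustration of listwise deletion; the data are made up.
# Each row is one observation; None marks a missing value.
rows = [
    {"lny": 1.2, "X3": 0.5, "X4": 2.0, "X5": 1.1},
    {"lny": 0.7, "X3": 0.9, "X4": None, "X5": 0.3},  # X4 missing
    {"lny": 1.9, "X3": 1.4, "X4": 3.1, "X5": 2.2},
    {"lny": 0.4, "X3": None, "X4": 1.0, "X5": 0.8},  # X3 missing
]


def complete_cases(rows, variables):
    """Count rows with no missing value in any of the listed variables."""
    return sum(all(row[v] is not None for v in variables) for row in rows)


# Sample size when every candidate variable is listed in the model
n_all = complete_cases(rows, ["lny", "X3", "X4", "X5"])

# Sample size when only the variables of one submodel are listed
n_sub = complete_cases(rows, ["lny", "X3", "X5"])

print(n_all, n_sub)
```

In this toy data the row with a missing X4 is dropped only when X4 is among the listed variables, so the two runs would work with different N, and Cp depends on N.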

Steve Denham
Occasional Contributor
Posts: 5

Re: Calculating Mallows' Cp correctly

Dear Steve,

The data set I was using to run both commands is quite "clean" in the sense that no missing values are present: for each case all variables have a specified value, so I don't think the problem is related to that aspect...

Kind regards,

Peter
Regular Contributor
Posts: 171

Re: Calculating Mallows' Cp correctly

From the SAS online documentation, here is the definition of Cp:

Cp = (SSEp / s2) - (N - 2p)

where s2 (= s**2) is the MSE for the full model, SSEp is the error sum of squares for a model with p parameters (including the intercept, if any), and N is the number of observations.

Since s2 is the MSE for the full model (the one that includes all candidate variables), changing the set of candidate variables changes the value of s2. In your first run the candidate set is X3-X8 and X12-X22; in your second run it is only the ten variables listed. Hence, you can expect a different value of Mallows' Cp for the same model when you change the set of candidate variables.
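
A rough numerical sketch of this, in Python rather than SAS and with made-up numbers: `mallows_cp` is a hypothetical helper that simply evaluates the formula quoted above, and the sketch assumes that in the single-model run s2 is the MSE of the listed model itself, i.e. SSEp / (N - p).

```python
def mallows_cp(sse_p, s2, n, p):
    """Evaluate Cp = SSEp / s2 - (N - 2p); p counts the intercept."""
    return sse_p / s2 - (n - 2 * p)


# Made-up values for illustration only (not from the poster's data)
n = 100       # number of observations
p = 11        # 10 predictors plus the intercept
sse_p = 44.5  # error sum of squares of the 10-predictor model

# Run 1: s2 comes from the model with ALL candidate variables (assumed value)
s2_all_candidates = 0.45
cp_selection_run = mallows_cp(sse_p, s2_all_candidates, n, p)

# Run 2: the 10-predictor model is its own "full model",
# so s2 is that model's own MSE
s2_own = sse_p / (n - p)
cp_single_run = mallows_cp(sse_p, s2_own, n, p)

print(cp_selection_run, cp_single_run)
```

Under that assumption the second value is just p (11 here), because SSEp / s2 reduces to N - p. The first value depends on the MSE of the larger candidate set, so the two runs need not agree.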

Now, if your restricted set of candidate variables includes all of the important variables, then the expectation of s2 should be the same for the restricted and the complete variable sets. So you might not see much difference in Mallows' Cp if the restricted variable list contains all of the important predictors. But if the restricted set leaves out important predictors, then E(s2) for the restricted variable set will be larger than E(s2) for the full variable set. In that case, because SSEp is divided by a larger s2, Mallows' Cp should go down when computed against the restricted variable set.