To assess several candidate subsets for a multiple regression, I wanted to use, among other criteria, Mallows' Cp. However, something strange seems to be happening: when I run the following two commands on exactly the same data set, I get different Cp values for the same model. The first command:
This generated, as intended, a list of all possible subsets of the X variables, together with the requested selection criteria, including Cp. I then ran a second command focusing on one particular model:
The Cp value that the first command reports for this specific model differs from the one the second command reports. The only cause I can think of is that the two commands use different definitions of Cp, perhaps because of the "selection" statement?
Just a guess. PROC REG handles missing values such that if any variable needed for any regression is missing, the observation is excluded from all estimates. If you had missing values for some of the independent variables that were NOT included in the final model, then the sample size, and hence Cp, would be different.
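To illustrate the point about missing values, here is a minimal Python sketch (not SAS, and not PROC REG's actual implementation). The rows and variable names are made up for illustration; the only idea shown is that the number of usable observations depends on which variables are listed, not only on which ones end up in the model.

```python
# Hypothetical data: None marks a missing value.
# x3 is a candidate variable that is NOT part of the final model.
rows = [
    {"y": 1.0, "x1": 2.0, "x2": 0.5, "x3": 1.0},
    {"y": 2.0, "x1": 1.0, "x2": 1.5, "x3": None},
    {"y": 3.0, "x1": 4.0, "x2": 2.5, "x3": 3.0},
    {"y": 4.0, "x1": 3.0, "x2": None, "x3": 2.0},
]


def complete_cases(rows, variables):
    """Count rows with no missing value in any of the named variables."""
    return sum(all(r[v] is not None for v in variables) for r in rows)


# All candidate variables listed: a row is dropped if ANY of them is missing.
n_all = complete_cases(rows, ["y", "x1", "x2", "x3"])

# Only the variables of the final model listed.
n_model = complete_cases(rows, ["y", "x1", "x2"])

print(n_all, n_model)
```

With these made-up rows the two counts differ, so N in the Cp formula would differ as well.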
The data set I was using to run both commands is quite "clean" in the sense that no missing values are present: for each case all variables have a specified value, so I don't think the problem is related to that aspect...
From the SAS online documentation, here is the definition of Cp:
Cp = (SSEp / s2) - (N - 2p)
where s2 (= s**2) is the MSE for the full model, SSEp is the
error sum of squares for a model with p parameters, and N is the
number of observations
Since s2 is the MSE for the full model (the one that includes all candidate variables), changing the set of candidate variables changes the value of s2. Hence, you can expect to get a different value for Mallows' Cp for the same model if you change the set of candidate variables.
Now, if your restricted set of candidate variables includes all of the important variables, then the expectation of s2 should be the same for the restricted and the complete variable sets. So, you might not see much difference in Mallows' Cp if the restricted variable list contains all of the important predictors. But if the restricted set loses important predictors, then E(s2) for the restricted variable set will be larger than E(s2) for the full variable set. Because s2 is the denominator of SSEp/s2, while SSEp, N and p are unchanged for a given model, Mallows' Cp for that model should go down in the restricted variable set.
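A small worked example may make this concrete. The sketch below is in Python rather than SAS, and every number in it is hypothetical; it only plugs values into the Cp definition quoted above to show that the same model (same SSEp, N and p) gets a different Cp when s2 comes from a different candidate list.

```python
def mallows_cp(sse_p, s2, n, p):
    """Cp = SSEp / s2 - (N - 2p), where s2 is the MSE of the model
    containing all candidate variables."""
    return sse_p / s2 - (n - 2 * p)


# Hypothetical values for ONE fixed model: they do not depend on the
# candidate list.
sse_p, n, p = 120.0, 50, 3

# s2 from the full candidate list (assumed value).
cp_full_list = mallows_cp(sse_p, s2=2.0, n=n, p=p)

# s2 from a shorter candidate list that lost an important predictor,
# so its MSE is larger (assumed value).
cp_short_list = mallows_cp(sse_p, s2=2.5, n=n, p=p)

print(cp_full_list, cp_short_list)  # 16.0 4.0
```

The larger s2 from the shorter list pulls Cp down from 16 to 4 even though the model itself has not changed.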