peterdbr
Fluorite | Level 6
Hi all,

To assess the aptness of several possible subsets for multiple regression, I wanted to use, among other criteria, Mallows' Cp. However, something strange seems to be happening. When I run the following two commands on exactly the same data set, I get different Cp values for the same model. The first command:

proc reg data=model2 ;
model lny = X3-X8 X12-X22/selection=rsquare adjrsq cp press mse sse;
run;
quit;


This generated, as intended, a list of all possible subsets of the X variables, together with the requested selection criteria, such as Cp. I then ran a second command focusing on one particular model:

proc reg data=model2 outest=temp ;
model lny = X3 X5 X12 X14 X15 X16 X17 X19 X20 X22/cp;
run;
quit;


The Cp value that the first command reports for this specific model differs from the one the second command reports. The only cause I can think of is that the two commands use different definitions of Cp, perhaps because of the "selection" option?

Can anyone explain the possible cause of this?
SteveDenham
Jade | Level 19
Just a guess. PROC REG handles missing values such that if any variable needed for any regression is missing, the observation is excluded from all estimates. If you had missing values for some of the independent variables that were NOT included in the final model, then the sample size, and hence Cp, would be different.

Steve Denham
peterdbr
Fluorite | Level 6
Dear Steve,

Thanks for your fast reply.

The data set I used to run both commands is quite "clean" in the sense that no missing values are present: every case has a value for every variable, so I don't think the problem is related to that.

Kind regards,

Peter
Dale
Pyrite | Level 9
From the SAS online documentation, here is the definition of Cp:

      Cp = SSEp / s2 - (N - 2p)

  where s2 (s squared) is the MSE for the full model, and SSEp is the
  sum-of-squares error for a model with p parameters

Since s2 is the MSE for the full model (which includes all candidate variables), changing the set of candidate variables will change the value of s2. Hence, you can expect to get a different value for Mallows' Cp if you change the set of candidate variables.

Now, if your restricted set of candidate variables includes all of the important variables, then the expectation for s2 in the restricted and complete variable sets should be the same. So, you might not see much difference in Mallows' Cp if the restricted variable list contains all of the important predictors. But if the restricted set results in the loss of important predictors, then E(s2) for the restricted variable set will be larger than E(s2) for the full variable set. In that case, Mallows' Cp should go down in the restricted variable set.
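
To make the dependence on s2 concrete, here is a minimal sketch in Python (not SAS) that only evaluates the formula quoted above. All numbers are hypothetical and are not taken from the poster's data. It assumes that p counts the intercept and that, when a single model is fitted without a selection method, that model serves as its own full model, so s2 = SSEp / (N - p). Under that assumption Cp reduces to p, whatever the data.

```python
def mallows_cp(sse_p, s2_full, n, p):
    """Mallows' Cp = SSEp / s2 - (N - 2p), as in the formula quoted above."""
    return sse_p / s2_full - (n - 2 * p)


# Hypothetical values for illustration only.
n = 50        # number of observations
p = 11        # parameters in the subset model (assumed: 10 predictors + intercept)
sse_p = 78.0  # error sum of squares of the subset model

# Case 1: s2 is the MSE of a larger full model containing all candidate variables.
s2_all_candidates = 1.5  # hypothetical full-model MSE
cp_selection = mallows_cp(sse_p, s2_all_candidates, n, p)

# Case 2: the subset model is the only model fitted, so it is its own full model.
s2_own = sse_p / (n - p)
cp_single = mallows_cp(sse_p, s2_own, n, p)

print(cp_selection)  # 24.0
print(cp_single)     # 11.0, which equals p
```

The same subset model with the same SSEp gives two different Cp values purely because the reference s2 differs. That is consistent with the two PROC REG runs in the question disagreeing even though the data are identical.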
shahd
Quartz | Level 8
Could you please explain how to plot Cp versus p?