Frances
Calcite | Level 5
Hello Forum,

I am using AIC to rank regression models from PROC REG. It looks like SAS is using an incorrect value for the "K" term (the number of estimable model parameters) in the AIC formula. According to the literature (e.g., D.R. Anderson and K.P. Burnham, "Avoiding pitfalls when using information-theoretic methods," Journal of Wildlife Management 66(3):912-918), when using AIC with least-squares regression, K equals the number of explanatory variables in the model + the intercept + the error term. When I calculate AIC "by hand" and compare it to the SAS value, it looks like SAS is not counting the error term as one of the parameters (essentially using K-1).

Has anyone else noticed this or am I nuts? I'm still using SAS 9.1.3, so maybe this is not an issue in 9.2.

Thanks for any comments,
Frances
deleted_user
Not applicable
It should be computing correctly. Are you accounting for categorical variables and subtracting 1 parameter from the number of levels in each categorical variable (i.e., a variable with m levels contributes m-1 parameters to K)? That is a common mistake.
Frances
Calcite | Level 5
Thanks for your reply, bgdphd.

There are no categorical variables in the models. The models are OLS multiple linear regression models using continuous variables. Consider, for example, the following watershed-scale model:

Dissolved Nitrogen Concentration = intercept + (percent wetland) + (mean slope) + (watershed area) + error term

According to the reference I cited in the original question, K should be equal to the intercept term + the number of explanatory variables + the error term. For the example model above, K = 5; however, SAS appears to be using K = 4.

An example of the code where I request the AIC value looks something like this:
proc reg data=test;
   model nitrogen = wetland slope area / selection=adjrsq aic;
run;

Am I interpreting something wrong? Any comments appreciated.
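
Here is a rough sketch of the by-hand check as a DATA step (a sketch only; the n and SSE values below are placeholders, so substitute the values reported by PROC REG for the model above):

/* Compare PROC REG's AIC (K = regression parameters only) with the            */
/* Burnham & Anderson form (K = regression parameters + error variance).       */
/* n and sse are placeholders -- plug in the values from the PROC REG output.  */
data aic_check;
   n   = 120;      /* number of observations (placeholder)               */
   sse = 45.7;     /* error sum of squares from PROC REG (placeholder)   */
   p   = 4;        /* intercept + wetland + slope + area                 */
   aic_sas = n*log(sse/n) + 2*p;         /* what PROC REG reports            */
   aic_ba  = n*log(sse/n) + 2*(p + 1);   /* with the error variance counted  */
   put aic_sas= aic_ba=;                 /* the two differ by a constant 2   */
run;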
Frances
Calcite | Level 5
Thought I would post an update. After more Google searching, I did find references from others who have noted this problem. Apparently, for multiple linear regression, SAS uses a value of K that is not consistent with the methodology given in Burnham & Anderson. In other cases, such as logistic regression, the computation of AIC is consistent.
It would be great if SAS would remedy this and also provide an option to output AICc in PROC REG.
So, be warned.
For a nice summary, see:
Stafford, J.D., and B.K. Strickland. 2003. Potential inconsistencies when computing Akaike's Information Criterion. Bulletin of the Ecological Society of America 84(2):68-69.
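
In the meantime, here is a minimal DATA step sketch of the small-sample correction, assuming the usual Burnham & Anderson form AICc = AIC + 2K(K+1)/(n - K - 1) (all values below are placeholders):

/* Small-sample corrected AIC (AICc); substitute your own n, K, and AIC. */
data aicc_check;
   n   = 120;     /* sample size (placeholder)                           */
   k   = 5;       /* parameters, including intercept and error variance  */
   aic = -250;    /* AIC computed with that K (placeholder)              */
   aicc = aic + 2*k*(k + 1) / (n - k - 1);
   put aicc=;
run;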
Dale
Pyrite | Level 9
Frances,

I think you might prefer to use the MIXED procedure to fit models and obtain IC statistics. When using the MIXED procedure and estimation via maximum likelihood, AIC = -2LL + 2*(q + p) where q is the number of parameters in the covariance matrix and p is the number of parameters that are estimated as part of the model fixed effects.

Note, though, that if you use the MIXED procedure with REML estimation, the formula is AIC = -2LL + 2q. When estimating models by REML, it is not appropriate to use likelihood-based statistics to compare models that differ in their fixed effects, so the above computation of AIC for REML estimation is appropriate.
deleted_user
Not applicable
Ditto the above. I recommend using METHOD=ML in place of the default REML for parameter estimation, in order to proceed with multi-model inference or comparison of nested candidate models using AIC. In your example, q = 1 (just the residual variance), since you are not modeling any additional covariance structure, and MIXED should compute AIC and the other information criteria correctly.
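
A minimal sketch of what that call might look like for the watershed model above (assuming a dataset named test; the IC option, where your release supports it, prints a table of several information criteria):

/* OLS model fit by maximum likelihood in PROC MIXED */
proc mixed data=test method=ml ic;
   model nitrogen = wetland slope area / solution;
run;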
Frances
Calcite | Level 5
Thanks for the advice bgdphd & Dale,

Now that I'm aware of the AIC calculation discrepancy, I can adjust the value in my code. I am not sure why I would use the MIXED procedure, however. I thought one typically chose LMM and GLMM methods when dealing with both fixed and random effects? We are only dealing with fixed effects in our analysis. Any comments on this?
deleted_user
Not applicable
You can still proceed with MIXED.
deleted_user
Not applicable
The calculation formula for AIC is clearly described in the REG procedure section of the SAS online documentation: AIC = n*log(SSE/n) + 2p, where p is the number of parameters including the intercept. p does not include the error term.
deleted_user
Not applicable
From my limited knowledge of AIC, I do not think SAS has made a careless mistake in its calculation; the formula is correct for least-squares regression. The following link describes the details of AIC:
http://en.wikipedia.org/wiki/Akaike_information_criterion
Dale
Pyrite | Level 9
Frances,

What difference does it really make whether the error variance in an OLS model is included as a parameter when computing AIC? Suppose that you have models 1, 2, and 3, each with a different (non-nested) set of fixed-effect parameters fitted to the same set of observations. These models have error sums of squares SSE{1}, SSE{2}, and SSE{3}, and number of regression parameters p{1}, p{2}, and p{3} (where regression parameters include all beta_hat estimates).

Now, if we use the AIC values presented by PROC REG, we have

AIC{1a} = n * ln( SSE{1} /n ) + 2p{1}
AIC{2a} = n * ln( SSE{2} /n ) + 2p{2}
AIC{3a} = n * ln( SSE{3} /n ) + 2p{3}

Differences between AIC values for these models are:

AIC{1a} - AIC{2a} = n * ( ln( SSE{1}/n ) - ln( SSE{2}/n ) ) + 2(p{1} - p{2})
AIC{1a} - AIC{3a} = n * ( ln( SSE{1}/n ) - ln( SSE{3}/n ) ) + 2(p{1} - p{3})
AIC{2a} - AIC{3a} = n * ( ln( SSE{2}/n ) - ln( SSE{3}/n ) ) + 2(p{2} - p{3})

Alternatively, according to Anderson and Burnham, you would compute

AIC{1b} = n * ln( SSE{1} /n ) + 2(p{1} + 1)
AIC{2b} = n * ln( SSE{2} /n ) + 2(p{2} + 1)
AIC{3b} = n * ln( SSE{3} /n ) + 2(p{3} + 1)

Differences between AIC values for these models are:

AIC{1b} - AIC{2b} = n * ( ln( SSE{1}/n ) - ln( SSE{2}/n ) ) + 2((p{1} + 1) - (p{2} + 1))
                  = n * ( ln( SSE{1}/n ) - ln( SSE{2}/n ) ) + 2(p{1} - p{2})

AIC{1b} - AIC{3b} = n * ( ln( SSE{1}/n ) - ln( SSE{3}/n ) ) + 2((p{1} + 1) - (p{3} + 1))
                  = n * ( ln( SSE{1}/n ) - ln( SSE{3}/n ) ) + 2(p{1} - p{3})

AIC{2b} - AIC{3b} = n * ( ln( SSE{2}/n ) - ln( SSE{3}/n ) ) + 2((p{2} + 1) - (p{3} + 1))
                  = n * ( ln( SSE{2}/n ) - ln( SSE{3}/n ) ) + 2(p{2} - p{3})


So, when you compare AIC values against one another, you obtain the same difference whether or not you include the variance estimate as one of the parameters. Since the difference between AIC values is unchanged either way, it should not matter which form is employed. Can you provide an instance where it would make a difference in model comparisons whether you do or do not include the variance estimate among the parameters?
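
To see this numerically, here is a quick DATA step sketch with made-up placeholder values for two candidate models:

/* AIC differences are identical under both parameter-counting conventions. */
data delta_aic;
   n = 100;
   sse1 = 50; p1 = 4;    /* model 1: error SS and regression parameters (placeholders) */
   sse2 = 55; p2 = 3;    /* model 2: error SS and regression parameters (placeholders) */
   diff_reg = ( n*log(sse1/n) + 2*p1 )       - ( n*log(sse2/n) + 2*p2 );        /* PROC REG form      */
   diff_ba  = ( n*log(sse1/n) + 2*(p1 + 1) ) - ( n*log(sse2/n) + 2*(p2 + 1) );  /* Burnham & Anderson */
   put diff_reg= diff_ba=;   /* identical */
run;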
