DFT and DFM

4Walk · Posted 01-27-2018 03:50 PM

Hi, This is about Fit Statistics reported in SAS EM Linear Regression. I am a little confused about Total Degrees of Freedom (_DFT_) and Model Degrees of Freedom (_DFM_) reported in the Fit Statistics. For Linear Regression, DFT is reported as number of cases and DFM is reported as no. of predictors + 1. I don't understand why these are not n and p respectively. I appreciate clarification. Thank you, Dileep

DougWielenga · Posted 01-31-2018 05:48 PM

@4Walk wrote:

I am a little confused about Total Degrees of Freedom (_DFT_) and Model Degrees of Freedom (_DFM_) reported in the Fit Statistics. For Linear Regression, DFT is reported as number of cases and DFM is reported as no. of predictors + 1. I don't understand why these are not n and p respectively.

In situations where you have

* an interval target

* all interval input variables

* no missing data for any input variable for any observation

your statement would be correct. In a simple ordinary least squares (OLS) regression model, you would have n-1 total degrees of freedom (you lose one for estimating the intercept), and each interval input in the model uses 1 degree of freedom (DF). That leaves n-1-p DF for estimating error (DFE). The DFE are used to compute error estimates, which in turn feed confidence intervals and parameter/effect tests under the usual assumptions.
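The bookkeeping above can be sketched in a few lines of Python. This is only an illustration of the textbook OLS arithmetic, not anything SAS Enterprise Miner runs; the function name and the example values (100 observations, 3 inputs) are made up for the sketch.

```python
def ols_degrees_of_freedom(n_obs: int, n_inputs: int) -> dict:
    """Classical OLS degrees of freedom: intercept plus interval inputs only."""
    if n_obs <= n_inputs + 1:
        raise ValueError("need more complete observations than parameters")
    df_total = n_obs - 1            # one DF is spent on the intercept
    df_model = n_inputs             # one DF per interval input
    df_error = df_total - df_model  # n - 1 - p left for estimating error
    return {"total": df_total, "model": df_model, "error": df_error}


# Hypothetical example: 100 complete observations, 3 interval inputs
print(ols_degrees_of_freedom(100, 3))
# {'total': 99, 'model': 3, 'error': 96}
```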

Unfortunately, this situation almost never happens in data mining problems. There are often binary targets, categorical input variables (each k-level categorical input requires k-1 degrees of freedom in the model), and missing values for certain input variables. Even if there is an interval target, all observations with any missing input value would be ignored in the usual OLS regression model. In data mining problems, taking this approach would often mean ignoring most of your data.
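A small Python sketch of both points, assuming a made-up set of inputs (two interval inputs and one 4-level categorical input) and four made-up rows: it counts the model DF and shows how few rows survive when every row with a missing value is dropped. The variable names and data are hypothetical.

```python
def model_df(levels_per_input: dict) -> int:
    """Model DF excluding the intercept.

    An interval input is entered as None and costs 1 DF;
    a k-level categorical input is entered as k and costs k - 1 DF.
    """
    total = 0
    for levels in levels_per_input.values():
        total += 1 if levels is None else levels - 1
    return total


def complete_cases(rows: list) -> list:
    """Keep only rows with no missing (None) value, as listwise deletion would."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Hypothetical inputs: two interval inputs and one 4-level categorical input
inputs = {"income": None, "age": None, "region": 4}
rows = [
    {"income": 50.0, "age": 30, "region": "N"},
    {"income": None, "age": 41, "region": "S"},
    {"income": 62.5, "age": None, "region": "E"},
    {"income": 48.0, "age": 55, "region": None},
]

print(model_df(inputs))           # 1 + 1 + (4 - 1) = 5
print(len(complete_cases(rows)))  # only the first row is complete: 1
```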

The missing values are typically imputed, meaning that the DFE are overstated and any associated statistical test or calculation, such as confidence bounds, is not as meaningful. That is OK though, since in most data mining scenarios you have sufficient data to validate your model empirically rather than relying solely on error estimates from a single limited data set. Your goal in data mining is typically prediction rather than hypothesis testing. Since data mining data sets are often huge, you would end up finding trivial differences to be highly significant due to the number of observations and the overestimate in the DFE calculated by the usual methods.
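To see how sample size alone drives significance, here is a rough Python sketch using a plain two-sample z-test with a known, equal standard deviation. That test and the numbers (a mean difference of 0.01 on a standard deviation of 1) are assumptions chosen for illustration; they are not taken from SAS Enterprise Miner.

```python
import math


def two_sided_p_value(mean_diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-sample z-test with known, equal sd."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2.0))


# The same trivial difference becomes "significant" once n is large enough
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p_value(mean_diff=0.01, sd=1.0, n_per_group=n))
```

The difference never changes; only the number of observations does, which is why an empirical check on validation data is more informative here than a p-value.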

As a result, you will see some differences in how the Fit Statistics for regression models are computed inside of SAS Enterprise Miner. In order to make different kinds of modeling approaches comparable, SAS Enterprise Miner reports statistics which are not dependent on the modeling approach.

For example, the Fit Statistics report in the modeling node results shows Average Squared Error (ASE) rather than Mean Square Error (MSE). The notion of MSE is appropriate for an OLS regression model, but it is not appropriate when you have imputed missing values. If you have fit other models such as a Tree, there is no parameter being estimated as in regression, so there is no concept of "model" degrees of freedom. The desire to compute the same Fit Statistics across all modeling nodes for a given target variable and data set, regardless of the modeling approach, leads us to use slightly different metrics.
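The difference between the two is only the divisor. The Python sketch below uses the usual definitions, ASE as the sum of squared errors divided by the number of observations and MSE as the same sum divided by the error degrees of freedom; the data and the parameter count of 2 are invented for the example.

```python
def sum_squared_error(actual, predicted):
    """Sum of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def average_squared_error(actual, predicted):
    """ASE: divide by the number of observations; needs no model DF."""
    return sum_squared_error(actual, predicted) / len(actual)


def mean_square_error(actual, predicted, n_parameters):
    """MSE: divide by the error DF (observations minus estimated parameters)."""
    dfe = len(actual) - n_parameters
    if dfe <= 0:
        raise ValueError("no degrees of freedom left for error")
    return sum_squared_error(actual, predicted) / dfe


# Hypothetical values; n_parameters=2 stands for an intercept plus one slope
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.0, 2.5, 4.0, 5.5]
print(average_squared_error(actual, predicted))
print(mean_square_error(actual, predicted, n_parameters=2))
```

Because ASE needs only predictions and actual values, it can be computed the same way for a regression, a tree, or any other model.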

To illustrate, I will use the sample Home Equity data set (SAMPSIO.HMEQ) which can be created by clicking on

Help --> Generate Sample Data Sources....

inside a SAS Enterprise Miner project. This data set has a binary target BAD (0/1). I used a default Data Partition node, which placed 2,382 observations in the training data set. Although this data set has missing values, the Total Degrees of Freedom (_DFT_) is reported as 2,382. Were this an OLS regression, the number would be much smaller since observations with missing values would be ignored completely. I then added a default main effects Regression node, which generates a model. SAS Enterprise Miner reports

_DFT_ : 2,382

_DFM_ : 43 (see the Output for a listing of the 43 parameter estimates)

_DFE_ : 2,339 (equals 2,382 - 43)

SAS Enterprise Miner can use all of the observations in this regression model because it assigns the overall average as the predicted value for the observations which have missing data. In practice, you would want to impute the missing values to be able to use more of your data, but the _DFT_ would remain the same. Imputing, however, could change the model and lead to different values for _DFM_ and _DFE_.
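The fallback idea can be pictured with the Python sketch below. It only illustrates "use the overall average when an input is missing" with a plain linear predictor; it is not SAS Enterprise Miner's actual scoring code, and the coefficients, input names, and target values are hypothetical.

```python
def predict(row, intercept, coefficients, fallback):
    """Linear prediction; returns `fallback` if any model input is missing."""
    if any(row.get(name) is None for name in coefficients):
        return fallback
    return intercept + sum(coef * row[name] for name, coef in coefficients.items())


# Hypothetical training target and coefficients
training_target = [0, 0, 1, 0, 1]
overall_average = sum(training_target) / len(training_target)
coefs = {"x1": 0.1, "x2": -0.05}

# Complete row: scored by the fitted equation
print(predict({"x1": 2.0, "x2": 1.0}, 0.2, coefs, overall_average))
# Row with a missing input: scored with the overall average instead
print(predict({"x1": None, "x2": 1.0}, 0.2, coefs, overall_average))
```

Every row gets a prediction this way, which is why the row count used for the Fit Statistics does not shrink when inputs are missing.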

Either way, these concepts are not useful in the same way they are in OLS regression for the reasons I mentioned above.

Hope this helps!

Doug
