pvareschi
Quartz | Level 8

Re: Applied Analytics Using SAS Enterprise Miner

Would it be possible to clarify how ASE (Average Squared Error) is calculated? Its definition is given at page 3-72 of the course notes.
I am asking because, looking at the output from any modelling node, the denominator appears to be based on the total number of cases in the whole sample (Training + Validation), not just the Training or Validation data set (see the image at page 3-89 for the output from the Decision Tree node; the same applies to the Regression node, see the example at page 4-43).
Moreover, in the output from a Regression node, the Mean Square Error (MSE) should be calculated as the Sum of Squared Errors (SSE) divided by the Degrees of Freedom for Error (DFE); however, that does not seem to be the case. Here is a screenshot based on the model fitted at page 4-42 of the course notes:
[Screenshot: fit_statistics.png]

3 REPLIES
gcjfernandez
SAS Employee

Re: Applied Analytics Using SAS Enterprise Miner

Would it be possible to clarify how ASE (Average Squared Error) is calculated? Its definition is given at page 3-72 of the course notes.
I am asking because, looking at the output from any modelling node, the denominator appears to be based on the total number of cases in the whole sample (Training + Validation), not just the Training or Validation data set (see the image at page 3-89 for the output from the Decision Tree node; the same applies to the Regression node, see the example at page 4-43).
My Answer:
When the target variable is interval, the denominator for ASE is N (the Training or Validation sample size). Please see course PDF page 3-72.

When the target variable is binary, the denominator for ASE is 2N (N cases times 2 levels: event and non-event). Please see course PDF page 3-72.

In the demo data the split between Training and Validation is 50:50, so 2N happens to equal the combined Training + Validation sample size. That is why the denominator appears to be the whole sample.

In fact, for the training data, ASE = SSE/(2N), where N is the Training sample size.
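The two denominators can be illustrated with a small Python sketch. This is only an illustration of the rule described in this reply, not code from SAS Enterprise Miner; the function names `ase_interval` and `ase_binary` and the toy data are hypothetical.

```python
def ase_interval(actual, predicted):
    """ASE for an interval target: SSE divided by N."""
    sse = sum((y - yhat) ** 2 for y, yhat in zip(actual, predicted))
    return sse / len(actual)


def ase_binary(actual, p_event):
    """ASE for a binary target as described above: squared errors for
    both target levels are summed, then divided by 2N."""
    sse = 0.0
    for y, p in zip(actual, p_event):
        sse += (y - p) ** 2              # residual for the event level
        sse += ((1 - y) - (1 - p)) ** 2  # residual for the non-event level
    return sse / (2 * len(actual))


# Toy data (hypothetical): 4 cases, so the binary denominator is 2 * 4 = 8.
y = [1, 0, 1, 0]
p = [0.5, 0.5, 1.0, 0.0]
print(ase_binary(y, p))
print(ase_interval([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

With N cases in the partition, the binary version divides by 2N, which matches the Training + Validation total only when the two partitions are the same size.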

Moreover, in the output from a Regression node, the Mean Square Error (MSE) should be calculated as the Sum of Squared Errors (SSE) divided by the Degrees of Freedom for Error (DFE); however, that does not seem to be the case. Here is a screenshot based on the model fitted at page 4-42 of the course notes:

My Answer:

SAS Enterprise Miner does not use DFE when computing MSE for the training and validation data; it uses N as the denominator. A Decision Tree or Neural Network has no model degrees of freedom, and therefore no error degrees of freedom. Similarly, no model is fitted to the validation data. To keep the error estimate comparable across the Decision Tree, Regression, and Neural Network nodes, N is used as the denominator in MSE.
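To make the difference concrete, here is a minimal Python sketch comparing SSE/N with the textbook SSE/DFE, taking DFE = N minus the number of estimated parameters. The function name `error_summaries`, the `n_params` argument, and the numbers are hypothetical and not taken from the course data.

```python
def error_summaries(actual, predicted, n_params):
    """Return SSE with both candidate denominators.

    n_params is the number of estimated model parameters (including the
    intercept), so DFE = N - n_params. This is an assumption for the
    illustration; tree and neural network models have no such count.
    """
    n = len(actual)
    sse = sum((y - yhat) ** 2 for y, yhat in zip(actual, predicted))
    return {
        "SSE": sse,
        "SSE/N": sse / n,                 # denominator described in this reply
        "SSE/DFE": sse / (n - n_params),  # classical regression MSE
    }


# Hypothetical fit: 5 cases, 2 parameters (intercept + one slope).
stats = error_summaries([1, 2, 3, 4, 5], [1.5, 2, 2.5, 4, 5.5], n_params=2)
print(stats)
```

SSE/N can be computed for any model and any partition, while SSE/DFE needs a parameter count that only the regression fit on training data provides.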

 
pvareschi
Quartz | Level 8

Thank you for your explanations.

Based on the formula at page 3-72, is it correct to say that the numerator part of ASE takes account of both predicted probabilities, primary and secondary class, for a given case Yi? (i.e. essentially counting "residuals" twice)

Is that expression used to calculate the SSE value for the Regression Node as well? In that case, it would be different from the "classic" definition where only differences against the predicted probability of the primary event are considered.

Last point: in the course "Predictive Modeling Using Logistic Regression", at page B-11 of the course notes, the way ASE is calculated within the macro %ASSESS is not based on the expression from page 3-72 (likewise, SSE is just the sum of the squared differences between observed and fitted values). Would the two approaches yield different numerical results, or would they be equivalent for a binary outcome?

gcjfernandez
SAS Employee

Thank you for your explanations.

Based on the formula at page 3-72, is it correct to say that the numerator part of ASE takes account of both predicted probabilities, primary and secondary class, for a given case Yi? (i.e. essentially counting "residuals" twice)

Yes, that is correct.

Is that expression used to calculate the SSE value for the Regression Node as well? In that case, it would be different from the "classic" definition where only differences against the predicted probability of the primary event are considered.

The Regression node can fit both multiple linear regression (MLR, interval target) and binary logistic regression (BLR). For BLR, the formula above is correct for ASE.

For MLR, ASE = SSE/N.

 

Last point: in the course "Predictive Modeling Using Logistic Regression", at page B-11 of the course notes, the way ASE is calculated within the macro %ASSESS is not based on the expression from page 3-72 (likewise, SSE is just the sum of the squared differences between observed and fitted values). Would the two approaches yield different numerical results, or would they be equivalent for a binary outcome?

Yes, you are correct. When using PROC LOGISTIC to fit a BLR, we define the target level (whether we are modeling 1 or 0), whereas in the SAS Enterprise Miner Regression node we do not define whether to model 1 or 0. Therefore, both predicted probabilities are included in the ASE computation.
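For readers comparing the numbers, the following short derivation may help. It assumes the binary ASE is as described in this thread (squared errors for both target levels, divided by 2N) and that %ASSESS computes SSE/N from the event probability alone, as the question states. The symbols y_i (observed 0/1 target) and p_i (predicted event probability) are notation introduced here.

```latex
% Per-case squared error over both target levels:
(y_i - p_i)^2 + \bigl((1 - y_i) - (1 - p_i)\bigr)^2 = 2\,(y_i - p_i)^2

% Hence the two-level SSE is twice the single-level SSE,
% and dividing by 2N gives the same average:
\mathrm{ASE}
  = \frac{1}{2N} \sum_{i=1}^{N} 2\,(y_i - p_i)^2
  = \frac{1}{N} \sum_{i=1}^{N} (y_i - p_i)^2
```

Under these assumptions the reported SSE values differ by a factor of 2, while the ASE values coincide for a binary outcome.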

 
