Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Visual Data Mining and Machine Learning or just with programming

Train-Validate-Test


Hi, I'm quite confused about what exactly SAS Enterprise Miner does with different validation settings.

1st case: Suppose you have only a data source node connected to a Regression node to perform logistic regression. Under Model Selection, you set the Selection Criterion for Stepwise to Validation Error and run this simple model. What exactly is happening with validation here? What percentages are used for the training and validation sets? What kind of validation is used (k-fold, etc.)? Which model is being selected?

2nd case: This time suppose you have a Data Partition node between the data source and Regression nodes, and you set the partition to 70% train and 30% validation (no test). Now when you run the model, which partition is used for validation: the one from the Data Partition node, or whatever validation technique is set in the Regression properties window?

3rd case: Now suppose you have another path from the Data Partition node to a Neural Network node, and finally a Model Comparison node that both the Regression and Neural Network nodes connect to. What partition is used for the comparison statistics? What method is being used?

I think my main problem is the difference between the validation procedure for a single model type versus for several competing models. My guess is that for a single model, where there is no comparison with other models, validation just plays a role like testing. Is that true?

I greatly appreciate the help and please let me know if my questions are not clear.

thanks

Prof. Boylu


Accepted Solution

SAS Employee

Re: Train-Validate-Test

Your questions are largely addressed in the SAS Enterprise Miner help utility. You can access it by opening SAS Enterprise Miner, clicking Help --> Contents, and then navigating in the panel on the left to

 

Node Reference

     Model

           Regression Node 

 

From there, click on the link to Regression Node Model Selection Criteria which shares the following information:

 

  • Validation Error — chooses the model that has the smallest error rate for the validation data set. For logistic regression models, the error is the negative log-likelihood. For linear regression, the error is the error sum of squares (SSE). This option is grayed out if a validation predecessor data set is not input to the Regression node.

In reality, you can still select the option, but it is ignored. So in your first case there is no validation data set present, and the node simply applies the Stepwise options, which appear in the section just above Regression Node Model Selection Criteria:

 

  • Stepwise — As in the Forward method, Stepwise selection begins, by default, with no candidate effects in the model and then systematically adds effects that are significantly associated with the target. However, after an effect is added to the model, Stepwise may remove any effect already in the model that is not significantly associated with the target.

    This stepwise process continues until one of the following occurs:

    • No other effect in the model meets the Stay Significance Level.
    • The Max Steps criterion is met. If you choose the Stepwise selection method, then you can specify a Max Steps to put a limit on the number of steps before the effect selection process stops. The default value is set to the number of effects in the model. If you add interactions via the Interaction Builder, the Max Steps is automatically updated to include these terms.
    • An effect added in one step is the only effect deleted in the next step.
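To make those stopping rules concrete, here is a minimal Python sketch of the stepwise control flow. The `p_value` callable is a hypothetical stand-in for the significance tests SAS actually runs; this illustrates the logic quoted above, not SAS's real implementation.

```python
# Illustrative sketch of the Stepwise control flow described above.
# `p_value(current_model, effect)` is a hypothetical stand-in for the
# significance test SAS runs; `entry` and `stay` mirror the Entry and
# Stay Significance Level properties.
def stepwise(effects, p_value, entry=0.05, stay=0.05, max_steps=None):
    model, steps = [], 0
    if max_steps is None:
        max_steps = len(effects)  # default: number of effects
    while steps < max_steps:      # stop: Max Steps criterion met
        # forward pass: add the most significant remaining candidate
        scored = [(p_value(model, e), e) for e in effects if e not in model]
        scored = [(p, e) for p, e in scored if p < entry]
        if not scored:
            break                 # no candidate meets the entry level
        _, best = min(scored)
        model.append(best)
        steps += 1
        # backward pass: drop effects that no longer meet the stay level
        removed = [e for e in list(model)
                   if p_value([x for x in model if x != e], e) >= stay]
        for e in removed:
            model.remove(e)
        if removed == [best]:
            break                 # stop: added effect was the only one removed
    return model
```

With a toy `p_value` that ignores the current model, `stepwise(["a", "b", "c"], lambda m, e: {"a": 0.01, "b": 0.02, "c": 0.5}[e])` returns `["a", "b"]`: only the two effects meeting the entry level are kept.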

 

In your second scenario you do have a validation data set, so the partition created by the Data Partition node is used to score and assess the model trained at each step of the stepwise selection process. The selected model is identified in the Output window and reads something like the following:

The selected model, based on the error rate for the validation data, is the model trained in Step 3. It consists of the following effects:

 

 

If there is no validation data set, the output instead indicates that the model from the last step was selected:

 


The selected model is the model trained in the last step (Step 7). It consists of the following effects:
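That selection rule can be sketched in a few lines: evaluate the model from every stepwise step on the validation partition and keep the one with the smallest validation error; with no validation partition, keep the model from the last step. A minimal illustration (the `validation_error` callable is hypothetical, not a SAS API):

```python
# Illustrative sketch (not SAS's implementation): pick the stepwise step
# whose model has the smallest error on the validation partition; with
# no validation partition, the model from the last step is kept.
def select_step(step_models, validation_error=None):
    if validation_error is None:
        return len(step_models) - 1      # last step is selected
    errors = [validation_error(m) for m in step_models]
    return errors.index(min(errors))     # smallest validation error wins
```

For example, with per-step validation errors of 0.4, 0.2, and 0.3, the second step's model is selected even though a later step fit the training data longer.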

 

For your final scenario, you would need to review the help for the Model Comparison node, which you can access by clicking Help --> Contents and then navigating in the panel on the left to

 

Node Reference

    Assess Nodes

          Model Comparison Node

 

and then clicking on Model Comparison Node Train Properties: Model Selection Properties, which describes a number of possible outcomes depending on the settings you choose. Here is an excerpt from the help utility:

 

  • Selection Statistic — Use the Selection Statistic property of the Model Comparison node to specify the fit statistic that you want to use to select the model. Depending on the availability, different fit statistics are used.

    When Selection Statistic is set to DEFAULT, the average profit statistic from the validation data (_VAPROF_) is used for model selection. If the _VAPROF_ statistic is not present, the average loss statistic from the validation data (_VALOSS_) is used.

    If no validation data set is present, the associated training statistic for average profit (_APROF_) or average loss (_ALOSS_) is used.

    If no Selection Statistic is specified, the proportion of misclassified data in the validation data set (_VMISC_) is used for model selection. If the _VMISC_ statistic is not present, the average squared error statistic from the validation data set (_VASE_) is used. If no validation data set is present, the associated training statistic for misclassified data (_MISC_) or average squared error (_ASE_) is used.
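One way to read that precedence is as a simple fallback chain over whichever fit statistics the models produced. The sketch below assumes the profit/loss statistics are only present when a profit/loss matrix was defined; it illustrates the order described in the excerpt, not SAS's actual code.

```python
# Illustrative fallback chain for the DEFAULT Selection Statistic, read
# from the help text above. `available` is the set of fit-statistic
# names the models actually produced (mocked here for illustration).
def default_selection_statistic(available, has_validation):
    if has_validation:
        order = ["_VAPROF_", "_VALOSS_", "_VMISC_", "_VASE_"]
    else:
        order = ["_APROF_", "_ALOSS_", "_MISC_", "_ASE_"]
    for name in order:
        if name in available:
            return name           # first available statistic wins
    return None
```

So with a validation partition but no profit matrix, selection falls through to the validation misclassification rate (_VMISC_), which matches the behavior most users see by default.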

 

I hope this helps!

Doug

 

 



