Hi, I am quite confused about what exactly SAS Enterprise Miner does under different validation settings.
1st case: Suppose you have only one dataset node connected to a Regression node to perform logistic regression. Under Model Selection, you choose Stepwise and set the Selection Criterion to Validation Error, then run this simple model. What exactly is happening with validation here? What percentages are used for the training and validation sets? What kind of validation is used (k-fold, etc.)? Which model is selected?
2nd case: This time, suppose you have a Data Partition node between the dataset and Regression nodes, and you set the partition to 70% train and 30% validate (no test). When you run the model, which partition is used for validation? The partition from the Data Partition node, or whatever validation technique is set in the Regression properties window?
3rd case: Suppose you now have another path from Data Partition to a Neural Network node, and finally a Model Comparison node that both the Regression and Neural Network nodes connect into. Which partition is used for the comparison statistics? What method is used?
I think my main problem is the difference between the validation procedure for a single model type versus the procedure when you have different models. My guess is that for a single model, where there is no comparison with other models, validation just plays a role like testing. Is that true?
I greatly appreciate the help and please let me know if my questions are not clear.
thanks
Prof. Boylu
Your questions are largely addressed in the SAS Enterprise Miner help utility. You can access this utility by opening SAS Enterprise Miner and clicking on Help --> Contents and then navigating in the panel on the left to
Node Reference
Model
Regression Node
From there, click on the link to Regression Node Model Selection Criteria which shares the following information:
In practice, you can select the option, but it is ignored. In your first case, no validation data set is present, so the node simply uses the Stepwise options described in the section just above Regression Node Model Selection Criteria, where it says
This stepwise process continues until one of the following occurs:
In your second scenario, you have a validation data set, so the validation partition created by the Data Partition node is scored and assessed with the model trained in each step of the stepwise selection process. The selected model is identified in the Output window, with a message like the following:
The selected model, based on the error rate for the validation data, is the model trained in Step 3. It consists of the following effects:
If there is no validation data set, it would provide something like the following indicating the last model was selected:
The selected model is the model trained in the last step (Step 7). It consists of the following effects:
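To make the difference between the two messages concrete, here is a minimal Python sketch of the selection rule as I understand it from the above. It is not SAS code and not the node's actual implementation: the function name `select_step`, the error values, and the tie-breaking rule (earliest step wins) are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def select_step(n_steps: int,
                valid_errors: Optional[Sequence[float]] = None) -> int:
    """Return the 1-based stepwise step whose model is kept.

    Hypothetical helper, not part of SAS Enterprise Miner.
    """
    if n_steps < 1:
        raise ValueError("stepwise selection needs at least one step")
    if valid_errors is None:
        # No validation data set: the model from the last step is kept.
        return n_steps
    if len(valid_errors) != n_steps:
        raise ValueError("need one validation error per step")
    # Validation data present: keep the step with the lowest validation
    # error. Ties go to the earliest step (an assumption of this sketch).
    best_index = min(range(n_steps), key=lambda i: valid_errors[i])
    return best_index + 1


# Made-up validation error rates for a seven-step run.
errors = [0.31, 0.27, 0.24, 0.25, 0.26, 0.27, 0.28]
print(select_step(7, errors))  # step chosen with a validation partition
print(select_step(7))          # step chosen without one
```

With these illustrative numbers the first call picks Step 3 and the second picks Step 7, which mirrors the two Output window messages quoted above.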
For your final scenario, you would need to review the help for the Model Comparison node which can be accessed by clicking on Help --> Contents and then navigating in the panel on the left to
Node Reference
Assess Nodes
Model Comparison Node
and then clicking on Model Comparison Node Train Properties: Model Selection Properties which has a great deal of possible outcomes depending on the settings you choose. Here is an excerpt from the help utility:
When Selection Statistic is set to DEFAULT, the average profit statistic from the validation data (_VAPROF_) is used for model selection. If the _VAPROF_ statistic is not present, the average loss statistic from the validation data (_VALOSS_) is used.
If no validation data set is present, the associated training statistic for average profit (_APROF_) or average loss (_ALOSS_) is used.
If no Selection Statistic is specified, the proportion of misclassified data in the validation data set (_VMISC_) is used for model selection. If the _VMISC_ statistic is not present, the average squared error statistic from the validation data set (_VASE_) is used. If no validation data set is present, the associated training statistic for misclassified data (_MISC_) or average squared error (_ASE_) is used.
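The fallback order in that excerpt can be sketched in a few lines of Python. This is only my reading of the help text, not how the Model Comparison node is implemented: the statistic names come from the excerpt, but the helper names, the exact ordering within each chain, and the example numbers are assumptions.

```python
from typing import Dict, List, Tuple

# Fallback chains as read from the help excerpt (ordering is an assumption).
PROFIT_CHAIN = ["_VAPROF_", "_VALOSS_", "_APROF_", "_ALOSS_"]
ERROR_CHAIN = ["_VMISC_", "_VASE_", "_MISC_", "_ASE_"]
# Profit is maximized; loss and error statistics are minimized.
MAXIMIZE = {"_VAPROF_", "_APROF_"}

Stats = Dict[str, Dict[str, float]]


def pick_statistic(stats_by_model: Stats, chain: List[str]) -> str:
    """Return the first statistic in the chain that every model reports."""
    for name in chain:
        if all(name in stats for stats in stats_by_model.values()):
            return name
    raise KeyError("no statistic in the chain is available for all models")


def select_model(stats_by_model: Stats,
                 chain: List[str] = ERROR_CHAIN) -> Tuple[str, str]:
    """Return (winning model, statistic used) under the fallback chain."""
    stat = pick_statistic(stats_by_model, chain)
    chooser = max if stat in MAXIMIZE else min
    winner = chooser(stats_by_model, key=lambda m: stats_by_model[m][stat])
    return winner, stat


# Made-up fit statistics for the two models in the third case.
with_validation = {
    "Regression": {"_VMISC_": 0.21, "_MISC_": 0.19},
    "Neural Network": {"_VMISC_": 0.18, "_MISC_": 0.12},
}
print(select_model(with_validation))
```

The point for your third case is that when the Data Partition node supplies a validation partition, both models are compared on statistics from that same partition, and the training statistics are used only when no validation data exists.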
I hope this helps!
Doug