05-21-2017 11:47 PM
When using SAS Enterprise Miner to perform logistic regression on partitioned data, how does EMiner select the "best" model?
Assuming you have partitioned data into Training and Validation data sets (and have selected Validation Misclassification Error as the metric to optimise - ignore test data sets for now), how does EMiner iterate through the two data sets to arrive at the best model?
How does the default methodology compare to or differ from other model training techniques such as k-fold cross validation, and what would be the equivalent methodology if modelling in R?
05-26-2017 10:53 AM
The Model Comparison node in SAS EM reports various model errors (based on a performance metric such as average squared error or misclassification rate) for each of the available data partitions (training, validation and test). If you did not use the validation set during the model-building stage, you can go ahead and use the validation error to compare your models. However, if the validation set was used in the model-building process (e.g. for hyperparameter tuning), its error will be biased downward (like the training error) because of overfitting. In that case, I recommend using your test-set errors (assuming this is the final stage of your modeling) to choose the champion model.
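To make the train/validation/test distinction concrete, here is a minimal sketch in Python (assuming scikit-learn and synthetic data; the split ratios and hyperparameter grid are illustrative, not what EM does internally). The validation set picks the hyperparameter, so its error is optimistic; the untouched test set gives the honest error of the champion.

```python
# Sketch: why a validation set used for tuning gives an optimistic
# error estimate, and why a held-out test set is needed.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# 60/20/20 split into training, validation and test sets
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Tune a hyperparameter (regularisation strength C) on the validation set
best_C, best_val_err = None, float("inf")
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_err = 1 - model.score(X_val, y_val)  # misclassification rate
    if val_err < best_val_err:
        best_C, best_val_err = C, val_err

# The validation error of the chosen model is biased downward because
# it was used for selection; report the test error of the champion instead.
champion = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_err = 1 - champion.score(X_test, y_test)
print(best_C, round(best_val_err, 3), round(test_err, 3))
```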
Cross validation is a more reliable technique for comparing models because, instead of evaluating each model on a single partition of the data, it repeats the single-holdout idea across different folds and then averages the errors. For more information about how to perform cross validation, see my other posts on the topic.
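The fold-and-average idea can be sketched as follows (a minimal example assuming scikit-learn and synthetic data): every observation serves as holdout exactly once, and the per-fold misclassification rates are averaged into a single, more stable estimate.

```python
# Minimal k-fold cross validation: split into 5 folds, hold out one
# fold at a time, fit on the rest, and average the holdout errors.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=1000, n_features=15, random_state=1)

fold_errors = []
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for train_idx, hold_idx in kf.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_errors.append(1 - model.score(X[hold_idx], y[hold_idx]))

cv_error = np.mean(fold_errors)  # average misclassification rate across folds
print(round(cv_error, 3))
```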
05-26-2017 11:07 PM
I should have clarified my question: what I am trying to understand is how, within a specific modelling node (e.g. a Regression node), the node utilises the training and validation data to arrive at the chosen model.
Given the case where I partition my data 60/40 training/validation and no test data, when I pass the data to a modelling node (regression, decision tree, neural network etc), I am guessing EM will use the combination of training and validation data to iteratively select the best model (i.e. training the model on the training data and using the validation data to generalise the model and avoid overfitting - you mention hyperparameter tuning in your reply). This is before the result is sent to a model comparison node to select the best from a range of models.
So my question really is about what goes on with the training and validation data sets within an individual modelling node and how does this differ from other techniques such as cross validation?
05-30-2017 02:01 PM
Suppose you first partitioned your data into training and validation sets by using the Data Partition node, and then connected the Data Partition node to the Regression node available in the Model tab. If you run the Regression node without changing the default selection method (which is "None"), the validation set won't be used at all. However, if you change the model selection method from None to any other selection method (such as stepwise) and also choose Validation Error as the Selection Criterion, then at each step of the selection process the model error will be calculated on the validation data, and the step where the validation error is smallest will be chosen as your final model.
If, instead of Validation Error, you pick Cross Validation as the Selection Criterion, then at each selection step the cross-validation error will be calculated using only the training part of the data. The validation data set is therefore not used in the model selection process at all, so if you choose cross validation you do not need to set aside a separate validation set. I hope it is clearer now.
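The cross-validation variant of the same forward selection can be sketched like this (again assuming scikit-learn; an illustrative analogue, not SAS's implementation). Note that every error is computed entirely within the training data, so no separate validation partition is needed.

```python
# Forward selection with cross-validation error (on training data only)
# as the selection criterion; no validation set is required.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_train, y_train = make_classification(n_samples=1000, n_features=8,
                                        n_informative=3, random_state=3)

selected, remaining = [], list(range(X_train.shape[1]))
step_errors = []  # 5-fold CV error after each forward step

while remaining:
    def cv_err(f):
        # 5-fold CV misclassification rate, computed within the training data
        scores = cross_val_score(LogisticRegression(max_iter=1000),
                                 X_train[:, selected + [f]], y_train, cv=5)
        return 1 - np.mean(scores)

    best_f = min(remaining, key=cv_err)
    step_errors.append(cv_err(best_f))  # record error before entering the feature
    selected.append(best_f)
    remaining.remove(best_f)

# Final model: the step with the smallest cross-validation error.
best_step = min(range(len(step_errors)), key=step_errors.__getitem__)
print(selected[: best_step + 1])
```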