Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Visual Data Mining and Machine Learning or just with programming

Sampling Node

Contributor
Posts: 35
Hi there

 

I have been using SAS Enterprise Miner for a while, with a Sampling Node right after my Input Node to perform under-sampling (for now). The Sampling Node is followed by my Data Partition Node and some Transform Variables / Impute Nodes. I have a few questions:

1. What would be the effect of performing the sampling after having worked on the variables, e.g. having the Sampling Node after the Impute Node?

2. If I do as in 1, I believe the Data Partition Node should be moved accordingly as well, so that the Data Partition Node does not sit right after the Input Node and perform the partition on the unbalanced data?

 

Many thanks

Nicolas

 


Accepted Solutions
Solution
a week ago
SAS Employee
Posts: 226

Re: Sampling Node

1. What would be the effect of performing the sampling after having worked on the variables, e.g. having the Sampling Node after the Impute Node?

 

The Data Partition node splits the raw data into Training, Validation, and (if desired) Test data sets. The Training data is used to build candidate models, the Validation data is typically used to compare competing candidate models, and the Test data is intended to give an unbiased estimate of performance, provided it has not been used for building or choosing the final model. If you use the Test data for choosing models, it really represents a second Validation data set rather than a Test data set, since it can no longer produce an unbiased estimate of model performance. The Training and Validation data sets are the most critical in this approach.
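To make the three roles concrete, here is a minimal sketch of a simple random partition in plain Python. It is not what the Data Partition node runs internally; the function name `partition`, the 60/30/10 proportions, and the seed are all assumptions chosen for illustration.

```python
import random


def partition(rows, train=0.6, validate=0.3, seed=12345):
    """Randomly split rows into (training, validation, test) lists.

    The proportions and seed are illustrative assumptions; whatever is
    left after training and validation becomes the test set.
    """
    if train < 0 or validate < 0 or train + validate > 1:
        raise ValueError("train and validate must be non-negative and sum to at most 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validate)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


# Hypothetical data: 100 observation ids.
train_rows, valid_rows, test_rows = partition(range(100))
```

Every observation lands in exactly one of the three sets, which is what lets the Validation and Test sets act as data the model has not seen.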

Sampling following the Impute node is not problematic when the only values being imputed are essentially known even though they are not present in the training data. For example, certain coding approaches involve assigning a 1 when a condition is met but making no assignment when the condition is not met. This leaves the observations with either a 1 or a missing value, which is stored as a dot (.) for numeric variables in a SAS data set. These missing values can reasonably be assumed to be 0 since they are not a 1. Similarly, a non-profit that records donations might have a missing value for how much a donor gave in a certain month. This can reasonably be assumed to be $0.00 in this situation.

If you are using other methods to "guess" at the value, such as mean imputation or tree imputation, doing so before partitioning effectively uses some of your Validation data (which should be used for comparing models) as Training observations (used for building the model), since the imputation process is part of the model building. It is not 'wrong' per se, but it does potentially taint the usefulness of your Validation data set. It would be better not to take this approach in most situations.
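The difference between the two kinds of imputation can be sketched in plain Python. This is only an illustration, not Enterprise Miner code: the helper names (`fill_known_zero`, `fit_mean`, `apply_fill`) and the income values are hypothetical, and `None` stands in for a SAS missing value.

```python
from statistics import mean


def fill_known_zero(values):
    """Replace missing values whose true value is known to be 0."""
    return [0.0 if v is None else v for v in values]


def fit_mean(values):
    """Estimate the fill value from the non-missing entries only."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    return mean(observed)


def apply_fill(values, fill):
    """Replace missing values with a previously estimated fill value."""
    return [fill if v is None else v for v in values]


# Hypothetical values, already partitioned.
train_income = [40.0, None, 60.0, 50.0]
valid_income = [None, 90.0]

# Partition first: the fill value is learned from Training data alone.
fill = fit_mean(train_income)
train_imputed = apply_fill(train_income, fill)
valid_imputed = apply_fill(valid_income, fill)

# Impute first: Validation rows influence the fill value.
leaky_fill = fit_mean(train_income + valid_income)
```

In this toy data the validation value of 90.0 pulls the fill value from 50.0 up to 60.0 when imputation runs before partitioning, which is the sense in which validation observations end up helping to train the model. The known-zero fill has no such dependence on the data, so its position relative to the partition does not matter.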

2. If I do as in 1, I believe the Data Partition Node should be moved accordingly as well, so that the Data Partition Node does not sit right after the Input Node and perform the partition on the unbalanced data?

 

I would typically discourage moving the Impute node before the Data Partition node, since doing so effectively treats validation observations as if they were training observations. Since the observations were used to help train the model (even though it is only through imputing missing values), the ability of these observations to validate the candidate model(s) fit to the training data is tainted to some degree.

 

Hope this helps!

Doug


All Replies
Contributor
Posts: 35

Re: Sampling Node

Posted in reply to DougWielenga

Thanks Doug!
