NicolasC
Fluorite | Level 6

Hi there

 

I have been using SAS Miner for a while, with a Sampling Node right after my Input Node to perform under-sampling (for now). Following this Sampling Node are my Data Partition Node and some Transform Variables / Impute Nodes. I have a few questions:

1. What would be the effect of performing the sampling after having worked on the variables, e.g., having the Sampling Node after the Impute Node?

2. If I do as in 1, I believe the Data Partition Node should also be moved accordingly, so that the Data Partition Node is not right after the Input Node, performing the partition on the unbalanced data?

 

Many thanks

Nicolas

 


2 REPLIES
DougWielenga
SAS Employee

1. What would be the effect of performing the sampling after having worked on the variables, e.g., having the Sampling Node after the Impute Node?

 

The Data Partition node splits the raw data into Training, Validation, and Test (if desired). The Training data is used to build candidate models, the Validation data is typically used to compare competing candidate models, and the Test data is intended to allow for an unbiased estimate of performance when it has not been used for building or choosing the final model. If you use the Test data for choosing models, it really represents a second Validation data set rather than a Test data set, since it can no longer produce an unbiased estimate of model performance. The Training and Validation data sets are the most critical in this approach.
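
For readers who like to see the idea in code, here is a minimal sketch of a 60/20/20 random split written as a Base SAS DATA step (the data set names WORK.RAW, WORK.TRAIN, WORK.VALIDATE, and WORK.TEST, the seed, and the percentages are illustrative assumptions, not something the Data Partition node requires):

data work.train work.validate work.test;
  set work.raw;
  if _n_ = 1 then call streaminit(12345);     /* fix the seed so the split is reproducible */
  u = rand('uniform');                        /* one uniform draw per observation */
  if u < 0.60 then output work.train;         /* roughly 60% Training */
  else if u < 0.80 then output work.validate; /* roughly 20% Validation */
  else output work.test;                      /* roughly 20% Test */
  drop u;
run;

The Data Partition node can also stratify the split on the target so that the event rate is similar in each partition; the simple draw above does not do that.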

Sampling following the Impute node is not problematic when the only values being imputed are essentially known even though they are not present in the training data. For example, certain coding approaches assign a 1 when a condition is met but make no assignment when the condition is not met. This leaves the observations with either a 1 or a missing value, which is stored as a dot (.) for numeric variables in a SAS data set. These values are reasonably assumed to be 0 since they are not a 1. Similarly, a non-profit that records donations might have a missing value for how much a donor gave in a certain month; this is reasonably assumed to be $0.00. If you are using other methods to "guess" at the value, such as mean imputation or tree imputation, doing so before partitioning effectively uses some of your Validation data (which should be reserved for comparing models) as Training observations (used for building the model), since the imputation process is part of the model building. It is not 'wrong' per se, but it does potentially taint the usefulness of your Validation data set. It would be better not to take this approach in most situations.
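
To make the distinction concrete, here is a minimal sketch of doing mean imputation the "safe" way, with the means estimated on the Training data only and then reused for the Validation data (it assumes SAS/STAT is available; the data set names WORK.TRAIN and WORK.VALIDATE and the inputs AGE and INCOME are illustrative):

proc stdize data=work.train out=work.train_imp
            method=mean reponly outstat=work.train_means;
  var age income;   /* replace missing values using means computed from Training only */
run;

proc stdize data=work.validate out=work.validate_imp
            method=in(work.train_means) reponly;
  var age income;   /* reuse the Training means; do not re-estimate them here */
run;

Running the imputation once on the combined data before partitioning would instead let the Validation observations influence those means, which is the leakage described above.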

2. If I do as in 1, I believe the Data Partition Node should also be moved accordingly, so that the Data Partition Node is not right after the Input Node, performing the partition on the unbalanced data?

 

I would typically discourage moving the Impute node before the Data Partition node, since doing so effectively treats validation observations as if they were training observations. Because those observations were used to help train the model (even though only through imputing missing values), the ability of these observations to validate the candidate model(s) fit to the training data is tainted to some degree.
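
As an aside, the under-sampling that the Sampling Node performs in the original flow could be sketched in code as follows (a hedged illustration only; the data set names, the 0/1 coding of TARGET, the 10% rate, and the seed are all assumptions):

/* split the data into events and non-events (TARGET assumed to be a 0/1 flag) */
data work.events work.nonevents;
  set work.raw;
  if target = 1 then output work.events;
  else output work.nonevents;
run;

/* keep a simple random 10% of the non-events */
proc surveyselect data=work.nonevents out=work.nonevents_sampled
                  method=srs samprate=0.10 seed=2024;
run;

/* recombine to form the under-sampled data that flows on to partitioning */
data work.undersampled;
  set work.events work.nonevents_sampled;
run;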

 

Hope this helps!

Doug

NicolasC
Fluorite | Level 6

Thanks Doug!
