EC189QRW
Obsidian | Level 7

Hi there,

I am creating a decision tree in SAS EM. The consumer credit data sets (training, testing, and validation) all come from different colleagues of mine who built logistic regression models over the past two years. Unfortunately, I just found out that the time windows all differ from one another. For instance,

Zone1's training and testing data sets are from 01Jan2015 to 30Jun2016.

Zone1's validation data set is from 01Mar2015 to 31Aug2016.

Zone2's training and testing data sets are from 01Jul2015 to 31Dec2016.

Zone2's validation data set is from 01Dec2015 to 31May2017.

Zone3's training and testing data sets are from 01Jan2016 to 30Jun2017.

Zone3's validation data set is from 01May2016 to 31Oct2017.

... and so on, for more than 15 different regions ...

As a rookie in sampling, I don't know what I should do. Can I combine all of those data sets and treat the result as a full picture of the business across all regions? Is that rational? Any help will be appreciated.

Thanks a lot!

Eric


4 REPLIES
DougWielenga
SAS Employee

The nature of the time windows for your data sets might indicate that you (or your data providers) are using different terminology than SAS Enterprise Miner uses. For example, you wrote

 

Zone1's training and testing data sets are from 01Jan2015 to 30Jun2016.

Zone1's validation data set is from 01Mar2015 to 31Aug2016.

 

which indicates that you have two data sets that you called training and testing, plus a third data set, from a partially overlapping but later period, serving as the validation data set. This is not an uncommon situation: typically the first time period supplies the data for the training and validation data sets, and the data set gathered later is used as the testing data set.

 

In SAS Enterprise Miner, the training and validation data sets are intended to represent the same population (in this case, the same time period). Candidate models are typically built on the training data set, and the best model is chosen based on the validation data set. The data set from a later time is used as the testing data set to obtain an unbiased estimate of model performance. In your case, you might need to confirm how your colleagues defined the roles of the data used for testing and the data used for validation, since the terminology is sometimes reversed even though the usage is typically as I described. Check to make sure, but I suspect you can use what you referred to as your testing data set as the validation data set and vice versa.
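
In practice the swap is nothing more than pointing each Input Data node at a different table. A minimal sketch of the idea for Zone1 (the zone1_* data set names are placeholders for whatever your extracts are actually called):

data em_train;             /* EM role: Train    (01Jan2015 to 30Jun2016) */
   set zone1_train;
run;

data em_validate;          /* EM role: Validate (same window as Train)   */
   set zone1_test;         /* what you called "testing"                  */
run;

data em_test;              /* EM role: Test     (later window, 01Mar2015 to 31Aug2016) */
   set zone1_validation;   /* what you called "validation"               */
run;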

 

Hope this helps!

Doug

EC189QRW
Obsidian | Level 7

Dear Doug,

Thank you for your clarification about training, validation, and testing data. Let me put my question this way: in SAS EM terminology, I was trying to create training and testing data sets covering different regions of a country, but the sample data from each region all have different time windows, with only a few overlaps. I just combined them in two DATA steps into a training set and a testing set, roughly as sketched below.
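
A minimal sketch of that combine (the data set names are placeholders for my colleagues' actual extracts):

data train_all;
   /* stack each zone's training extract and record which zone each row came from */
   set zone1_train zone2_train zone3_train /* ... through zone15_train */
       indsname = src;
   length zone $ 32;
   zone = scan(src, 2, '.');   /* member name of the source, e.g. ZONE1_TRAIN */
run;

The testing extracts were stacked the same way in a second DATA step.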

As you mentioned, "In SAS Enterprise Miner, the training and validation data sets are intended to represent the same population (in this case, the same time period)."

Can I consider the combined data set from different regions and different time windows as a full picture of the population? Can I use the combined data as training and testing data sets in the model-building process?

Thank you!

DougWielenga
SAS Employee

EC189QRW,

 

In your example, you are dealing with a potential problem sometimes referred to as "Temporal Infidelity," which means it is possible that the relationships among the variables change over the time span. This change could be due to normal variation (e.g., seasonal changes) or key events (e.g., a major election, an economic upturn or downturn, a product's changing market share, etc.). When this problem is present, data collected at different times are modeling different relationships (that is, a different "population"), even if you are measuring the same people in a different time period. Consider modeling sales at the beach when you have some data from the high-traffic summer months and some data from the mid-fall to mid-spring "off-season." Even if the data sets are partially overlapping, you could get some strange results.

 

Dealing with real data is often messy. Analysts are often brought in to 'analyze' data that was collected in ways that violate the usual underlying assumptions to some extent. In those cases, it is up to the individual analyst to assess how much concern those violations deserve. The data does not reflect a perfect situation, but it is still data.

If you could retain a very high percentage of your data by keeping only the time periods present in all data sets, there is probably no reason to include the "extra"; this strategy is more concerning when it means ignoring a substantial portion of your data. If you do use data from non-overlapping periods, you are essentially assuming that there has been no change over time and that any differences are due to factors in your data rather than to Temporal Infidelity. This is less of a concern when it is the testing data that is collected at a later time, since the testing data is really intended to give you an unbiased estimate of model performance. Scoring and evaluating more recent data also shows how well model performance is holding up and might indicate when a model needs to be refit. To the extent that the differences between your training and validation data sets are due to Temporal Infidelity, the models and conclusions from your analysis might be somewhat skewed. For future data collection, try to obtain the training and validation data from common time periods; not being able to do so does not prevent you from analyzing the data, but you need to understand the additional assumptions that are required to draw analytical conclusions.
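
If you do decide to restrict to the common window, the filtering itself is trivial. A sketch, assuming the stacked table carries a date variable (obs_date here is a placeholder) and using the intersection of the Zone1 through Zone3 training windows from your example (you would compute the true intersection across all of your zones):

data common_window;
   set all_zones;   /* all zones stacked into one table */
   /* keep only observations from the span covered by every zone's window */
   where '01JAN2016'd <= obs_date <= '30JUN2016'd;
run;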

 

Hope this helps,

Doug 

EC189QRW
Obsidian | Level 7

Dear Doug,

You did help me a lot. Thank you so much!!

 

