03-24-2015 08:01 AM
I am building a logistic regression model in EM on a sample covering 01/01/2014 to 31/12/2014, and I've kept a hold-out sample from 01/01/2015 onwards so I can test whether the model is robust (works well on new data). The objective is to predict which customers are more likely to purchase a product after a communication date.
I work at an energy company, and some offers may run in one year and not in the next. My model build dataset covers one year (2014), while my out-of-sample data covers only 2 months (2015).
How can I handle this? Because of seasonality etc., will my model work well on the out-of-sample data?
Your help would be much appreciated
04-12-2015 06:23 PM
How is it going with your logistic models?
You are talking about two very important things: seasonality and choosing a good partition data set. A few comments below.
Seasonality refers to recurring patterns in your target variable. For example, let's say there is always a spike of events right at Easter, the 4th of July, Thanksgiving, and the week of Christmas. If you were to train a model that predicts the probability of a customer taking your offer in the next 6 months, you would need to take those spikes into account for both your input variables and your prediction.
Your model, as you describe it, uses a whole year for the training set, so all of these spikes are included and summarized. But remember to account for seasonal peaks if you start building variables that compare trends across two periods of time. For example, for the variable (Balance in Q4)/(Balance in Q3), consider whether you want to exclude or readjust the Q4 balance to account for the week of Christmas.
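To make the Q4/Q3 example concrete, here is a minimal Python sketch of one way to build such a ratio with the Christmas week left out of the Q4 average. Everything in it is an assumption for illustration: the function names, the choice of 22 to 28 December as "the week of Christmas", the use of a plain average, and the toy balances. In EM you would build the equivalent variable in your data preparation step.

```python
from datetime import date, timedelta

def mean_balance(daily, start, end, exclude=()):
    """Average balance over [start, end], skipping any (lo, hi) date ranges in exclude."""
    values = [
        bal for day, bal in daily.items()
        if start <= day <= end
        and not any(lo <= day <= hi for lo, hi in exclude)
    ]
    if not values:
        raise ValueError("no observations in the requested window")
    return sum(values) / len(values)

def q4_over_q3(daily, year, exclude_christmas_week=True):
    """Ratio of average Q4 balance to average Q3 balance for one customer."""
    # Assumption: "Christmas week" is taken as 22-28 December.
    skip = [(date(year, 12, 22), date(year, 12, 28))] if exclude_christmas_week else []
    q3 = mean_balance(daily, date(year, 7, 1), date(year, 9, 30))
    q4 = mean_balance(daily, date(year, 10, 1), date(year, 12, 31), exclude=skip)
    return q4 / q3

# Toy data, invented for illustration: a flat balance with a Christmas-week spike.
daily = {date(2014, 1, 1) + timedelta(days=i): 100.0 for i in range(365)}
for day_of_month in range(22, 29):
    daily[date(2014, 12, day_of_month)] = 500.0

adjusted = q4_over_q3(daily, 2014)
raw = q4_over_q3(daily, 2014, exclude_christmas_week=False)
print(adjusted, raw)
```

With the spike excluded, the ratio reflects the underlying trend; with it included, the ratio is inflated by a single week. Which version you want depends on whether the spike is signal or noise for your target.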
Similarly, you can train a model that uses the 12 previous months to predict a behavior in the next 6 months. But you will have to use your business knowledge to determine whether this model will underperform during periods when unusual customer behavior recurs every year in a given season.
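As a rough illustration of the "12 previous months to predict the next 6 months" setup, the sketch below computes the two date windows around a reference date. The helper names and the half-open window convention are my own assumptions, and the reference date is assumed to be the first day of a month.

```python
from datetime import date

def add_months(d, months):
    """Return the first day of the month that is `months` away from d's month."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)

def windows(reference, lookback=12, horizon=6):
    """Return (observation, outcome) windows as half-open [start, end) date pairs."""
    observation = (add_months(reference, -lookback), reference)
    outcome = (reference, add_months(reference, horizon))
    return observation, outcome

observation, outcome = windows(date(2015, 1, 1))
print("inputs built from: ", observation)
print("target observed in:", outcome)
```

Listing the outcome window explicitly makes it easier to see which seasonal events (holidays, offer campaigns) fall inside the period you are predicting.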
Choosing the right partition is also very important. There is no right or wrong approach. Different approaches suit different business needs.
For the approach you describe, training on a 2014 data set to predict an event in 2015 is OK as long as you use a partition node when training your model. In other words, your flow would look like this: 2014 data set -> Partition (70% training and 30% validation) -> Several model nodes -> Model Comparison.
If you want to use a 2015 data set as a testing set, it needs to include the same variables as the 2014 data set, including the target variable. Set its role to "test" and connect it to the Model Comparison node.
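Outside of EM, the 70/30 split done by the partition step can be sketched in a few lines of Python. This is only a simple random split with a fixed seed; the function name and seed are hypothetical, and it does not reproduce the partition node's own options.

```python
import random

def partition(rows, train_fraction=0.7, seed=12345):
    """Randomly split rows into (training, validation) without modifying the input."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 2014 build data: 100 customer ids.
customers_2014 = list(range(100))
training, validation = partition(customers_2014)
print(len(training), len(validation))
```

The models are fitted on the training part and compared on the validation part, so the comparison is not made on the same rows used for fitting.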
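Before attaching the 2015 data as a test set, it can help to check that it really has the same variables as the 2014 data, including the target. A minimal Python sketch of that check follows; the column names (`comm_date`, `usage`, `purchased`) and function names are invented for illustration.

```python
from datetime import date

def split_out_of_time(rows, date_key, cutoff):
    """Rows dated before the cutoff go to the build set, the rest to the test set."""
    build = [r for r in rows if r[date_key] < cutoff]
    test = [r for r in rows if r[date_key] >= cutoff]
    return build, test

def check_same_columns(build, test, target):
    """Raise ValueError if the test set lacks the target or any build-set variable."""
    build_cols = set().union(*(r.keys() for r in build))
    test_cols = set().union(*(r.keys() for r in test))
    if target not in test_cols:
        raise ValueError(f"test set has no target column {target!r}")
    if build_cols != test_cols:
        raise ValueError(f"column mismatch: {sorted(build_cols ^ test_cols)}")

# Hypothetical records, one per customer communication.
rows = [
    {"comm_date": date(2014, 3, 10), "usage": 420.0, "purchased": 0},
    {"comm_date": date(2014, 11, 2), "usage": 515.5, "purchased": 1},
    {"comm_date": date(2015, 1, 20), "usage": 610.0, "purchased": 0},
]
build, test = split_out_of_time(rows, "comm_date", date(2015, 1, 1))
check_same_columns(build, test, target="purchased")
print(len(build), len(test))
```

Because the test rows all come after the build period, performance on them gives a more honest picture of how the model behaves on new data than the random validation split does.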
I hope this helps,