03-26-2012 12:59 PM
I am using Enterprise Miner 7.1 to create a response model for a direct response marketing campaign. My sample data consists of about 69,000 records with a response rate of 0.8%. I am oversampling to 40% response and 60% non-response.
I am trying different transformation techniques and modeling techniques and using the Model Comparison node to choose a model. Out of curiosity, I changed the seed in the Sample node that creates my oversample. When I did this, I saw changes in which models were selected and which variables were selected in those models.
What is causing this and should I be concerned about it?
I would greatly appreciate any thoughts.
Thank you in advance.
If you have 69,000 records with a 0.8% response rate, that represents only 552 events. Assuming you kept all of your events and undersampled your non-events so that the 552 events represent 40% of your sample, you have only 1,380 total observations in your training data set. If you do any partitioning, that drops the number even further.
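To make the arithmetic above easy to check, here is a small Python sketch. The variable names are my own, and it assumes every event is kept and only the non-events are sampled down, as described above.

```python
# Sample-size arithmetic for the scenario in the question.
# Assumption: all events are kept and non-events are undersampled.
total_records = 69_000
response_rate = 0.008   # 0.8% responders
event_share = 0.40      # share of events after oversampling

events = round(total_records * response_rate)
sample_size = round(events / event_share)
non_events = sample_size - events

# Fraction of the available non-events that end up in the sample.
non_events_available = total_records - events
fraction_non_events_used = non_events / non_events_available

print(events, sample_size, non_events)      # 552 1380 828
print(f"{fraction_non_events_used:.2%}")    # 1.21%
```

Changing `event_share` shows how quickly the training set grows if you accept a lower event proportion.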
There are several issues to consider in this scenario, such as:
1. You have a limited number of events, likely too few to consider splitting the raw data into training and validation sets, so I would recommend using cross-validation methods in your modeling nodes.
2. You are using only 828 non-events out of roughly 69,000 (about 1.2%), which is relatively small, so it is possible (even likely) that the nature of your non-events varies quite a bit as you change the seed.
3. If you have any missing values, your effective sample is even smaller unless you impute the missing values and/or use a method (e.g. Decision Tree) that does not rely on complete observations.
4. If you have variables that are highly related to one another (be it linearly or otherwise), you can see very different models from slightly different samples of the input data. Decision Trees are highly unstable and can look dramatically different even though the underlying predictions might be similar.
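To illustrate point 2, the sketch below draws two simple random samples of 828 non-events using different seeds and counts how many rows they share. This is only an illustration using Python's standard library: the seeds and names are hypothetical, and it does not reproduce how the Sample node draws its sample internally.

```python
import random

N_NON_EVENTS = 68_448   # 69,000 records minus 552 events
SAMPLE_SIZE = 828       # non-events kept for a 40/60 split

def sample_non_events(seed):
    """Return the set of non-event row indices drawn for a given seed."""
    rng = random.Random(seed)
    return set(rng.sample(range(N_NON_EVENTS), SAMPLE_SIZE))

# Two hypothetical seeds standing in for two runs of the sampling step.
first = sample_non_events(12345)
second = sample_non_events(54321)
shared = len(first & second)

print(f"Non-events shared by both samples: {shared} of {SAMPLE_SIZE}")
```

With these sizes, the expected overlap is only about 10 rows (828 × 828 / 68,448), so two seeds give you almost entirely different non-events. That alone can change which models and variables are selected.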
There are several things you might try:
1. Use the cross-validation options (they differ from node to node)
2. Take a larger proportion of non-events (if so, set up a target profile using the Decisions... capability in the Input Data Source node and use the Default with Inverse Prior Weights... option)
3. Try using the Rule Induction node, which uses one model to isolate easily classified observations and then fits a model to the remaining observations. In this way, you can likely avoid oversampling and use your entire data set.
4. Fit a forest using the HP Forest node, which takes samples of observations and variables and fits separate models that are then combined into a final model.
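As a rough idea of what cross-validation (option 1) does with so few events, here is a generic Python sketch that assigns observations to folds while keeping the event proportion the same in each fold. The function name and the fold count are my own choices; this is not the Enterprise Miner implementation, whose cross-validation options differ from node to node.

```python
import random
from collections import Counter

def stratified_folds(labels, k=5, seed=0):
    """Assign each observation to one of k folds, balancing each class."""
    rng = random.Random(seed)
    folds = [None] * len(labels)
    for cls in set(labels):
        # Shuffle the rows of this class, then deal them out round-robin.
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for position, i in enumerate(idx):
            folds[i] = position % k
    return folds

# 552 events (1) and 828 sampled non-events (0), as in the thread.
labels = [1] * 552 + [0] * 828
folds = stratified_folds(labels, k=5)

events_per_fold = Counter(f for f, y in zip(folds, labels) if y == 1)
print(sorted(events_per_fold.items()))
```

Each fold ends up with about 110 events, so every observation is used for both fitting and assessment without setting aside a separate validation partition.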
You don't have a lot of observations, so depending on how many variables you have, you might find that one or more of the methods described above gives you more stable results.
I hope this helps!