About mstell

DougWielenga · 08-14-2017

If you have 69,000 records with a 0.8% response rate, that represents only 552 events. Assuming you kept all of your events and undersampled your non-events so that the 552 events represent 40% of your sample, you have only 1,380 total observations in your training data set. If you do any partitioning, that number drops even further.

There are several issues to consider in this scenario:

1. You have a limited number of events, likely too few to consider splitting the raw data into training and validation, so I would recommend using cross-validation methods in your modeling nodes.
2. You are sampling only 828 non-events out of roughly 69,000 records (about 1.2%), which is relatively small, so it is possible (even likely) that the nature of your non-events varies quite a bit as you change the seed.
3. If you have any missing values, your sample is even smaller unless you impute the missing values and/or use a method (e.g. Decision Tree) that does not rely on complete observations.
4. If you have variables that are highly related to one another (be it linearly or otherwise), you can see very different models from slightly different samples of the input data. Decision Trees in particular are highly unstable and can look dramatically different even though the underlying predictions might be similar.

There are several things you might try:

1. Use the cross-validation options (they differ from node to node).
2. Take a larger proportion of non-events. If you do, set up a target profile using the Decisions... capability in the Input Data Source node and use the Default with Inverse Prior Weights... option.
3. Try the Memory-Based Reasoning node, which uses one model to isolate easily classified observations and then fits a model to the remaining observations. That way you can likely avoid oversampling and use your entire data set.
4. Fit a forest using the HP Forest node, which takes samples of observations and variables and fits separate models that are then combined into a final model.

You don't have a lot of observations, so depending on whether you have a lot of variables, you might find that one or more of the methods described above gives you more stable results.

I hope this helps!
Doug
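For anyone who wants to check the arithmetic in the reply above, here is a minimal Python sketch (not SAS code and not part of the original answer). It reproduces the sample sizes from the stated figures and estimates how many non-events two different seeds would have in common. The overlap formula assumes two independent simple random samples, and the weights assume that "inverse prior weights" means 1 / prior for each class; both are assumptions made for illustration, not documented Enterprise Miner behavior.

```python
# Figures taken from the post: 69,000 records, 0.8% response rate,
# all events kept and made to represent 40% of the modeling sample.
total_records = 69_000
response_rate = 0.008
event_share_in_sample = 0.40

events = round(total_records * response_rate)
sample_size = round(events / event_share_in_sample)
sampled_non_events = sample_size - events
available_non_events = total_records - events

# Assumption: each seed draws an independent simple random sample of
# n non-events from a pool of N. The expected number of non-events the
# two samples share is then n * n / N.
expected_overlap = sampled_non_events ** 2 / available_non_events

# Assumption: inverse prior weights are 1 / prior for each class.
prior_event = events / total_records
inverse_prior_weights = {
    "event": 1 / prior_event,
    "non_event": 1 / (1 - prior_event),
}

print("events:", events)
print("sample size:", sample_size)
print("sampled non-events:", sampled_non_events)
print("expected shared non-events between two seeds:", round(expected_overlap, 1))
print("inverse prior weights:", inverse_prior_weights)
```

Under these assumptions, two seeds would share only about 10 of their 828 sampled non-events, which is consistent with the point that the non-event half of the sample changes almost completely from seed to seed.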

mstell · 10-27-2011

Thanks for both of your responses! I definitely needed the trim() function, and it looks like the mode I was using was not working correctly. When I use open(trim(mail_file), , , 'F',) everything works as it should. Thanks again.

Online Status: Offline
Date Last Visited: 09-01-2015 07:11 AM

Change to Oversampling seed creates different results.

Open Function with Variable for Data Set Name

Re: Change to Oversampling seed creates different results.
