02-02-2017 06:37 AM
I have a dataset for classification with very unbalanced classes (2% to 98%). For my master's thesis I want to try out and efficiently compare both various sampling techniques (undersampling, oversampling, SMOTE) and different algorithms (logistic regression, random forest, ANN, etc.). Since I want to do a two-dimensional analysis, trying every algorithm with every sampling technique, the number of predictive models will be quite high. Do you have any tips on how to efficiently create and organize such a workflow in SAS EM, how to store the results in one final table, and how to visualize the results in a meaningful way?
Thank you very much
2 weeks ago
SAS Enterprise Miner allows you to build multiple parallel modeling paths which you can connect back to a common Model Comparison node. You can also connect multiple sampling nodes to the same Input Data Source and build separate paths through multiple models from each and then compare all of the models in a Model Comparison node.
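Outside of the EM diagram itself, the same two-dimensional grid can be sketched in code. Below is a minimal, purely illustrative Python sketch; `fit_and_score` is a hypothetical placeholder for "resample the training data, train the algorithm, score on the shared validation set", and here just returns a dummy number so the sketch runs. The point is the shape of the workflow: one row in one results table per (sampler, algorithm) path, mirroring the parallel paths feeding a single Model Comparison node.

```python
from itertools import product
import random

samplers = ["none", "undersample", "oversample", "SMOTE"]
algorithms = ["logistic regression", "random forest", "ANN"]

def fit_and_score(sampler, algo, seed):
    """Placeholder: in a real run, resample the training data with
    `sampler`, train `algo`, and score on the shared validation set.
    Here it just returns a dummy number so the sketch runs."""
    return round(random.Random(seed).uniform(0.5, 0.9), 3)

# One results row per modeling path, analogous to one EM path per combination
results = [
    {"sampler": s, "algorithm": a, "valid_auc": fit_and_score(s, a, i)}
    for i, (s, a) in enumerate(product(samplers, algorithms))
]

best = max(results, key=lambda r: r["valid_auc"])
```

With all results in one flat table like this, a heatmap (sampler on one axis, algorithm on the other, validation metric as color) is a natural way to visualize the full grid.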
Comparing competing models is very straightforward when all of the data comes from the same data set. Combining different sampling strategies and modeling strategies for a specific data set confounds the results: you cannot be sure whether an observed difference is due to the sampling strategy or to the modeling strategy. Is the number of observations the same in all samples? If not, comparing the models becomes more problematic. Are you sampling from the same Train and Validate (and possibly Test) data sets? If not, you are comparing models built on data from different populations, which makes the results less clear.
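Holding the Train/Validate partition fixed across every path is the key control here. As an illustration of what that partition needs to preserve with a 2%/98% target, here is a small pure-Python sketch of a stratified split (a hypothetical helper, not an EM feature; EM's Data Partition node does this for you when stratifying on the target):

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key, frac_train=0.7, seed=42):
    """Split rows into train/validate partitions, preserving the
    class ratio within each partition (critical for a 2%/98% target,
    where a plain random split can leave too few rare events)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    train, validate = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = int(len(members) * frac_train)
        train.extend(members[:cut])
        validate.extend(members[cut:])
    return train, validate
```

Every sampler/algorithm combination then draws its training sample from the same `train` partition and is scored on the same `validate` partition, so the comparison stays apples-to-apples.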
Since SAS Enterprise Miner is intended for modeling extremely large data sets, it is often not necessary to sample at all except in extremely rare event scenarios. There are analytical scenarios that can benefit from sampling, but the best sampling and modeling strategy for one data set will not necessarily be the best strategy for another. SAS Enterprise Miner provides a wealth of modeling methods precisely because different data sets can be 'best' solved by different modeling strategies. By 'best', I mean taking the business objective into account. Is the analyst hoping for more interpretability, or simply looking for the best prediction according to some metric? Do the models that perform 'best' rely on variables that are expensive or difficult to come by, when a simple model using readily available data performs just as well? The challenge in any scenario is that depending on your definition of 'best', you might get very different answers.
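For the rare-event case specifically, random undersampling of the majority class is the classic remedy. A minimal pure-Python sketch of the idea (illustrative only; in EM the Sample node with stratification and adjusted level proportions does this for you):

```python
import random

def undersample_majority(rows, label_key, minority_label, ratio=1.0, seed=0):
    """Keep every minority-class row; randomly sample the majority
    class down to ratio * (minority count). With ratio=1.0 this
    yields a balanced 50/50 training sample."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[label_key] == minority_label]
    majority = [r for r in rows if r[label_key] != minority_label]
    k = min(len(majority), int(len(minority) * ratio))
    return minority + rng.sample(majority, k)
```

Remember that models trained on such a sample see an inflated event rate, so predicted probabilities must be corrected back to the true prior (in EM, via decision/prior probability settings) before they are used for scoring.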
Hope this helps!