gabon
Calcite | Level 5

Hello,

 

I have a classification dataset with very unbalanced classes (2% to 98%). For my master's thesis I want to try out and efficiently compare both various sampling techniques (undersampling, oversampling, SMOTE) and different algorithms (logistic regression, random forest, ANN, etc.). Since I want to do a two-dimensional analysis, trying every algorithm with every sampling technique, the number of predictive models will be quite high. Do you have any tips on how to efficiently create and organize such a workflow in SAS EM, and how to store the results in one final table and visualize them in a meaningful way?

 

Thank you very much

1 REPLY
DougWielenga
SAS Employee

SAS Enterprise Miner allows you to build multiple parallel modeling paths which you can connect back to a common Model Comparison node. You can also connect multiple sampling nodes to the same Input Data Source and build separate paths through multiple models from each and then compare all of the models in a Model Comparison node.  
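To make the two-dimensional layout concrete outside the Enterprise Miner diagram, here is a minimal Python sketch (not SAS code) of crossing every sampling strategy with every algorithm and collecting one row per combination, which is roughly what a shared Model Comparison node gives you. The names `SAMPLERS`, `MODELS`, `run_experiment`, `build_results_table` and `pivot` are hypothetical, and `run_experiment` is only a placeholder that returns empty metrics.

```python
from itertools import product

# Hypothetical labels for the two dimensions of the study.
SAMPLERS = ["none", "undersample", "oversample", "smote"]
MODELS = ["logistic_regression", "random_forest", "neural_network"]


def run_experiment(sampler, model):
    """Placeholder (assumption): train `model` on data prepared by `sampler`
    and return validation metrics. No model is actually fitted here."""
    return {"auc": None, "recall": None}


def build_results_table(samplers, models, runner):
    """Return one row (dict) per sampler/model combination."""
    rows = []
    for sampler, model in product(samplers, models):
        metrics = runner(sampler, model)
        rows.append({"sampler": sampler, "model": model, **metrics})
    return rows


def pivot(rows, metric):
    """Reshape the long table into {sampler: {model: value}} for one metric."""
    table = {}
    for row in rows:
        table.setdefault(row["sampler"], {})[row["model"]] = row[metric]
    return table


results = build_results_table(SAMPLERS, MODELS, run_experiment)
auc_grid = pivot(results, "auc")
```

Keeping the results in long form (one row per combination) makes it easy to pivot into a sampler-by-model grid for a heat map or a grouped bar chart.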

 

Comparing competing models is very straightforward when all of the data comes from the same data set. Combining different sampling strategies and modeling strategies for a specific data set confounds the results, in that you cannot be sure whether an effect is due to the sampling strategy or the modeling strategy. Is the number of observations the same in all samples? If not, comparing the models becomes more problematic. Are you sampling from the same Training and Validate (and possibly Test) data sets? If not, you are comparing different models on data from different populations, which makes the results less clear.
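One way to address these questions, sketched below in plain Python rather than SAS, is to split the data once, apply each sampling strategy to the training rows only, and score every model on the same untouched validation rows. This is an illustration under assumptions, not an Enterprise Miner feature: the function names are hypothetical, the labels are synthetic (20 events in 1,000 rows, matching the 2%/98% ratio in the question), and SMOTE is left out because it needs feature values.

```python
import random


def stratified_split(labels, valid_fraction=0.3, seed=0):
    """Split row indices into train/validation, preserving the class ratio."""
    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_valid = round(len(idx) * valid_fraction)
        valid_idx += idx[:n_valid]
        train_idx += idx[n_valid:]
    return train_idx, valid_idx


def split_by_class(train_idx, labels):
    """Return (minority, majority) index lists for binary 0/1 labels."""
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)


def undersample(train_idx, labels, seed=0):
    """Randomly drop majority rows until both classes have equal size."""
    minority, majority = split_by_class(train_idx, labels)
    return minority + random.Random(seed).sample(majority, len(minority))


def oversample(train_idx, labels, seed=0):
    """Randomly duplicate minority rows until both classes have equal size."""
    minority, majority = split_by_class(train_idx, labels)
    extra = random.Random(seed).choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra


labels = [1] * 20 + [0] * 980           # synthetic 2% / 98% target
train_idx, valid_idx = stratified_split(labels)
under_idx = undersample(train_idx, labels)  # training rows only
over_idx = oversample(train_idx, labels)    # training rows only
```

Note that the undersampled and oversampled training sets end up with very different numbers of observations, which is exactly the kind of difference to record alongside each model's results.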

 

Since SAS Enterprise Miner is intended for modeling extremely large data sets, it is often not necessary to sample at all except in extremely rare event scenarios. There are analytical scenarios which can benefit from sampling, but the best sampling strategy and modeling strategy for a given data set will not necessarily be the best for another data set. SAS Enterprise Miner provides a wealth of modeling methods precisely because different data sets can be 'best' solved by different modeling strategies. By 'best', I mean taking the business objective into account. Is the analyst hoping for more interpretability or simply looking for the best prediction according to some metric? Do the models which perform 'best' include variables which are expensive or difficult to come by, when a simple model using readily available data performs just as well? The challenge in any scenario is that, depending on your definition of 'best', you might get very different answers.
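A small worked example of how the choice of metric changes which model looks 'best' on a 2%/98% target. The two sets of predictions below are invented purely for illustration: one model never predicts the event, the other catches 15 of the 20 events at the cost of 100 false alarms.

```python
def accuracy(y_true, y_pred):
    """Share of rows where the prediction matches the actual label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Share of actual events (label 1) that were predicted as events."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return hits / sum(y_true)


# 20 events followed by 980 non-events (2% / 98%).
y_true = [1] * 20 + [0] * 980

# Invented model A: always predicts "no event".
pred_a = [0] * 1000

# Invented model B: finds 15 of 20 events, raises 100 false alarms.
pred_b = [1] * 15 + [0] * 5 + [1] * 100 + [0] * 880

scores = {
    "A": {"accuracy": accuracy(y_true, pred_a), "recall": recall(y_true, pred_a)},
    "B": {"accuracy": accuracy(y_true, pred_b), "recall": recall(y_true, pred_b)},
}
```

By accuracy, model A wins (0.98 against 0.895) even though it never identifies a single event; by recall, model B wins (0.75 against 0.0).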

 

Hope this helps!

Doug
