The following products are analyzed in this benchmark study:
http://support.sas.com/resources/papers/Benchmark_R_Mahout_SAS.pdf
- R 2.15.1
- Mahout 0.7 (on Hadoop)
- SAS Enterprise Miner 12.1 client, SAS 9.3
The SAS applications are commercial products; R 2.15.1 and Mahout 0.7 are open source. Three modeling methods were used across all of the applications to build the models: logistic regression, decision tree, and random forest.
One thing that was not clear from this study was how the results were derived. Although the section called "Model Quality" states that “Standardized training and validation data sets stratified by the target were used across the model test suite,” it’s not clear what standards were applied. Were the standards based on how a human would classify the events, or on some other method? If so, what was that method?
Hi,
Perhaps "standardized training and validation data sets stratified by the target were used across the model test suite" means that all of the models for each package were evaluated using the same data.
And as cited, model quality was assessed through common model quality measures (Han & Kamber 2006), i.e.:
• cumulative lift in the first decile,
• percentage of correctly classified events (often called event precision), and
• overall percentage of correct classification
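The three measures above are straightforward to compute once each model has scored the validation data. Here is a minimal stdlib-only sketch, assuming a binary 0/1 target and predicted event probabilities; the function name and threshold are illustrative, not the paper's actual code.

```python
def model_quality(y_true, scores, threshold=0.5):
    """Compute the three quality measures for a binary target.

    y_true: list of 0/1 actual event indicators (1 = event)
    scores: list of predicted event probabilities from a model
    """
    n = len(y_true)
    base_rate = sum(y_true) / n  # overall event rate in the data

    # Cumulative lift in the first decile: event rate among the 10%
    # highest-scored cases, divided by the overall event rate.
    ranked = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)
    decile = ranked[: max(1, n // 10)]
    lift_decile1 = (sum(y for _, y in decile) / len(decile)) / base_rate

    # Event precision: of the cases predicted to be events,
    # the fraction that actually are events.
    preds = [1 if s >= threshold else 0 for s in scores]
    pred_events = [y for p, y in zip(preds, y_true) if p == 1]
    event_precision = sum(pred_events) / len(pred_events) if pred_events else 0.0

    # Overall percentage of correct classification (accuracy).
    accuracy = sum(p == y for p, y in zip(preds, y_true)) / n

    return lift_decile1, event_precision, accuracy
```

Running the same function over every package's scores on the shared validation set gives directly comparable numbers, which appears to be the point of the standardized test suite.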
Since the analysis used historical data, the event value for the target is known; you need historical data with known values to do predictive modeling. The predictions from each model were then compared against the known target values in the common validation data to evaluate model quality using the statistics above.
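"Stratified by the target" simply means the split preserves the event rate in both partitions, so every model is trained and validated against the same class balance. A minimal sketch of such a split, using only the standard library (the function name and `valid_frac` default are illustrative assumptions, not taken from the paper):

```python
import random

def stratified_split(rows, target, valid_frac=0.3, seed=42):
    """Split rows into training/validation sets, preserving the target mix.

    rows: list of dicts; target: key holding the 0/1 event indicator.
    Partitioning each target class separately keeps the event rate
    roughly equal in both partitions.
    """
    rng = random.Random(seed)
    train, valid = [], []
    for level in {r[target] for r in rows}:
        group = [r for r in rows if r[target] == level]
        rng.shuffle(group)  # randomize within the class before cutting
        cut = int(len(group) * valid_frac)
        valid.extend(group[:cut])
        train.extend(group[cut:])
    return train, valid
```

With rare events (a low base rate is typical in scoring problems), an unstratified random split could leave a validation set with almost no events, which would make the lift and precision figures meaningless; stratification avoids that.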
As a side note: predictive model performance often improves when variables derived from text data are included. At last year's Analytics 2012 event, United Health Group indicated that their predictive models generally improved significantly when variables from text data were added to the algorithms, citing, for example, a misclassification rate that dropped from 30% to 10%.