JuliaM
Calcite | Level 5

The following products are analyzed in this benchmark study (http://support.sas.com/resources/papers/Benchmark_R_Mahout_SAS.pdf):

  • SAS High-Performance Analytics Server 12.1 (using Hadoop); SAS Enterprise Miner 12.1 client
  • Rapid Predictive Modeler for SAS Enterprise Miner and SAS Enterprise Miner 12.1, SAS 9.3
  • R 2.15.1 “Roasted Marshmallows” (64-bit)
  • Mahout 0.7

The SAS applications are commercial products; R 2.15.1 and Mahout 0.7 are open source. Three methods were used across all of the applications to build the models: logistic regression, decision tree, and random forest.
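As an illustration of those three model types only (the benchmark itself ran SAS, R, and Mahout, not the code below), here is a minimal scikit-learn sketch on synthetic data; the data and every parameter choice are assumptions made to show the techniques, not settings from the study:

```python
# Minimal sketch (not from the benchmark): the three model types compared in
# the paper, fit with scikit-learn on synthetic data purely for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-target data standing in for the benchmark's training set.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=6, random_state=42),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=42),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, "training accuracy:", round(model.score(X, y), 3))
```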

One thing that was not clear from this study was how the results were derived. Under the section called "Model Quality," the study states that “Standardized training and validation data sets stratified by the target were used across the model test suite,” but it’s not clear what standards were applied. Were the standards based on how a human would classify the events, or on some other method? If the latter, what was that method?

1 REPLY
FionaMcNeill
SAS Employee

Hi,

Perhaps it helps to know that “standardized training and validation data sets stratified by the target were used across the model test suite” means that all of the models for each package were evaluated using the same data.
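To make that concrete, here is a minimal sketch of one stratified train/validation split being shared by every model. This is my reading of the sentence, not code from the paper, and the split ratio and data are assumptions:

```python
# Sketch: one train/validation split, stratified on the target, reused for
# every model and every package so all are evaluated on the same data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=42)

# stratify=y keeps the event rate the same in both partitions, so each model
# is trained and validated on data with an identical target distribution.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

print("event rate, train:", round(y_train.mean(), 3),
      "validation:", round(y_valid.mean(), 3))
```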

And as cited, model quality was assessed through common model quality measures (Han & Kamber, 2006), i.e.:

  • cumulative lift in the first decile,
  • percentage of correctly classified events (often called event precision), and
  • overall percentage of correct classification

Since the analysis used historical data, the event value for the target is known; you need historical data with known values to do predictive modeling. The predictions from each model were then compared against the known target values in the common validation data to evaluate model quality using the statistics above.
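For illustration only, here is one way those three measures could be computed on a validation set with known target values. The cutoff, the synthetic data, and the exact metric definitions (e.g., reading "percentage of correctly classified events" as event precision) are assumptions, not the paper's implementation:

```python
# Sketch of the three cited quality measures, computed on held-out validation
# data whose target values are known. All choices below are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, stratify=y,
                                           random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)
prob = model.predict_proba(X_va)[:, 1]   # predicted event probability
pred = (prob >= 0.5).astype(int)         # assumed 0.5 classification cutoff

# Cumulative lift in the first decile: event rate in the top 10% of
# observations (ranked by predicted probability) over the overall event rate.
top = np.argsort(prob)[::-1][: len(prob) // 10]
lift_decile1 = y_va[top].mean() / y_va.mean()

# Event precision and overall percentage of correct classification.
event_precision = precision_score(y_va, pred)
overall_accuracy = accuracy_score(y_va, pred)

print(f"lift (1st decile): {lift_decile1:.2f}  "
      f"event precision: {event_precision:.3f}  accuracy: {overall_accuracy:.3f}")
```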

As a side note, oftentimes there is an improvement in predictive model performance when variables derived from text data are included. At last year's Analytics 2012 event, United Health Group indicated that they generally found that predictive models improved significantly when variables from text data were added to the algorithms, citing, for example, a misclassification rate that dropped from 30% to 10%.
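As a rough sketch of that idea (not the approach described in that talk), text can be turned into variables, for example TF-IDF term weights, and appended to the structured predictors. Everything below, including the data and any accuracy it prints, is a made-up illustration:

```python
# Sketch: combine structured predictors with variables derived from text.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

notes = ["patient reports chest pain", "routine follow-up visit",
         "severe chest pain and shortness of breath", "no complaints today"]
structured = np.array([[63, 1], [45, 0], [70, 1], [38, 0]])  # e.g. age, flag
y = np.array([1, 0, 1, 0])                                   # known target

text_features = TfidfVectorizer().fit_transform(notes)       # text-derived vars
X = hstack([csr_matrix(structured), text_features])          # combined inputs

model = LogisticRegression().fit(X, y)
print("training accuracy with text variables:", model.score(X, y))
```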


