Text mining and content categorization

Benchmark Study from SAS

Posts: 36

The following products are analyzed in this benchmark study:


  • SAS High-Performance Analytics Server 12.1 (using Hadoop); SAS Enterprise Miner 12.1 client
  • Rapid Predictive Modeler for SAS Enterprise Miner and SAS Enterprise Miner 12.1, SAS 9.3
  • R 2.15.1 “Roasted Marshmallows” version (64-bit)
  • Mahout 7.0

The SAS applications are commercial products. R 2.15.1 and Mahout 7.0 are open source. Three methods were used across all of the applications to build the models: logistic regression, decision tree, and random forest.

One thing that is not clear from this study is how the results were derived. Under the section called "Model Quality", the study states that “Standardized training and validation data sets stratified by the target were used across the model test suite,” but it is not clear what standards were applied. Were they based on how a human would classify the events, or on some other method? If the latter, what was the method?

SAS Employee
Posts: 17

Re: Benchmark Study from SAS


Perhaps it helps to know that "standardized training and validation data sets stratified by the target were used across the model test suite" means that all of the models for each package were evaluated using the same data.
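To make "stratified by the target" concrete, here is a minimal Python sketch of a stratified split. It is an illustration only, not the procedure used in the study: the function name `stratified_split`, the 30% validation fraction, and the toy rows are all assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_split(rows, target_of, validation_fraction=0.3, seed=0):
    """Split rows so each target class keeps the same share in both sets.

    target_of is a function returning the target value of a row.
    The 0.3 validation fraction is an assumption, not a value from the study.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[target_of(row)].append(row)

    train, validation = [], []
    for members in by_class.values():
        members = members[:]          # copy so the caller's data is untouched
        rng.shuffle(members)
        cut = round(len(members) * validation_fraction)
        validation.extend(members[:cut])
        train.extend(members[cut:])
    return train, validation

# Toy data: 100 rows, 20 of them events (target = 1).
rows = [(i, 1 if i < 20 else 0) for i in range(100)]
train, validation = stratified_split(rows, target_of=lambda r: r[1])
print(len(train), len(validation))
print(sum(r[1] for r in train) / len(train),
      sum(r[1] for r in validation) / len(validation))
```

Because the split is done within each target class, the event rate is the same in the training and validation sets, and every modeling package can then be fitted and scored on exactly these two sets.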

And as cited, model quality was assessed through common model quality measures (Han & Kamber 2006), i.e.:

  • cumulative lift in the first decile,
  • percentage of correctly classified events (often called event precision), and
  • overall percentage of correct classification

Since the analysis used historical data, the event value for the target is known. You need historical data with known values to do predictive modeling. The predictions from each model were then compared on the KNOWN common validation data to evaluate model quality using the statistics above.
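As a rough illustration of how the three measures can be computed from known outcomes and model scores on a validation set, here is a small Python sketch. It is not the study's code. The function name `model_quality`, the 0.5 classification cutoff, and the toy data are assumptions. "Event precision" is computed here as the share of predicted events that are real events; the wording "percentage of correctly classified events" could also be read as the share of actual events that were caught, so treat that choice as an assumption too.

```python
def model_quality(y_true, scores, threshold=0.5):
    """y_true: known 0/1 outcomes; scores: predicted event probabilities.

    The 0.5 threshold is an assumption made for this sketch.
    """
    n = len(y_true)
    events = sum(y_true)
    if n == 0 or events == 0:
        raise ValueError("need at least one case and one known event")

    # Cumulative lift in the first decile: event rate among the 10% of
    # cases with the highest scores, divided by the overall event rate.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, n // 10)
    top_rate = sum(y for _, y in ranked[:top_n]) / top_n
    lift = top_rate / (events / n)

    predicted = [1 if s >= threshold else 0 for s in scores]
    true_pos = sum(1 for p, y in zip(predicted, y_true) if p == 1 and y == 1)
    predicted_events = sum(predicted)

    # Event precision: predicted events that really are events.
    precision = true_pos / predicted_events if predicted_events else 0.0
    # Overall percentage of correct classification.
    accuracy = sum(1 for p, y in zip(predicted, y_true) if p == y) / n
    return {"lift_decile_1": lift, "event_precision": precision, "accuracy": accuracy}

# Toy validation set of 20 cases with 4 known events.
y_true = [1, 1, 0, 1, 1, 0] + [0] * 14
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3] + [0.1] * 14
result = model_quality(y_true, scores)
print(result)
```

Running the same function on each model's scores for the common validation data gives numbers that can be compared directly across packages.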

As a side note, predictive model performance often improves when variables derived from text data are included. At last year's Analytics 2012 event, United Health Group indicated that they generally found that predictive models improved significantly when variables from text data were added to the algorithms, citing, for example, a drop in the misclassification rate from 30% to 10%.
