
Tip: How to Automate the Modeling of Multiple Targets in SAS® Enterprise Miner™

by SAS Employee rayIII on ‎05-05-2015 10:18 AM - edited on ‎10-06-2015 11:46 AM by Community Manager (1,649 Views)

 

 

Suppose your training table includes multiple target variables. The targets might represent customer responses to various credit card offers, or whether customers purchased each of several products. If you build a modeling flow by connecting an Input Data Source node to a modeling node, only one of the targets will be modeled. So if you want to build models for each target, you need to build a separate flow for each target, right? (You may be wondering: what if I have many targets, or even hundreds?)

 

Well, actually there is an easier way. The Group Processing nodes can automate the processing of multiple targets. This approach has the following advantages:

 

  • It can be carried out with a very simple flow.
  • It allows you to easily compare modeling performance across targets.
  • It allows for the flexibility of finding a different champion algorithm for each target.

 

Flow

 

Consider the following Enterprise Miner process flow:

 

image001.png

 

This is a basic predictive modeling flow with a twist: I’ve bracketed the modeling nodes with group processing nodes. The flow does the following:

 

1.   Defines a training sample. I generated the data with DATA step code. The columns include 10 interval inputs and several binary targets. Each of the targets is engineered to yield a particular misclassification rate (from zero to 50%) for a logistic regression model. See the Appendix for details and the code.

 

2.   Defines metadata. For each of the generated target variables, I used the Metadata node to set Role=Target and Level=Binary.

 

3.   Runs HPRegression and HPTree models for each target. To have Enterprise Miner loop over each of the targets, I set “Target” as the processing mode in the Start Groups node.

 

image003.png

 

4.   Finds the best model, either logistic regression or decision tree, for each target.

 

 

Results

 

The End Groups node displays the model performance for each target. As expected, the misclassification rate for my targets ranged from 0 (perfect classification) to 50%.

 

image005.png

 

The Model Comparison node compares the logistic regression and decision tree models and chooses a champion for each target:

 

image007.png

 

In this case, logistic regression was the best method for each target. That was no surprise given how I generated the data.

 

By the way, here are the ROC curves and predicted probability plots for each of the targets, which I produced separately in SAS. Each row represents a different target, ordered from least to most error. This ordering accounts for the gradual flattening of the ROC and predicted probability curves as you read down the rows.

 

 

image009.png
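The article does not say how the plots above were produced. As one possible approach, here is a minimal sketch that loops over the noisy targets and requests a ROC plot from PROC LOGISTIC. The macro name roc_by_target and the table name work.train are hypothetical; the sketch assumes work.train is a copy of the generated training table with inputs x1-x10. It covers only the ROC curves, not the predicted probability plots.

```sas
ods graphics on;

/* Hypothetical helper: one PROC LOGISTIC run (with ROC plot) per target. */
%macro roc_by_target(data=, targets=);
    %local i target;
    %do i = 1 %to %sysfunc(countw(&targets, %str( )));
        %let target = %scan(&targets, &i, %str( ));
        title "ROC curve for &target";
        proc logistic data=&data plots(only)=roc;
            /* model the probability that the target equals 1 */
            model &target(event='1') = x1-x10;
        run;
    %end;
    title;
%mend roc_by_target;

/* work.train is an assumed copy of the generated training table */
%roc_by_target(
    data=work.train,
    targets=y_binary_10pcterror y_binary_20pcterror y_binary_30pcterror
            y_binary_40pcterror y_binary_50pcterror
);
```

The error-free target y_binary_0pcterror is left out of the list on purpose: it is perfectly separable by the inputs, so PROC LOGISTIC would likely report separation warnings for it.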

 

 

 

 

 

Score Code

 

When you use the group processing nodes in Target mode, the score code available in the End Groups node includes a block of code for each target so you can score new records on each target in a single pass.

Just be aware that if you use the Score node to generate fixed output names like EM_CLASSIFICATION, fixed names are generated for only one of the targets. For example:

 

image011.png
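To illustrate the single-pass scoring idea, here is a minimal sketch of applying exported score code to a new table. It assumes the score code from the End Groups node has been saved to a file; the file path and both table names are hypothetical placeholders.

```sas
/* Hypothetical location of the score code saved from the End Groups node */
filename scorecd "/path/to/end_groups_score.sas";

data work.scored;                /* hypothetical output table */
    set work.new_customers;      /* hypothetical table of new records to score */
    /* the score code is DATA step code, so it can be included in place */
    %include scorecd;
run;
```

Because the included code contains a block for each target, one pass through the DATA step should add the predictions for every target to work.scored.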

 

Conclusion

 

We’ve seen how to use the group processing facility to easily generate predictions for multiple target variables in Enterprise Miner. This approach not only helps keep your flows nice and compact, but also makes it easy for you to directly compare modeling performance for different targets. For example, an auto manufacturer could easily determine whether it is better able to predict purchases of, say, sedans or sports cars. Target mode also allows the flexibility to identify different champion algorithms for different targets.

 

Keep in mind that my simple group processing flow uses the same training sample for every target. That may not always be wise with real-world data. For example, if your targets represent responses to offers made to different groups of customers, such as prepaid and contract customers, then you may need to build those models using different sets of observations. Also consider whether it makes sense to use the same set of inputs for each target. In particular, are there inputs that should be used for some targets but rejected for others?

 

For more group processing examples, see: The Power of the Group Processing Facility in SAS® Enterprise Miner™. Sascha Schubert.

 

Appendix: Data Generation Code

 

This macro generates a binary target that is a function of all of the interval inputs. To get additional targets with varying misclassification rates, I randomly flipped some of the target values from 0 to 1 or vice versa.

 

The code lets you control the number of interval inputs, number of observations, number of targets and their expected misclassification rates (specified as a list), as well as the scoring cutoff. By default the macro generates 10 inputs, 25000 observations, and five targets with error, in addition to the error-free baseline target y_binary_0pcterror.

 

%macro generate_binary_data(inputs=10, obs=25000, pctReversalsList=10 20 30 40 50, p_cutoff=.5);

     /* number of noisy targets = number of entries in pctReversalsList */
     %let n_targets = %sysfunc(countw(&pctReversalsList));

     data &em_export_train. (drop=rows inputs);
          array x {&inputs.};

          do rows = 1 to &obs.;

               /* each input is uniform on (-25, 25) */
               do inputs = 1 to dim(x);
                    x{inputs} = 50*ranuni(0) - 25;
               end;

               /* baseline binary target (0 or 1): logistic function of the summed inputs, compared with the cutoff */
               y_binary_0pcterror = exp(sum(of x{*})) / (1 + exp(sum(of x{*}))) GE &p_cutoff.;

               /* additional targets with the specified percentage of reversals */
               %do t = 1 %to &n_targets;
                    %let current_pct = %sysfunc(scan(&pctReversalsList,&t));
                    y_binary_&current_pct.pcterror = y_binary_0pcterror;

                    /* introduce error by randomly flipping some values of the binary target */
                    if ranuni(0) GE 1 - (&current_pct / 100) then
                         do;
                              if y_binary_&current_pct.pcterror = 0 then y_binary_&current_pct.pcterror = 1;
                              else y_binary_&current_pct.pcterror = 0;
                         end;
               %end;

               output;
          end;
     run;

%mend generate_binary_data;

%generate_binary_data;
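
As a sanity check on the flipping logic, the sketch below measures how often each noisy target disagrees with the error-free baseline. It is not part of the original flow. It assumes the default pctReversalsList and that it runs in the same SAS Code node right after the macro call, so that &em_export_train. still refers to the generated table. The table name work.flip_check and the flipped_ variable names are hypothetical.

```sas
/* Flag each row where a noisy target differs from the baseline target. */
data work.flip_check;
    set &em_export_train.;
    flipped_10 = (y_binary_10pcterror ne y_binary_0pcterror);
    flipped_20 = (y_binary_20pcterror ne y_binary_0pcterror);
    flipped_30 = (y_binary_30pcterror ne y_binary_0pcterror);
    flipped_40 = (y_binary_40pcterror ne y_binary_0pcterror);
    flipped_50 = (y_binary_50pcterror ne y_binary_0pcterror);
run;

/* The mean of each 0/1 flag is the observed proportion of flipped rows. */
proc means data=work.flip_check n mean maxdec=3;
    var flipped_:;
run;
```

Each value is flipped independently with probability pct/100, so the reported means should land close to 0.10, 0.20, 0.30, 0.40, and 0.50, with small deviations because ranuni(0) draws a different random stream on every run.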
