
Tip: Use the Control Point Node for Simpler, Reusable Process Flows

Started ‎06-11-2014 by
Modified ‎10-06-2015 by

Process flow diagrams are a great way to create and visualize a data mining process. But you probably find that after working on a process flow for a while it tends to get cluttered, even to the point where connections between nodes seem to run in every direction.

 

Cluttered diagrams can be harder to follow and more difficult to work with than clean, well-organized diagrams. But the Control Point node can help. The Control Point node, available in the Utility menu, allows you to reduce the total number of connections between nodes, run parallel flows or branches with a single gesture, and even transform your flows into reusable templates.

 

 

Examples

 

1. Eliminate unnecessary connections

 

 

The Ensemble node in SAS® Enterprise Miner™ allows you to build an ensemble model from several component models. In many cases, ensembles provide better lift or generalizability than the individual component models that make up the ensemble.

 

 

image001.png

Ensemble Flow

 

This flow builds three models (Decision Tree, Neural Network, and Regression) to predict loan default and compares them using the Model Comparison node. It also compares a fourth model, an ensemble composed of the three individual models.

 

image003.png

 

Model Comparison Results

 

There is nothing wrong with this flow--it certainly does the job. But note that there are connections from the Data Partition node to each modeling node, from each modeling node to the Ensemble node, and from each modeling node to the Model Comparison node. If you wanted to try using different modeling algorithms (adding and removing modeling nodes from your ensemble), you would have to manage three sets of connections per modeling node. That could get tedious.

 

Here’s a better way:

 

Instead of connecting each modeling node directly to the Ensemble node, connect each to a Control Point node. Then connect the Control Point node to the Ensemble and Model Comparison nodes:

 

 

image004.png

 

 

The revised diagram does the same thing as the earlier version of our flow: it compares the three component models with the ensemble. But we have managed to eliminate some connections, simplifying the diagram.

 

What exactly does the Control Point node do here? Well, it doesn't produce any results. Rather, it has split the flow into two independent subflows--an upstream portion that builds component models and a downstream portion that ensembles the component models. In other words, we've encapsulated the modeling portion of the flow so the downstream part doesn't need to change when component models are added or removed.
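To make the savings concrete, here is a minimal Python sketch that counts connections in the two versions of the ensemble flow. It is only an illustration of the wiring described above, not anything Enterprise Miner runs: the function name `count_connections` and the generic model names are hypothetical, and the sketch assumes the Ensemble node feeds the Model Comparison node in both versions.

```python
def count_connections(n_models: int, use_control_point: bool) -> int:
    """Count the connections in the ensemble flow for a given number of models.

    Assumption: node names and wiring follow the prose description of the
    two diagrams; the model names are placeholders.
    """
    edges = set()
    models = [f"Model {i}" for i in range(1, n_models + 1)]
    for model in models:
        # Every modeling node receives data from the Data Partition node.
        edges.add(("Data Partition", model))
        if use_control_point:
            # One outgoing connection per model, to the Control Point.
            edges.add((model, "Control Point"))
        else:
            # Two outgoing connections per model.
            edges.add((model, "Ensemble"))
            edges.add((model, "Model Comparison"))
    if use_control_point:
        # The Control Point fans out once, regardless of the number of models.
        edges.add(("Control Point", "Ensemble"))
        edges.add(("Control Point", "Model Comparison"))
    # The ensemble itself is compared against the component models.
    edges.add(("Ensemble", "Model Comparison"))
    return len(edges)


for n in (3, 6):
    print(n, count_connections(n, False), count_connections(n, True))
```

Under these assumptions, each added or removed modeling node costs three connections in the original flow but only two with the Control Point, so the gap widens as the ensemble grows.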

 

2. Run multiple modeling flows at once

 

Here’s a simple flow that generates four predictive models:

 

 

image006.png

 

Pretty simple stuff. But to run each of the models you need to click on each modeling node and run it individually.  And if you later decided to override the default partition specifications or add a Transformation node upstream of the modeling nodes, you’d need to repeat the process, re-running each modeling node one at a time.

 

Here’s a better way. Simply add a Control Point to the end of your flow:

 

image007.png

Now to build all four models (and any predecessor nodes), you only need to run the Control Point node.

 

The same logic applies with more complicated flows. For example:

 

image008.png

 

 

Here, running the Control Point node runs both of the model-building paths with a single gesture. That’s really handy when you want to start a big job and let it run unattended--it saves you from having to intervene after each subflow has finished to kick off the next subflow.
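The reason one gesture is enough can be sketched in Python: running a node means running everything upstream of it first, so a node placed downstream of every model pulls in the whole flow. This is an illustration of that idea only, not Enterprise Miner's actual scheduler. The flow dictionary, the node names (including "Model A" through "Model D"), and the helper `run_order` are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical flow: each node maps to the set of its direct predecessors.
flow = {
    "Data Partition": {"Input Data"},
    "Model A": {"Data Partition"},
    "Model B": {"Data Partition"},
    "Model C": {"Data Partition"},
    "Model D": {"Data Partition"},
    "Control Point": {"Model A", "Model B", "Model C", "Model D"},
}


def run_order(flow, target):
    """Return the target and all of its upstream nodes in a valid run order."""
    needed = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            # Walk back through the node's predecessors.
            stack.extend(flow.get(node, ()))
    # Restrict the graph to the nodes that must run, then order them so
    # that every predecessor comes before its successors.
    subgraph = {node: set(flow.get(node, ())) for node in needed}
    return list(TopologicalSorter(subgraph).static_order())


print(run_order(flow, "Control Point"))  # all seven nodes, Control Point last
print(run_order(flow, "Model C"))        # only the nodes upstream of Model C
```

Running a single modeling node reaches only its own branch, while running the Control Point reaches every branch that feeds it.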

 

 

3. Swap datasets

 

When you are ramping up on a new data mining technique, just trying to understand how things work, you may find yourself running the same flow on different datasets. Of course, this is also the case when you retrain your production models using a new dataset.

 

Here, for example, I’ve built a simple flow to model the German Credit data (target=good_bad):

 

image010.png

 

 

Suppose I’ve run the above flow and now want to see what happens with the Home Equity dataset. Before I can do so, I have to remove the four connections from the German Credit dataset to the modeling nodes and establish four more.

 

image011.png

 

At least having the Model Comparison node at the end of the flow avoids the need to rerun each model individually. But here is a better way:

 

image012.png

 

This version makes it easy to switch input datasets. You simply disconnect the German Credit node from the Control Point node and then connect the Home Equity node in its place. You only need to manage one connection when swapping datasets.
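The difference in rewiring effort can be sketched in a few lines of Python. This is an illustration under assumed names: the helper `swap_input` and the placeholder model names are hypothetical, and the edge sets simply mirror the two diagrams described above.

```python
def swap_input(edges, old_source, new_source):
    """Reroute every connection leaving old_source so it leaves new_source.

    Returns the new edge set and the number of connections that changed.
    """
    new_edges = {
        (new_source if src == old_source else src, dst) for src, dst in edges
    }
    changed = len(edges - new_edges)
    return new_edges, changed


models = ["Model A", "Model B", "Model C", "Model D"]  # placeholder names
comparison = {(m, "Model Comparison") for m in models}

# Original flow: the data source connects to each modeling node directly.
direct = {("German Credit", m) for m in models} | comparison

# Revised flow: the data source connects only to the Control Point.
with_control_point = (
    {("German Credit", "Control Point")}
    | {("Control Point", m) for m in models}
    | comparison
)

_, direct_changed = swap_input(direct, "German Credit", "Home Equity")
_, cp_changed = swap_input(with_control_point, "German Credit", "Home Equity")
print(direct_changed, cp_changed)
```

The count for the direct flow grows with the number of modeling nodes, while the Control Point version always changes a single connection.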

 

Note that we’ve essentially created a "headless" diagram--a diagram template where the user only needs to choose the input dataset and run the flow.

 

Summary

 

Cluttered diagrams are much like cluttered programming code—inelegant and hazardous to your productivity. The Control Point node can help by simplifying your diagrams and encapsulating subflows so they work essentially independently of each other. With complicated flows, judicious use of the Control Point node could save you a lot of headaches.

 

The Control Point is very much a convenience feature: it analyzes no data and produces no results. It is nonetheless a very useful node to have in your data mining toolbox.

Comments

Awesome.  After using EM for quite a few years, I only just discovered this node a little over a year ago.  It is a big help!  But it never occurred to me to put it at the start of my flows like you did!  Thanks for sharing!

Thanks, Jared. Glad you found it useful.

Thanks a lot. It is a useful node.

It is helpful to see that this does not necessarily generate all the permutations.  If I had two variable selection nodes going into a single model comparison group, how would I set that up?  How would I run all three models against all variable selection nodes?

 

In case you're wondering why one would run variable selection twice instead of once, it would be to investigate the worth of a particular set of variables to see if that data is worth purchasing again.

 

 
