The SAS Rapid Predictive Modeler (RPM) task in SAS Studio not only lets you quickly and easily build predictive models using smart defaults, but it also creates an Enterprise Miner process flow and SAS code behind the scenes. Once the task in SAS Studio runs, you can open the Enterprise Miner process flow or the code and make any desired changes. The SAS Studio RPM task presents results in clear business terms, such as scorecards, lift charts, and variable importance. It automatically handles outliers, missing values, rare target events, skewed data, variable selection and model selection. Machine learning techniques such as neural networks and other data mining methods are used behind the scenes, and the best model is selected automatically.
You can further customize and tweak models using the Enterprise Miner GUI to edit the EM process flow that the RPM task creates, or by editing the code created behind the scenes. Models are registered in metadata to automate the execution of score code and make deployment to other systems easy. The SAS Studio Rapid Predictive Modeler task is useful for:
Business Analysts, who simply want a fast and accurate answer to their business question. The Business Analyst will generally accept the RPM results as is.
Statisticians, who may want to open the Enterprise Miner flow to look under the covers, adjust some of the defaults, and add or remove nodes as they see fit to incrementally improve model accuracy.
Data Scientists and Coders, who may use the Rapid Predictive Modeler to develop a coding template that they can use as a starting point to edit and amend.
Imagine you are interested in preventing auto accidents by issuing recalls on automobile parts that are likely to fail. In my example below, I start with a historical (notional) data set on auto parts that includes a binary target (dependent) variable, TargetPartFailure. TargetPartFailure indicates whether or not the part failed: 1 = failure and 0 = no failure. Other variables include a unique ID variable (PartNumber) and input (independent) variables, such as PartType, PartAge, and NumIssuesReported.
RPM finds the best model based on the historical data. That model can then be applied to a completely new data set, which has no information on target part failure but has the same inputs (independent variables) as the historical data set. This allows the analyst or manager to prioritize which auto parts should be further investigated for potential recalls.
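To make the "score new data, then prioritize" idea concrete, here is a minimal Python sketch. It is not the score code that RPM generates: the score_part function and its coefficients are made up purely for illustration, and the three parts are hypothetical records. Only the variable names (PartNumber, PartType, PartAge, NumIssuesReported) come from my example data set.

```python
import math
from dataclasses import dataclass


@dataclass
class Part:
    part_number: str          # PartNumber (unique ID)
    part_type: str            # PartType
    part_age: float           # PartAge
    num_issues_reported: int  # NumIssuesReported


def score_part(part: Part) -> float:
    """Hypothetical stand-in for a fitted model's score code.

    Returns an estimated probability of failure. The coefficients are
    invented for illustration; a real model would supply its own.
    """
    logit = -3.0 + 0.25 * part.part_age + 0.6 * part.num_issues_reported
    return 1.0 / (1.0 + math.exp(-logit))


def prioritize(parts, top_n):
    """Score every part and return the top_n (PartNumber, probability) pairs."""
    scored = [(p.part_number, score_part(p)) for p in parts]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# New, unlabeled parts: same inputs as the historical data, no target value.
new_parts = [
    Part("P-001", "brake", 2.0, 0),
    Part("P-002", "airbag", 8.0, 5),
    Part("P-003", "belt", 5.0, 1),
]

ranking = prioritize(new_parts, top_n=2)
for part_number, probability in ranking:
    print(f"{part_number}: {probability:.3f}")
```

The parts at the top of the ranking are the ones an analyst would investigate first for a potential recall.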
The first step is to upload the data so that they are available in SAS Studio. Right-click the folder where you want the data, and select Upload Files.
Navigate to the folder where your data sets are stored, and select the files you want to upload.
Drag and drop your data from the navigation pane on the left into the work area on the right. The data should automatically load into a _TEMP library.
Next, expand Tasks, then expand the Data Mining subcategory. Double-click the Rapid Predictive Modeler task.
Assign the target variable TargetPartFailure the role of Dependent Variable. Unlike the Enterprise Miner interface, the RPM task does not automatically assign this role to a variable whose name starts with “target”; you must assign it yourself.
On the Options tab, under Model you may select Basic, Intermediate or Advanced. For this example, I select Advanced.
The Basic, Intermediate, and Advanced Model options are described in the SAS Studio 3.4 User’s Guide. I chose Advanced because it evaluates the most models and then chooses the best-performing one.
Under the Reports Option, you can choose Standard reports or Standard & additional reports. Check the reports you want to see.
On the Output tab, check each box and specify the names and folders you would like for your output data sets and save locations. To keep this information handy and avoid typos in a future step, you can use Ctrl + c and Ctrl + v to copy and paste the project data name (e.g., RPMAutoSafety20160209) and the folder (e.g., C:\Users\sasdemo\EMProjects) into Notepad.
The SAS Studio RPM task will automatically create the output you requested. You will recognize this output, because it is Enterprise Miner output! For example, you will see an ROC plot with the K-S statistic.
The better the model, the higher and farther to the left the ROC curve will be, maximizing sensitivity and minimizing 1-specificity (that is, maximizing the true positive rate and minimizing the false positive rate). In my example, we have a pretty good K-S statistic (higher, or closer to one, is better) of 0.72388 for the validation data and 0.73372 for the training data. It is a good sign that the K-S statistic is similar for both the training and validation data, which suggests that we did not overfit the training data.
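For readers who want to see where a K-S number comes from, here is a minimal Python sketch that computes it as the largest gap between the true positive rate and the false positive rate across all score cutoffs. This is the usual textbook definition, not necessarily the exact computation Enterprise Miner performs, and the labels and scores are made-up toy values. The last lines simply compare the two K-S values reported in my results.

```python
def ks_statistic(labels, scores):
    """Return max |TPR - FPR| over all score cutoffs (labels are 0 or 1)."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("labels must contain both classes")

    # Walk the cases from highest score to lowest, lowering the cutoff.
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    true_pos = false_pos = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        cutoff = pairs[i][0]
        # Consume every case tied at this score before measuring the gap.
        while i < len(pairs) and pairs[i][0] == cutoff:
            if pairs[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        gap = abs(true_pos / positives - false_pos / negatives)
        best = max(best, gap)
    return best


# Toy data, invented for illustration (1 = failure, 0 = no failure).
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_ks = ks_statistic(toy_labels, toy_scores)
print(f"toy K-S: {toy_ks:.2f}")

# K-S values reported in the article's results.
ks_train, ks_valid = 0.73372, 0.72388
ks_gap = abs(ks_train - ks_valid)
print(f"train/validation K-S gap: {ks_gap:.5f}")
```

A small gap between the training and validation values is what you hope to see; a large one would be a reason to suspect overfitting.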
The SAS Studio RPM task created a SAS Enterprise Miner process flow behind the scenes. You can open that process flow in SAS Enterprise Miner to make any changes or additions to the flow that you want. Start by logging on to Enterprise Miner.
Open a new project. This is counterintuitive, but you definitely want to open a New Project.
Give the new project exactly the same name that you gave the output in RPM, and browse to the same server directory that you specified in RPM. This is why it is helpful to have copied the project name and server directory path into Notepad: it avoids typos in this step.
When you click Next, you will get a Project Exist dialog reading “The selected project exists on the filesystem. It may have been created by another user. Do you want to continue?” Click Yes. The existing project is the one you created using the RPM task in SAS Studio. Then click Next and Finish.
Open the Diagram and Voila! You see the Enterprise Miner process flow that you created with RPM in SAS Studio.
You can now use Enterprise Miner to make any changes or additions that you want.
If you'd like the sample data set used in this article, feel free to private message me via the community and I'll send it your way.