Data quality is a crucial topic for us data scientists: we need good-quality data in order to build reliable machine learning models. In my SAS Press book, "Data Quality for Analytics Using SAS," I outline data quality criteria from an analytics and data science point of view. In this context, I discuss topics like data availability, data correctness, data completeness, data quantity, and special considerations for predictive modeling, such as the need for historic snapshots of the data. I also show methods in SAS for profiling and improving data quality.
There are, however, cases where you cannot improve data quality retrospectively, so you sometimes have to live with the actual situation for your analysis.
In that case, you have to reformulate your business question or perhaps even cancel your project. It is therefore important to understand the consequences of bad data quality and its impact on your analytical models. In part III of my book, I show a large number of simulation results for predictive models and time series forecasting models. In these scenarios, I gradually degrade the quality of the analysis data by inserting missing values, altering analysis values, or reducing the number of observations and the length of the time history.
For simulation case studies in predictive modeling, I use SAS Enterprise Miner to perform and assess the different scenarios. This article shows how you can perform such a task with SAS Enterprise Miner and what you have to consider when doing so.
In a separate blog, I will outline more details on the definition and the results of the simulation case studies. You can also find some details about simulation case studies in presentation #102 of my data science presentation pool on GitHub.
The following process flow shows how the data have been changed in each scenario so that the results can be compared with the reference model on clean data.
I've performed simulations with SAS Enterprise Miner for the main functional consequences of bad data quality. In these main simulations, crucial parameters are varied systematically (for example, the percentage of missing values that's introduced into the data). These simulations are called “specific simulations” in the context of this chapter.
A specific simulation, for example, is the scenario with 10% random missing values. For each specific simulation, a number of iterations are applied. In each iteration, the data partitioning, as well as the sampling, is redone. This results in different analysis data sets for each iteration of a specific simulation.
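Conceptually, outside the SAS Enterprise Miner process flow, this iteration scheme corresponds to a macro loop like the following sketch. The macro name and the WORK data set names are hypothetical; in the actual setup, the looping and partitioning are handled by Enterprise Miner nodes.

%macro simulate(n_iter=10);
  %do iter = 1 %to &n_iter;
    /* Redo the random partitioning in every iteration.          */
    /* SEED=0 initializes the stream from the system clock, so   */
    /* each iteration sees a different analysis data set.        */
    proc surveyselect data=work.churn out=work.train_&iter
                      samprate=0.4 seed=0;
    run;
    /* ... perturb the data, train the model, and store the      */
    /* assessment statistics for this iteration ...              */
  %end;
%mend simulate;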
The simulations have been performed with SAS Enterprise Miner on a SAS 9 platform. The analysis and preparation of the results have been performed with Base SAS and SAS/GRAPH software.
In this node, the simulation data are defined (for this example, the CHURN data). The simulation was run for the AFFINITY, INSURANCE, and RESPONSE data sets as well. Note that only the variables used in the respective reference model have been set as input variables in this node.
The start groups node, together with the end groups node, defines a loop in the process flow: the nodes between them are executed a specified number of times. The loop number has been set to 10, so 10 iterations are performed for each data set, resulting in 40 iterations in total. In each iteration, the data partitioning and the data quality treatment in the SAS code node are randomly varied.
The data partition node has been used to partition the available data into training, validation, and test data with a ratio of 40:30:30. Note that the XML file of this node in the SAS Enterprise Miner installation has been edited to allow a zero seed value for the partitioning. The zero seed value is important to ensure a different initialization value for the generation of the random values with every run.
The insert missing values node is a SAS code node that randomly inserts a certain proportion of missing values for each input variable. Again, the program uses a negative seed value to allow varying random values with every simulation run.
The impute missing values node uses standard SAS Enterprise Miner functionality to impute the previously generated missing values.
Note that the insert missing values node and the impute node are examples of how to introduce a certain change in data quality. In some scenarios, perturbed data are used directly for the model training; in other scenarios, like this one, the data are treated to improve data quality.
The regression node trains a full regression model without variable selection on the input data.
The model comparison node calculates validation statistics on the training, validation, and test data. This node automatically stores the respective data in a data set.
The Store Assessm. Statistics node is a SAS code node that stores the table with assessment statistics for the respective run in a separate data set. This data set holds the loop indicator of the respective loop that is defined by the START and END group nodes. The content of this data set is used in the evaluation phase of the simulation runs, where it is appended to a results repository.
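A hedged sketch of such a storage step is shown below. The table names, the loop macro variable, and the repository name are illustrative assumptions, not the exact names that SAS Enterprise Miner uses internally.

/* Append this run's assessment statistics to a results repository. */
/* All names here are illustrative only.                            */
data work.assess_run;
  set work.model_assessment;   /* assessment table of this run */
  iteration = &loop_id;        /* loop indicator of this run   */
run;

proc append base=work.results_repository data=work.assess_run force;
run;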
Non-positive seed values are important in simulations because they cause a different split into training and validation data each time the node is run. You need this in simulation studies to generate different scenarios instead of repeating the same scenario over and over again.
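The effect can be illustrated with the RANUNI CALL routine: a non-positive seed initializes the random number stream from the time of day, so every run of the step produces a different sequence of random numbers.

data _null_;
  /* seed <= 0: the stream is initialized from the system clock, */
  /* so each run of this step yields different random numbers    */
  seed = 0;
  do i = 1 to 3;
    call ranuni(seed, r);
    put r=;
  end;
run;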
In its standard settings, SAS Enterprise Miner does not allow a non-positive seed in the SAMPLE or the PARTITION node. To allow a non-positive seed, the following changes to the XML file of the PARTITION or SAMPLE node are necessary.
Note that, in general, it is not advisable to edit the standard XML files that define the properties of the standard SAS Enterprise Miner nodes, so the following steps should be taken with care.
<Property type="int" name="RandomSeed"
displayName="properties.common.randomseed.txt"
description="randomseed.desc.txt"
initial="12345">
<Control>
<Range min="0" excludeMin="N" />
</Control>
</Property>
It's now possible to enter a non-positive value as a random seed in the PARTITION node of SAS Enterprise Miner:
The parameter setting that differentiates the scenarios from each other has been used to name the modeling node in the scenario. See Figure C.3.
proc sql noprint;
  /* count the number of target events in the training data */
  select count(*)
    into :ne_trn
    from &em_import_data
    where %em_target = 1;
quit;
%do i = 1 %to &em_num_input;
  /* draw a uniform random number for each input variable */
  call ranuni(seed, rnd);
  /* set the value to missing with probability &pct_missing */
  if rnd le &pct_missing then call missing(%scan(%em_input, &i));
%end;
There is one disadvantage of using SAS Enterprise Miner for the simulations: defining the different scenarios is somewhat more cumbersome than it would be if all settings were maintained in a single code file.
If a simulation with 10 different parallel arms is defined—for example, percentages of missing values of 0%, 10%, 20%, … 90%—the nodes for each arm need to be copied and renamed accordingly, and the settings need to be adjusted in the code node.
The advantages of using SAS Enterprise Miner for these simulations, however, far outweigh this disadvantage. SAS Enterprise Miner provides tools that perform many of the tasks needed for the respective simulation runs; otherwise, all these tasks would need to be programmed manually. For example:
The following macros and macro variables available in the code node in SAS Enterprise Miner have been used for the simulation environment:
| Macro or macro variable | Explanation |
| --- | --- |
| %EM_TARGET | Resolves to the variables that have a model role of target. The target variable is the dependent or the response variable. |
| &EM_IMPORT_DATA | Resolves to the name of the training data set. |
| &EM_EXPORT_TRAIN | Resolves to the name of the export training data set. |
| &EM_IMPORT_VALIDATE | Resolves to the name of the validation data set. |
| &EM_EXPORT_VALIDATE | Resolves to the name of the export validation data set. |
| &EM_IMPORT_TEST | Resolves to the name of the test data set. |
| &EM_EXPORT_TEST | Resolves to the name of the export test data set. |
| %EM_INPUT | Resolves to the variables that have a model role of input. The input variables are the independent or predictor variables. |
| &EM_NUM_INPUT | Resolves to the number of input variables. |
| &EM_NUM_INTERVAL_INPUT | Resolves to the number of interval input variables. |
| %EM_INTERVAL_INPUT | Resolves to the interval variables that have a model role of input. |