Data quality is a crucial topic for us data scientists: we need good-quality data in order to build reliable machine learning models. In my SAS Press book, "Data Quality for Analytics Using SAS," I outline data quality criteria from an analytics and data science point of view. In this context, I discuss topics like data availability, data correctness, data completeness, data quantity, and special considerations for predictive modeling, such as the need for historic snapshots of the data. I also show methods in SAS for profiling and improving data quality.
There are, however, cases where you cannot improve data quality retrospectively, so you sometimes have to live with the actual situation for your analysis.
In that case, you have to reformulate your business question or perhaps even cancel your project. It is therefore important to understand the consequences of bad data quality and its impact on your analytical models. In part III of my book, I show a large number of simulation results for predictive models and time series forecasting models. In these scenarios, I gradually degrade the quality of the analysis data by inserting missing values, altering analysis values, or reducing the number of observations and the length of the time history.
For simulation case studies in predictive modeling, I use SAS Enterprise Miner to perform and assess the different scenarios. This article shows how you can perform such a task with SAS Enterprise Miner and what you have to consider when doing so.
In a separate blog, I will outline more details on the definition and the results of the simulation case studies. You can also find some details about simulation case studies in presentation #102 of my data science presentation pool on GitHub.
The following process flow shows how the data have been changed in each scenario so that the results can be compared with the reference model on clean data.
I've performed simulations with SAS Enterprise Miner for the main functional consequences of bad data quality. In these main simulations, crucial parameters are varied systematically (for example, the percentage of missing values that's introduced into the data). These simulations are called “specific simulations” in the context of this chapter.
A specific simulation, for example, is the scenario with 10% random missing values. For each specific simulation, a number of iterations are applied. In each iteration, the data partitioning, as well as the sampling, is redone. This results in different analysis data sets for each iteration of a specific simulation.
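Conceptually, outside the SAS Enterprise Miner process flow, this iteration scheme corresponds to a macro loop like the following sketch. The macro name and the WORK data set names are hypothetical; in the actual setup, the looping and partitioning are handled by Enterprise Miner nodes.

%macro simulate(n_iter=10);
  %do iter = 1 %to &n_iter;
    /* Redo the random partitioning in every iteration.          */
    /* SEED=0 initializes the stream from the system clock, so   */
    /* each iteration sees a different analysis data set.        */
    proc surveyselect data=work.churn out=work.train_&iter
                      samprate=0.4 seed=0;
    run;
    /* ... perturb the data, train the model, and store the      */
    /* assessment statistics for this iteration ...              */
  %end;
%mend simulate;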
The simulations have been performed with SAS Enterprise Miner on a SAS 9 platform. The analysis and preparation of the results have been performed with Base SAS and SAS/GRAPH software.
In this node, the simulation data are defined (for this example, the CHURN data). The simulation was run for the AFFINITY, INSURANCE, and RESPONSE data sets as well. Note that only the variables used in the respective reference model have been set as input variables in this node.
The start groups node, together with the end groups node, defines a loop in the process flow: the nodes between them are executed a specified number of times. The loop number has been set to 10, so 10 iterations are performed for each data set, resulting in 40 iterations in total. In each iteration, the data partitioning and the data quality treatment in the SAS code node are randomly varied.
The data partition node has been used to partition the available data into training, validation, and test data with a ratio of 40:30:30. Note that the XML file of this node in the SAS Enterprise Miner installation has been edited to allow a zero seed value for the partitioning. The zero seed value is important to ensure a different initialization value for the generation of the random values with every run.
The insert missing values node is a SAS code node that randomly inserts a certain proportion of missing values for each input variable. Again, the program uses a negative seed value to allow varying random values with every simulation run.
The impute missing values node uses standard SAS Enterprise Miner functionality to impute the previously generated missing values.
Note that the insert missing values node and the impute node are examples of how to introduce a certain change in data quality. In some scenarios, perturbed data are used directly for the model training; in other scenarios, like this one, the data are treated to improve data quality.
The regression node trains a full regression model without variable selection on the input data.
The model comparison node calculates validation statistics on the training, validation, and test data. This node automatically stores the respective data in a data set.
The Store Assessm. Statistics node is a SAS code node that stores the table with assessment statistics for the respective run in a separate data set. This data set holds the loop indicator of the respective loop that is defined by the START and END group nodes. The content of this data set is used in the evaluation phase of the simulation runs, where it is appended to a results repository.
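A hedged sketch of such a storage step is shown below. The table names, the loop macro variable, and the repository name are illustrative assumptions, not the exact names that SAS Enterprise Miner uses internally.

/* Append this run's assessment statistics to a results repository. */
/* All names here are illustrative only.                            */
data work.assess_run;
  set work.model_assessment;   /* assessment table of this run */
  iteration = &loop_id;        /* loop indicator of this run   */
run;

proc append base=work.results_repository data=work.assess_run force;
run;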
Non-positive seed values are important in simulations because they cause a different split into training and validation data each time the node is run. You need this in simulation studies to generate different scenarios instead of repeating the same scenario over and over again.
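The effect can be illustrated with the RANUNI CALL routine: a non-positive seed initializes the random number stream from the time of day, so every run of the step produces a different sequence of random numbers.

data _null_;
  /* seed <= 0: the stream is initialized from the system clock, */
  /* so each run of this step yields different random numbers    */
  seed = 0;
  do i = 1 to 3;
    call ranuni(seed, r);
    put r=;
  end;
run;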
In its standard settings, SAS Enterprise Miner does not allow a non-positive seed in the SAMPLE or the PARTITION node. To allow a non-positive seed, the following changes to the XML file of the PARTITION or SAMPLE node are necessary.
Note that, in general, it is not advisable to edit the standard XML files that define the properties of the standard SAS Enterprise Miner nodes, so the following steps should be taken with care.
<Property type="int" name="RandomSeed"
displayName="properties.common.randomseed.txt"
description="randomseed.desc.txt"
initial="12345">
<Control>
<Range min="0" excludeMin="N" />
</Control>
</Property>
It's now possible to enter a non-positive value as a random seed in the PARTITION node of SAS Enterprise Miner:
The parameter setting that differentiates the scenarios from each other has been used to name the modeling node in the scenario. See Figure C.3.
proc sql noprint;
  /* count the number of target events in the training data */
  select count(*)
    into :ne_trn
    from &em_import_data
    where %em_target = 1;
quit;
%do i = 1 %to &em_num_input;
  /* draw a uniform random number for each input variable */
  call ranuni(seed, rnd);
  /* set the value to missing with probability &pct_missing */
  if rnd le &pct_missing then call missing(%scan(%em_input, &i));
%end;
There is one disadvantage of using SAS Enterprise Miner for the simulations: defining the different scenarios is somewhat more cumbersome than it would be if all settings were maintained in a single code file.
If a simulation with 10 different parallel arms is defined—for example, percentages of missing values of 0%, 10%, 20%, … 90%—the nodes for each arm need to be copied and renamed accordingly, and the settings need to be adjusted in the code node.
The advantages of using SAS Enterprise Miner for these simulations, however, far outweigh this disadvantage. SAS Enterprise Miner provides tools that perform many of the tasks needed for the respective simulation runs; otherwise, all these tasks would need to be programmed manually. For example:
The following macros and macro variables available in the code node in SAS Enterprise Miner have been used for the simulation environment:
| Macro or macro variable | Explanation |
| --- | --- |
| %EM_TARGET | Resolves to the variables that have a model role of target. The target variable is the dependent or the response variable. |
| &EM_IMPORT_DATA | Resolves to the name of the training data set. |
| &EM_EXPORT_TRAIN | Resolves to the name of the export training data set. |
| &EM_IMPORT_VALIDATE | Resolves to the name of the validation data set. |
| &EM_EXPORT_VALIDATE | Resolves to the name of the export validation data set. |
| &EM_IMPORT_TEST | Resolves to the name of the test data set. |
| &EM_EXPORT_TEST | Resolves to the name of the export test data set. |
| %EM_INPUT | Resolves to the variables that have a model role of input. The input variables are the independent or predictor variables. |
| &EM_NUM_INPUT | Resolves to the number of input variables. |
| &EM_NUM_INTERVAL_INPUT | Resolves to the number of interval input variables. |
| %EM_INTERVAL_INPUT | Resolves to the interval variables that have a model role of input. |