In today’s financial industry, risk management practices such as stress testing are no longer a periodic regulatory formality – they are a strategic pillar of risk management and capital adequacy planning. As data volumes grow exponentially and model complexity increases, financial institutions have embraced distributed computing platforms like SAS Viya to run large-scale stress testing models efficiently.
However, parallelization introduces a counterintuitive challenge: the same model, run multiple times on the same dataset, can yield slightly different results depending on how data is distributed and how computations are executed across nodes. This raises a legitimate concern for risk and compliance professionals: if a model does not yield identical results upon re-execution, can it still meet regulatory standards?
To explore this issue, we begin by recognizing that supervised machine learning models present a compelling alternative to traditional probability of default (PD) curve methodologies, and an increasing number of risk management solutions are acknowledging their potential and actively considering their adoption. Given this backdrop, let’s frame our discussion around some of the most widely used supervised models.
Distributed algorithms are assuming an increasingly critical role in modern financial risk modeling, offering the ability to process massive datasets across multiple machines. Their primary promise lies in scalability, speed, and resilience: they make it possible to train complex machine learning models in parallel, reduce execution time, and continue operating even if some nodes fail.
Traditional supervised learning models, such as ordinary least squares (OLS) and logistic regression, operate under well-understood mathematical principles. In a controlled, single-threaded environment, the results are precise and reproducible. However, when these models are implemented in a distributed computing environment, a number of factors complicate the story.
For instance, in a distributed computing environment, the data may be partitioned differently from one run to the next, partial results may be aggregated in a different order depending on node availability and system load, and the underlying hardware itself may vary. As we will see later, these subtle shifts in computation can lead to meaningful differences in the final model results.
SAS uses floating-point representation to store numeric values. Floating-point representation is a way for computers to handle very large and very small numbers efficiently using limited memory (See Floating-Point Representation in SAS Viya Platform Programming Documentation for more details).
The core issue of non-reproducible model results in distributed computing environments stems from a key characteristic of floating-point arithmetic: it does not follow the associative property. In other words, when numbers are added together in different orders, the result can change slightly due to rounding errors.
In other words, in floating-point arithmetic, (a + b) + c does not necessarily equal a + (b + c).
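To see this concretely, consider the following DATA step (a minimal sketch; the constants are chosen only to force the rounding behavior):

/* Floating-point addition is not associative */
data _null_;
   a = 1e16;  b = -1e16;  c = 1;
   sum1 = (a + b) + c;   /* (1e16 - 1e16) + 1 = 1 */
   sum2 = a + (b + c);   /* c is absorbed by b's huge magnitude, so the result is 0 */
   put sum1= sum2=;
run;

Because a distributed system may aggregate partial sums in any order, differences of exactly this kind can surface from one run to the next without any change to the data or the code.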
In distributed computing, operations such as matrix multiplication and summation are performed in parallel across many partitions. Each node handles a subset of the data and computes partial results. These partial results are then aggregated. However, because the order of operations can differ based on node availability, system load, or partition size, the final result is subject to minor floating-point inconsistencies.
When calculating something as sensitive as regression coefficients – especially in models that rely on matrix inversions (like OLS) or iterative gradient updates (like logistic regression) – these small differences can propagate and grow.
Another key consideration is the algorithmic approach used to compute the model results. For example, in OLS regression, the standard method for estimating the coefficients (the β) involves solving the normal equations (derived by setting the gradient of the loss function to zero) to obtain:

\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y
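For reference, a sketch of the derivation, with X the design matrix and y the response vector (assuming X^{\top}X is invertible):

\min_{\beta}\ \lVert y - X\beta \rVert_2^2
\;\Longrightarrow\;
\nabla_{\beta}\,\lVert y - X\beta \rVert_2^2 = -2X^{\top}(y - X\beta) = 0
\;\Longrightarrow\;
X^{\top}X\,\hat{\beta} = X^{\top}y
\;\Longrightarrow\;
\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y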
This closed-form solution requires matrix operations that can be difficult to scale in a distributed context. To manage memory and processing constraints, distributed systems often break these operations into smaller chunks or use approximations.
On the other hand, in logistic regression, the coefficients are estimated using iterative optimization algorithms like Newton-Raphson or gradient descent. These methods depend on initial values, convergence criteria, and step sizes. When run on multiple partitions, the optimization path may differ, leading to small but meaningful differences in results.
In iterative methods the outcome of the process depends not only on the data and algorithm but also on the initial conditions and data handling sequence. A random seed determines the starting points or initialization used by the algorithm – for example, the initial weights in a neural network. If the seed is not explicitly fixed, the algorithm may initialize differently with each run, leading to different results even when the input data remains the same.
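As a simple single-machine illustration of why seeds matter, consider Base SAS's stream-based random number generator (a minimal sketch; the seed value 123 is arbitrary):

/* With CALL STREAMINIT the stream is repeatable across runs; */
/* without it, the seed comes from the system clock and every run differs */
data _null_;
   call streaminit(123);
   do i = 1 to 3;
      x = rand('uniform');
      put x=;
   end;
run;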
Similarly, the order in which data is processed can influence results in algorithms that rely on incremental updates (like stochastic gradient descent and mini-batch gradient descent). In a distributed computing environment, this order can vary due to how data is partitioned and processed across nodes. Without explicitly setting parameters that ensure consistent data shuffling, the model results may vary from one run to the next, making reproducibility a challenge.
The way data is split across partitions in a distributed computing environment has a significant impact on the representativeness of each partition. Ideally, each partition should reflect the overall structure of the full dataset, including similar distributions of key variables, target classes, and outliers.
When data is unevenly distributed, the models trained on different nodes may converge to different optima, depending on patterns that are most prominent in the respective partitions. This can affect not only the accuracy and generalization of the final aggregated model but also the reproducibility and stability of results across runs.
As noted earlier, floating-point operations can produce slightly different results across different CPU architectures due to variations in how these architectures handle numerical precision, rounding, and instruction ordering.
In a cloud environment, this issue is particularly important because the infrastructure is often heterogeneous. That means our computation may run across a mix of different physical machines with CPUs from different vendors (for example, Intel vs. AMD), different instruction sets (for example, AVX (Advanced Vector Extensions) vs. SSE (Streaming SIMD Extensions)), or even varying implementations of floating-point arithmetic standards.
Let us first illustrate that non-replicability is indeed an issue when we run machine learning algorithms in a distributed environment. For this purpose, we are going to use the HMEQ (home equity) data set available in the SAMPSIO library in SAS Viya for Learners (See Home Equity Data Set in SAS Viya Platform Programming Documentation for more details).
Gradient Boosting is chosen because it relies heavily on iterative procedures that build the model sequentially to minimize prediction errors (residuals). Consequently, the final results are highly sensitive to randomization and the order in which the training data is processed.
SAS Viya for Learners provides the distributed computing architecture needed for demonstrating non-replicability. The platform leverages multi-node parallel processing through its Cloud Analytic Services (CAS) engine. CAS distributes data and workloads across a cluster of servers, enabling concurrent computation and faster results.
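To inspect how your own CAS session is distributed, one option (a sketch, assuming a CAS session has already been started as in the appendix code) is to query the server status from PROC CAS:

/* List the CAS controller and worker nodes for the current session */
proc cas;
   builtins.serverStatus;
run;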
Assessing the replicability (or reproducibility) of gradient boosting models means determining whether the model yields consistent results across repeated runs under identical conditions. In this exercise, we’ll evaluate replicability by comparing the Area Under the Curve (AUC) metric across four independent runs of the same gradient boosting model.
The code in Appendix A1 runs a Gradient Boosting Model (GBM) four times on an identical dataset, with no changes to model parameters or data partitioning. The only variation is the execution instance within the distributed computing environment. Below are the AUC (Area Under the Curve) values obtained using the same validation data across all four runs:
Although the range – approximately 0.0014 – appears small, even such marginal differences may carry weight in regulated environments like credit scoring or financial stress testing, where precision directly influences business outcomes and regulatory compliance.
Refer to Appendix A1 for the SAS Code for Gradient Boosting Model with default settings.
Regulators increasingly acknowledge that exact numerical reproducibility is often unattainable in machine learning models, particularly those built using distributed or parallelized systems. Instead, they prioritize transparency and awareness, emphasizing clear documentation of variability sources, model configurations, random seeds, software versions, and execution environments. Both the European Banking Authority (EBA) and the U.S. Federal Reserve (e.g., SR 11-7) stress the importance of robust model risk management, where understanding and explaining model behavior matters more than exact replication.
Minor differences in evaluation metrics – such as an AUC of 0.980156 vs. 0.981246 – are generally acceptable, provided they don’t lead to materially different outcomes in areas like credit approvals, IFRS 9 staging, or capital allocation. Regulators expect institutions to perform sensitivity analyses or back-testing to show that such variations don’t impact decisions or regulatory results significantly.
Repeatability and process control take precedence over strict determinism. Regulators focus on whether the modeling process is well-governed, validated, and monitored. As long as performance stays within a defined and reasonable range, and significant model drift is detectable over time, minor output variation is not a concern. This reflects a broader acceptance of non-determinism in modern ML environments.
Model governance must adapt by defining acceptable tolerance levels for performance metrics, implementing version control, and maintaining audit trails. Firms should also use benchmark scenarios with predefined output ranges to guide expectations. For example, in ICAAP or CCAR submissions, confidence intervals for key metrics are commonly used to account for uncertainty due to non-reproducibility.
The table below summarizes how various mainstream regulatory frameworks approach reproducibility:
Modern machine learning platforms, such as SAS Viya, offer configurations and tools specifically designed to enhance consistency and control across runs. While regulatory compliance does not mandate it, SAS Viya provides options that can be explicitly configured to yield reproducible results for machine learning algorithms like factorization machine, random forest, gradient boosting, and support vector machines.
Two such options are SEED= and APPLYROWORDER= in the GRADBOOST procedure, which generates gradient boosting models. The SEED= option in PROC GRADBOOST sets the random number seed used for controlling randomization across different runs. For example, a fixed SEED=123 ensures that the model training process – such as how trees are constructed – is consistent across runs, assuming the rest of the environment is also controlled.
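In its simplest form, fixing the seed looks like this (a sketch using the HMEQ table and variables from the appendix code):

/* A fixed SEED= makes tree construction repeatable across runs, */
/* provided the rest of the environment is also controlled */
proc gradboost data=mycas.hmeq_part seed=123;
   input clage clno debtinc loan mortdue value yoj derog delinq ninq / level=interval;
   input reason job / level=nominal;
   target bad / level=nominal;
run;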
The APPLYROWORDER= option preserves the row order and distribution of the input data during the training of the gradient boosting model. Enabling the APPLYROWORDER option necessitates the use of the PARTITION action from the TABLE action set in PROC CAS.
Note that PROC PARTITION, which we use to partition data for modeling (e.g., train/validation/test splits), is different from the PARTITION action, which redistributes data for downstream actions while preserving row order – the property we need for building reproducible machine learning workflows.
When using the PARTITION action, data is allocated to threads and workers based on a chosen partition variable. To maintain reproducible ordering within each thread, we can also define an ORDERBY key, which must uniquely identify each row.
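Putting the two together, the redistribution step looks like this (a sketch drawn from Appendix A2, where id is a unique row identifier created beforehand):

/* Repartition by JOB and order rows within each partition by the unique ID */
proc cas;
   action table.partition /
      table={name="hmeq_part",
             groupby={"Job"},
             orderby={"id"}},
      casout={name="hmeq_part2", replace=True};
run;

The resulting table (hmeq_part2) then serves as the input to PROC GRADBOOST with the APPLYROWORDER option.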
Refer to Appendix A2 for the SAS Code used for generating reproducible Gradient Boosting Models.
With a fixed seed and preserved row order, all four gradient boosting model runs yield an identical AUC score of 0.975081 on the validation dataset. This demonstrates complete reproducibility in predictive performance: the algorithm behaves deterministically under fixed configurations, despite running in a distributed, parallelized environment.
Finally, it's important to recognize that reproducibility in data distribution is not solely determined by the partition keys and ORDERBY keys. Even when the same dataset is used, along with identical partition and ORDERBY keys, the actual distribution of data across threads and workers can still vary if the underlying system architecture differs. For example, a system with 8 threads per worker may allocate data differently than a system with 16 threads, even when both are running the same code and using the same partitioning logic.
For more information on SAS Risk Management Solutions, visit the software information page here. For more information on curated learning paths on SAS Solutions and SAS Viya, visit the SAS Training page. You can also browse the catalog of SAS courses here.
Find more articles from SAS Global Enablement and Learning here.
Appendix A1: SAS Code for Gradient Boosting Model with Default Settings

/************************************************************************/
/* Setup CAS session and initialize required macro variables */
/************************************************************************/
/* Define a CAS engine libref for CAS in-memory data tables */
cas mySession sessopts=( timeout=1800 locale="en_US");
libname mycas cas caslib=casuser;
/* Specify the data set inputs and target */
%let class_inputs=reason job;
%let interval_inputs=clage clno debtinc loan mortdue value yoj derog delinq ninq;
%let target=bad;
/* Specify a folder path to write the temporary output files */
%let outdir=&_SASWORKINGDIR;
/************************************************************************/
/* Load data (HMEQ) into CAS if needed. Data should already exist in */
/* CAS, and it will be loaded here if it does not exist in CAS */
/************************************************************************/
%if not %sysfunc(exist(mycas.hmeq)) %then %do;
proc casutil;
load data=sampsio.hmeq casout="hmeq" outcaslib=casuser;
run;
%end;
/************************************************************************/
/* Explore the data and plot missing values */
/************************************************************************/
proc cardinality data=mycas.hmeq outcard=mycas.data_card;
run;
proc print data=mycas.data_card(where=(_nmiss_>0));
title "Data Summary";
run;
data data_missing;
set mycas.data_card (where=(_nmiss_>0) keep=_varname_ _nmiss_ _nobs_);
_percentmiss_=(_nmiss_/_nobs_)*100;
label _percentmiss_='Percent Missing';
run;
proc sgplot data=data_missing;
title "Percentage of Missing Values";
vbar _varname_ / response=_percentmiss_ datalabel categoryorder=respdesc;
run;
title;
/************************************************************************/
/* Impute missing values */
/************************************************************************/
proc varimpute data=mycas.hmeq;
input clage /ctech=mean;
input delinq /ctech=median;
input ninq /ctech=random;
input debtinc yoj /ctech=value cvalues=50, 100;
output out=mycas.hmeq_prepped copyvars=(_ALL_);
code file="&outdir./impute_score.sas";
run;
/************************************************************************/
/* Setup and initialize some more macro variables */
/************************************************************************/
%let casdata=mycas.hmeq_prepped;
%let partitioned_data=mycas.hmeq_part;
/************************************************************************/
/* Partition the data into training and validation */
/************************************************************************/
proc partition data=&casdata partition samppct=70;
by &target;
output out=&partitioned_data copyvars=(_ALL_);
run;
/************************************************************************/
/* First GBM (GBM1) predictive model */
/************************************************************************/
/* ALL data used for training model */
proc gradboost data=&partitioned_data;
input &interval_inputs / level=interval;
input &class_inputs / level=nominal;
target &target / level=nominal;
code file="&outdir./GBM1_score.sas";
run;
/************************************************************************/
/* Score the data using the generated GBM1 model score code */
/************************************************************************/
data mycas._scored_GBM1;
set &partitioned_data;
%include "&outdir./GBM1_score.sas";
run;
/************************************************************************/
/* Assess model performance (GBM1) */
/************************************************************************/
ods exclude all;
proc assess data=mycas._scored_GBM1(where=(_partind_=0));
input p_&target.1;
target &target / level=nominal event='1';
fitstat pvar=p_&target.0/ pevent='0';
ods output fitstat=GBM1_fitstat
rocinfo=GBM1_rocinfo
liftinfo=GBM1_liftinfo;
run;
ods exclude none;
/************************************************************************/
/* Second GBM (GBM2) predictive model */
/************************************************************************/
proc gradboost data=&partitioned_data;
input &interval_inputs / level=interval;
input &class_inputs / level=nominal;
target &target / level=nominal;
code file="&outdir./GBM2_score.sas";
run;
/************************************************************************/
/* Score the data using the generated GBM2 model score code */
/************************************************************************/
data mycas._scored_GBM2;
set &partitioned_data;
%include "&outdir./GBM2_score.sas";
run;
/************************************************************************/
/* Assess tree model performance (GBM2) */
/************************************************************************/
ods exclude all;
proc assess data=mycas._scored_GBM2(where=(_partind_=0));
input p_&target.1;
target &target / level=nominal event='1';
fitstat pvar=p_&target.0/ pevent='0';
ods output fitstat=GBM2_fitstat
rocinfo=GBM2_rocinfo
liftinfo=GBM2_liftinfo;
run;
ods exclude none;
/************************************************************************/
/* Third GBM (GBM3) predictive model */
/************************************************************************/
proc gradboost data=&partitioned_data;
input &interval_inputs / level=interval;
input &class_inputs / level=nominal;
target &target / level=nominal;
code file="&outdir./GBM3_score.sas";
run;
/************************************************************************/
/* Score the data using the generated GBM3 model score code */
/************************************************************************/
data mycas._scored_GBM3;
set &partitioned_data;
%include "&outdir./GBM3_score.sas";
run;
/************************************************************************/
/* Assess tree model performance (GBM3) */
/************************************************************************/
ods exclude all;
proc assess data=mycas._scored_GBM3(where=(_partind_=0));
input p_&target.1;
target &target / level=nominal event='1';
fitstat pvar=p_&target.0/ pevent='0';
ods output fitstat=GBM3_fitstat
rocinfo=GBM3_rocinfo
liftinfo=GBM3_liftinfo;
run;
ods exclude none;
/************************************************************************/
/* Fourth GBM (GBM4) predictive model */
/************************************************************************/
proc gradboost data=&partitioned_data;
input &interval_inputs / level=interval;
input &class_inputs / level=nominal;
target &target / level=nominal;
code file="&outdir./GBM4_score.sas";
run;
/************************************************************************/
/* Score the data using the generated GBM4 model score code */
/************************************************************************/
data mycas._scored_GBM4;
set &partitioned_data;
%include "&outdir./GBM4_score.sas";
run;
/************************************************************************/
/* Assess tree model performance (GBM4) */
/************************************************************************/
ods exclude all;
proc assess data=mycas._scored_GBM4(where=(_partind_=0));
input p_&target.1;
target &target / level=nominal event='1';
fitstat pvar=p_&target.0/ pevent='0';
ods output fitstat=GBM4_fitstat
rocinfo=GBM4_rocinfo
liftinfo=GBM4_liftinfo;
run;
ods exclude none;
/*************************************************************************/
/* Create ROC and Lift plots (all models) using validation data */
/*************************************************************************/
ods graphics on;
data all_rocinfo;
set GBM1_rocinfo(in=g1) GBM2_rocinfo(in=g2) GBM3_rocinfo(in=g3)
GBM4_rocinfo(in=g4);
length model $ 16;
select;
when (g1) model='GBM1';
when (g2) model='GBM2';
when (g3) model='GBM3';
when (g4) model='GBM4';
end;
run;
data all_liftinfo;
set GBM1_liftinfo(in=g1) GBM2_liftinfo(in=g2) GBM3_liftinfo(in=g3)
GBM4_liftinfo(in=g4);
length model $ 16;
select;
when (g1) model='GBM1';
when (g2) model='GBM2';
when (g3) model='GBM3';
when (g4) model='GBM4';
end;
run;
/* Print AUC (Area Under the ROC Curve) */
title "AUC (using validation data)";
proc sql;
select distinct model, c from all_rocinfo;
quit;
/* Draw ROC charts */
proc sgplot data=all_rocinfo aspect=1;
title "ROC Curve (using validation data)";
xaxis values=(0 to 1 by 0.25) grid offsetmin=.05 offsetmax=.05;
yaxis values=(0 to 1 by 0.25) grid offsetmin=.05 offsetmax=.05;
lineparm x=0 y=0 slope=1 / transparency=.7;
series x=fpr y=sensitivity / group=model;
run;
/* Draw lift charts */
proc sgplot data=all_liftinfo;
title "Lift Chart (using validation data)";
yaxis label=' ' grid;
series x=depth y=lift / group=model markers markerattrs=(symbol=circlefilled);
run;
title;
ods graphics off;
Appendix A2: SAS Code for Generating Reproducible Gradient Boosting Models

/************************************************************************/
/* Setup CAS session and initialize required macro variables */
/************************************************************************/
/* Define a CAS engine libref for CAS in-memory data tables */
cas mySession sessopts=(timeout=1800 locale="en_US");
libname mycas cas caslib=casuser;
/* Specify the data set inputs and target */
%let class_inputs=reason job;
%let interval_inputs=clage clno debtinc loan mortdue value yoj derog delinq ninq;
%let target=bad;
/* Specify a folder path to write the temporary output files */
%let outdir=&_SASWORKINGDIR;
/************************************************************************/
/* Load data (HMEQ) into CAS if needed. Data should already exist in */
/* CAS, and it will be loaded here if it does not exist in CAS */
/************************************************************************/
%if not %sysfunc(exist(mycas.hmeq)) %then %do;
proc casutil;
load data=sampsio.hmeq casout="hmeq" outcaslib=casuser replace;
run;
%end;
/************************************************************************/
/* Explore the data and plot missing values */
/************************************************************************/
proc cardinality data=mycas.hmeq outcard=mycas.data_card;
run;
proc print data=mycas.data_card(where=(_nmiss_>0));
title "Data Summary";
run;
data data_missing;
set mycas.data_card (where=(_nmiss_>0) keep=_varname_ _nmiss_ _nobs_);
_percentmiss_=(_nmiss_/_nobs_)*100;
label _percentmiss_='Percent Missing';
run;
proc sgplot data=data_missing;
title "Percentage of Missing Values";
vbar _varname_ / response=_percentmiss_ datalabel categoryorder=respdesc;
run;
title;
/************************************************************************/
/* Impute missing values */
/************************************************************************/
proc varimpute data=mycas.hmeq;
input clage /ctech=mean;
input delinq /ctech=median;
input ninq /ctech=random;
input debtinc yoj /ctech=value cvalues=50, 100;
output out=mycas.hmeq_prepped copyvars=(_ALL_);
code file="&outdir./impute_score.sas";
run;
/************************************************************************/
/* Setup and initialize some more macro variables */
/************************************************************************/
%let casdata=mycas.hmeq_prepped;
%let partitioned_data=mycas.hmeq_part;
%let partition2=mycas.hmeq_part2;
/************************************************************************/
/* Creating an ID variable to uniquely identify each row */
/************************************************************************/
data &casdata / single=yes; /* run on a single thread so _N_ forms a unique row ID in CAS */
set &casdata;
id=_N_;
run;
/************************************************************************/
/* Partition the data into training and validation */
/************************************************************************/
proc partition data=&casdata partition samppct=70 seed=1234;
by &target;
output out=&partitioned_data copyvars=(_ALL_);
run;
/************************************************************************/
/* Using the Partition action in PROC CAS to order the data to support */
/* APPLYROWORDER option in PROC GRADBOOST */
/************************************************************************/
proc cas;
action table.partition /
table={name="hmeq_part",
groupby={"Job"},
orderby={"id"}
},
casout={name="hmeq_part2", replace=True};
run;
/************************************************************************/
/* First GBM (GBM1) predictive model */
/************************************************************************/
/* ALL data used for training model */
proc gradboost data=&partition2 applyroworder seed=123;
input &interval_inputs / level=interval;
input &class_inputs / level=nominal;
target &target / level=nominal;
code file="&outdir./GBM1_score.sas";
run;
/************************************************************************/
/* Score the data using the generated GBM1 model score code */
/************************************************************************/
data mycas._scored_GBM1;
set &partition2;
%include "&outdir./GBM1_score.sas";
run;
/************************************************************************/
/* Assess model performance (GBM1) */
/************************************************************************/
ods exclude all;
proc assess data=mycas._scored_GBM1(where=(_partind_=0));
input p_&target.1;
target &target / level=nominal event='1';
fitstat pvar=p_&target.0/ pevent='0';
ods output fitstat=GBM1_fitstat
rocinfo=GBM1_rocinfo
liftinfo=GBM1_liftinfo;
run;
ods exclude none;
/************************************************************************/
/* Second GBM (GBM2) predictive model */
/************************************************************************/
proc gradboost data=&partition2 applyroworder seed=123;
input &interval_inputs / level=interval;
input &class_inputs / level=nominal;
target &target / level=nominal;
code file="&outdir./GBM2_score.sas";
run;
/************************************************************************/
/* Score the data using the generated GBM2 model score code */
/************************************************************************/
data mycas._scored_GBM2;
set &partition2;
%include "&outdir./GBM2_score.sas";
run;
/************************************************************************/
/* Assess tree model performance (GBM2) */
/************************************************************************/
ods exclude all;
proc assess data=mycas._scored_GBM2(where=(_partind_=0));
input p_&target.1;
target &target / level=nominal event='1';
fitstat pvar=p_&target.0/ pevent='0';
ods output fitstat=GBM2_fitstat
rocinfo=GBM2_rocinfo
liftinfo=GBM2_liftinfo;
run;
ods exclude none;
/************************************************************************/
/* Third GBM (GBM3) predictive model */
/************************************************************************/
proc gradboost data=&partition2 applyroworder seed=123;
input &interval_inputs / level=interval;
input &class_inputs / level=nominal;
target &target / level=nominal;
code file="&outdir./GBM3_score.sas";
run;
/************************************************************************/
/* Score the data using the generated GBM3 model score code */
/************************************************************************/
data mycas._scored_GBM3;
set &partition2;
%include "&outdir./GBM3_score.sas";
run;
/************************************************************************/
/* Assess tree model performance (GBM3) */
/************************************************************************/
ods exclude all;
proc assess data=mycas._scored_GBM3(where=(_partind_=0));
input p_&target.1;
target &target / level=nominal event='1';
fitstat pvar=p_&target.0/ pevent='0';
ods output fitstat=GBM3_fitstat
rocinfo=GBM3_rocinfo
liftinfo=GBM3_liftinfo;
run;
ods exclude none;
/************************************************************************/
/* Fourth GBM (GBM4) predictive model */
/************************************************************************/
proc gradboost data=&partition2 applyroworder seed=123;
input &interval_inputs / level=interval;
input &class_inputs / level=nominal;
target &target / level=nominal;
code file="&outdir./GBM4_score.sas";
run;
/************************************************************************/
/* Score the data using the generated GBM4 model score code */
/************************************************************************/
data mycas._scored_GBM4;
set &partition2;
%include "&outdir./GBM4_score.sas";
run;
/************************************************************************/
/* Assess tree model performance (GBM4) */
/************************************************************************/
ods exclude all;
proc assess data=mycas._scored_GBM4(where=(_partind_=0));
input p_&target.1;
target &target / level=nominal event='1';
fitstat pvar=p_&target.0/ pevent='0';
ods output fitstat=GBM4_fitstat
rocinfo=GBM4_rocinfo
liftinfo=GBM4_liftinfo;
run;
ods exclude none;
/*************************************************************************/
/* Create ROC and Lift plots (all models) using validation data */
/*************************************************************************/
ods graphics on;
data all_rocinfo2;
set GBM1_rocinfo(in=g1) GBM2_rocinfo(in=g2) GBM3_rocinfo(in=g3)
GBM4_rocinfo(in=g4);
length model $ 16;
select;
when (g1) model='GBM1';
when (g2) model='GBM2';
when (g3) model='GBM3';
when (g4) model='GBM4';
end;
run;
data all_liftinfo2;
set GBM1_liftinfo(in=g1) GBM2_liftinfo(in=g2) GBM3_liftinfo(in=g3)
GBM4_liftinfo(in=g4);
length model $ 16;
select;
when (g1) model='GBM1';
when (g2) model='GBM2';
when (g3) model='GBM3';
when (g4) model='GBM4';
end;
run;
/* Print AUC (Area Under the ROC Curve) */
title "AUC (using validation data)";
proc sql;
select distinct model, c from all_rocinfo2;
quit;
/* Draw ROC charts */
proc sgplot data=all_rocinfo2 aspect=1;
title "ROC Curve (using validation data)";
xaxis values=(0 to 1 by 0.25) grid offsetmin=.05 offsetmax=.05;
yaxis values=(0 to 1 by 0.25) grid offsetmin=.05 offsetmax=.05;
lineparm x=0 y=0 slope=1 / transparency=.7;
series x=fpr y=sensitivity / group=model;
run;
/* Draw lift charts */
proc sgplot data=all_liftinfo2;
title "Lift Chart (using validation data)";
yaxis label=' ' grid;
series x=depth y=lift / group=model markers markerattrs=(symbol=circlefilled);
run;
title;
ods graphics off;