
Build Bagging Trees with SAS® Programming

Paper 1054-2021
Authors 

Weibo Wang, Libo Lu, and Gary Liu, TELUS Communications Inc.

Abstract

Bootstrap aggregating (Bagging) has been widely recognized as an effective strategy for reducing the variance of predictions based on low-bias predictors, such as decision trees. In this paper, the authors show a straightforward implementation of bagged tree predictors with the SAS® HPSPLIT procedure and compare its performance against other procedures, such as the SAS® HPFOREST procedure, as well as some popular open-source packages. This practice demonstrates a simple and easy way to effectively enhance the performance of ML algorithms at a relatively low cost.

Watch the presentation

Watch Build Bagging Trees with SAS® Programming as presented by the authors on the SAS Users YouTube channel.

 

INTRODUCTION

With SAS® Viya, the FOREST procedure can perform bagging for your data easily because it has the INBAGFRACTION option, which specifies the fraction of observations to sample with replacement into each bagged sample. If you use SAS® Enterprise Miner, the HPFOREST procedure can build a Random Forest by bagging your data. If you have neither, this paper shows how to create Bagging Trees yourself.
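For readers on SAS® Viya, a minimal sketch of that option is shown below. The CAS libref mycas, the variable lists, and the option values are placeholders rather than settings from this paper; verify them against the FOREST procedure documentation for your release.

**************** Bagging inside PROC FOREST on SAS Viya (sketch). *******************;
* mycas is an assumed CAS libref; the input variables are example columns from the UCI Bank Marketing data.;
proc forest data=mycas.train_data ntrees=100 inbagfraction=0.6 seed=12345;
  target y / level=nominal;
  input job marital education / level=nominal;
  input age duration campaign / level=interval;
run;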

SAS/STAT® is beloved software within the data science community, providing comprehensive and powerful functionality ranging from traditional analysis of variance to advanced predictive analytics and visualization. Procedures such as PROC STDIZE, PROC SURVEYSELECT, and PROC FREQ are statisticians' favorite tools for daily data manipulation, whereas others such as PROC LOGISTIC, PROC LIFETEST, and PROC NLMIXED have served as industry-wide standards for decades. SAS/STAT® 14.1 brings these analytic capabilities to a whole new level for handling massive data and missing-data challenges with new tools such as the HPSPLIT procedure.

The decision tree has been a popular analytic approach for decades, thanks to its intuitive logic, interpretability, and flexibility with input data. However, tree methods are prone to overfitting and unstable performance. These disadvantages call for strategies for improvement, especially in practical machine learning pipelines. Although the performance of a decision tree can be sensitive to small changes in the data, and the training of a single tree requires close attention to avoid overfitting, researchers have never tired of inventing new ideas for improvement from various perspectives. The ensemble strategy has arguably been the most favored choice in the last 10 years, thanks to the rapid development of computational power.

Bootstrap aggregating, also known as Bagging, is a well-studied ensemble learning strategy that improves the stability and accuracy of relatively simple modeling methods by reducing variance and thus the risk of overfitting. Jeremy Keating (2017) shared his experience of using PROC HPSPLIT to build hundreds of trees and aggregate their predictions, aiming at better prediction. However, that setting is fundamentally different from the commonly recognized setting of Random Forest, and the information revealed by the results on that webpage is limited.

To better understand the potential of the HPSPLIT procedure, the authors tested a few similar but distinct approaches based on decision trees and compared their performance on the Bank Marketing data available from the UCI Machine Learning Repository (Moro et al., 2014) at http://archive.ics.uci.edu/ml/datasets/Bank+Marketing. The data contains 41,188 records from direct marketing campaigns of a Portuguese banking institution; 70% of the records were randomly selected as training data, and the remaining 30% were used to validate the results of each method. The authors aim to provide a more comprehensive report on the modeling capabilities of the HPSPLIT procedure, which hopefully will resonate with peers who share their enthusiasm for this new procedure and for SAS® in general.

PART 1: BAGGING WITH PROC SURVEYSELECT 

Bagging generates n new training sets by sampling from the original training set uniformly and with replacement. Because sampling is done with replacement, some observations may be repeated within each new training set. The SURVEYSELECT procedure offers the URS (Unrestricted Random Sampling) method, which selects records with replacement. The following macro created by the authors builds n new training datasets from the original training data by sampling with replacement:

**************** Generate n new training sets. *******************;
%macro bagging(n=);
  %do i = 1 %to &n. ;
    * Derive a pseudo-random seed (1-1000) for this iteration from todays date and the loop index.;
    data _null_;
      x = ceil(uniform(today()*&i)*1000);
      call symputx('seed', x);
    run;

    * METHOD=URS samples with replacement; SAMPRATE=1 keeps the bootstrap sample;
    * the same size as the original, and OUTHITS writes one observation per selection.;
    proc surveyselect data=train_data out=bag_&i.
                      seed=&seed.
                      method=urs
                      samprate=1
                      outhits;
    run;
  %end;
%mend bagging;

**************** Create 5 new training sets. *******************;
%bagging(n=5);

The following table shows the counts for the training datasets. Around 63% of the unique records appear in each new dataset (in line with the roughly 1 − 1/e ≈ 63.2% expected for bootstrap sampling), and some records were picked as many as 7 or 8 times in a single bagged dataset. We can build a tree on each of these bagged datasets and then average the scores across trees to reduce the variance.

 

Dataset                    Number of Unique Customers    Max Number of Hits
Original Training Data     28655                         1
Bagging Training Data 1    18215                         7
Bagging Training Data 2    18085                         7
Bagging Training Data 3    18113                         8
Bagging Training Data 4    18030                         7

Table 1. Counts of unique customers in the original and bagged training datasets
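The counts in Table 1 can be reproduced with a query like the following sketch. It assumes the CUST_ID row key described in Part 2 is already on the data; NumberHits is the hit-count variable that PROC SURVEYSELECT adds for METHOD=URS.

**************** Verify the counts for one bagged sample (sketch). *******************;
proc sql;
  select count(distinct CUST_ID) as Unique_Customers,
         max(NumberHits)         as Max_Number_of_Hits
  from bag_1;
quit;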

 

PART 2: BAGGING TREES WITH PROC HPSPLIT  

In this part, the authors demonstrate how to plug the HPSPLIT procedure in after the SURVEYSELECT procedure to build Bagging trees.

Decision trees are available in SAS® through the HPSPLIT procedure, which, simply put, is a truly brilliant procedure. PROC HPSPLIT is a high-performance procedure that builds tree-based statistical models for classification and regression.

First of all, a folder needs to be created to keep the SAS® DATA step files generated by the CODE statement of each tree. Using them to score other data is simple: just %include the file in a DATA step, and the probability scores will appear in the output data.
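For example, a minimal sketch of that scoring step (the folder C:\bagging_trees and the dataset new_data are placeholders; bag_1.sas is the file written by the CODE statement shown later):

**************** Score a dataset with one saved tree (sketch). *******************;
data scored_new;
  set new_data;                              /* any dataset with the same input variables */
  %include "C:\bagging_trees\bag_1.sas";     /* adds the predicted probability columns    */
run;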

One advantage of PROC HPSPLIT is that we do not need to convert character variables into dummy variables. We can retrieve all variable names with the CONTENTS procedure and pass them all to HPSPLIT, which selects the most useful predictors based on variable importance. The following code simply generates macro variables holding the numeric and character variable names.

**************** Retrieve variables. *******************;
proc contents data=train_data noprint
              out=data_info (keep=name varnum type);
run;

* Character variables (type=2), excluding the target Y;
proc sql;
  select name into :c_var separated by ' '
  from data_info
  where type=2 and upcase(name) ne 'Y';
quit;

* Numeric variables (type=1), excluding the row key CUST_ID;
proc sql;
  select name into :n_var separated by ' '
  from data_info
  where type=1 and upcase(name) ne 'CUST_ID';
quit;

In Part 1, the authors have shown the macro for bagging training datasets. After adding the code as shown below into the do loop, each new training dataset will be used to build a tree.

**************** Bagging Trees. *******************;
%macro bagging(n=);
  %do i = 1 %to &n. ;
1     data _null_;
2       x = ceil(uniform(today()*&i)*1000);
3       call symputx('seed', x);
4     run;

5     proc surveyselect data=train_data out=bag_&i.
6                       seed=&seed.
7                       method=urs
8                       samprate=1
9                       outhits;
10    run;

11    proc hpsplit data=bag_&i. ;
12      prune none;
13      target y;
14      input &c_var.;
15      input &n_var.;
16      output out=scored;
17      code file="Yourpath\bag_&i..sas";
18      output importance=var_imp_&i;
19    run;

20    data valid&i.;
21    set test_data;
22      %include "Yourpath\bag_&i..sas";
23    run;

24    proc append base=valid data=valid&i. force; run;
25    proc append base=var_imp data=var_imp_&i. force; run;
  %end;
%mend bagging;

As shown above, the CODE statement (row no. 17) converts the final tree into SAS® DATA step code that can be used for scoring other data later; the code is written to the file specified by the filename. We also include the validation data in the macro when building the trees (rows no. 20 - 23). All the scored data are appended to the dataset 'valid', which we can use later to calculate the average probability across all trees for regression, or to take the majority vote across trees for classification (row no. 24).

We also output the feature importance details and append the details from each tree into the dataset ‘var_imp’ (row no. 25). We can calculate the average importance of all the trees to do feature selection or to explain the model if necessary.
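As a hedged sketch of that averaging step, something like the following could be used. The column names Variable and Importance are assumptions; check the actual names in the var_imp dataset produced in your environment.

**************** Average variable importance across trees (sketch). *******************;
* Column names are assumptions; adjust them to match the var_imp dataset.;
proc sql;
  create table avg_importance as
  select Variable,
         count(*)        as N_Trees,
         avg(Importance) as Avg_Importance
  from var_imp
  group by Variable
  order by Avg_Importance desc;
quit;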

Before splitting the original dataset into training and test datasets, the authors created a new variable 'CUST_ID' containing a unique number for each row, which is used later as the key to calculate the average score across all bagging results.
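One way to do this is sketched below; this is not the authors' exact code, bank_raw is a placeholder name for the imported UCI data, and the seed is arbitrary.

**************** Add a row key and split the data 70/30 (sketch). *******************;
data bank_all;
  set bank_raw;            /* hypothetical name for the imported Bank Marketing data */
  CUST_ID = _n_;           /* unique row number, used later as the join key          */
run;

* OUTALL keeps every record and flags the 70% sample in the Selected variable.;
proc surveyselect data=bank_all out=bank_split samprate=0.7 seed=20210416 outall;
run;

data train_data test_data;
  set bank_split;
  if Selected then output train_data;
  else output test_data;
run;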

The following code shows how to score other data and calculate the average score. With the scores aggregated from all the bagging trees, we can calculate the average score for each record using the key CUST_ID.

**************** Calculate Average score. *******************;
%macro scoring(n=, your_data=);
  %do i = 1 %to &n. ;
    data scoring&i.;
    set &your_data.;
      %include "Yourpath\bag_&i..sas";
    run;

    proc append base=scoring data=scoring&i. force; run;
  %end;
%mend scoring;

%scoring(n=20, your_data=test_data);

* P_YYE is the predicted probability column created by the tree score code;
* (its name depends on the target variable and event level).;
proc sql;
  create table average_score as
  select Cust_ID, Y, count(*) as Counts, avg(P_YYE) as Ave_Score
  from scoring
  group by Cust_ID, Y;
quit;
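For classification, the same appended dataset can also be used for the majority vote mentioned earlier. A sketch follows; it assumes a tree votes 'yes' whenever its P_YYE probability exceeds 0.5.

**************** Majority vote across trees (sketch). *******************;
proc sql;
  create table vote_score as
  select Cust_ID, Y,
         sum(P_YYE > 0.5) as Yes_Votes,
         count(*)         as N_Trees,
         case when calculated Yes_Votes > calculated N_Trees / 2
              then 'yes' else 'no' end as Voted_Class
  from scoring
  group by Cust_ID, Y;
quit;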

PART 3: COMPARISON WITH RANDOMFORESTCLASSIFIER  

The HPSPLIT procedure builds tree-based algorithms for supervised learning. Unlike many open-source packages, which require all inputs to be numeric, this procedure naturally works with both categorical and continuous predictors, for both classification and regression problems. Compared with other typical statistical learning methods, the advantages of decision trees include easy visualization and interpretation, as well as the capability of handling missing data.

For comparison, we built two decision trees using the HPSPLIT procedure, one with and one without pruning, and another two trees using the Python scikit-learn package, again with and without pruning. All four trees share the same maximum depth, maximum branch, and other settings. The SAS® side of this comparison might look like the sketch below.
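A hedged sketch of the two HPSPLIT calls; the MAXDEPTH=, MAXBRANCH=, and SEED= values are illustrative, not the authors' exact settings.

**************** Single trees with and without pruning (sketch). *******************;
proc hpsplit data=train_data maxdepth=10 maxbranch=2 seed=12345;
  target y;
  input &c_var.;
  input &n_var.;
  prune costcomplexity;                          /* pruned tree   */
  code file="Yourpath\single_tree_pruned.sas";
run;

proc hpsplit data=train_data maxdepth=10 maxbranch=2 seed=12345;
  target y;
  input &c_var.;
  input &n_var.;
  prune none;                                    /* unpruned tree */
  code file="Yourpath\single_tree_unpruned.sas";
run;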

Figure 1. Comparison of Single Trees from SAS® and Python SKLearn

 

Figure 1 shows the ROC curves and AUC values of the four trees applied to the 30% validation data. The results from both tools are very close; the main difference comes from whether or not pruning was applied, rather than from which tool was used.
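As a side note, the AUC of any scored validation dataset can be checked directly in SAS®. A sketch using the averaged bagging score from Part 2 follows; the event level 'yes' is an assumption about how the target is coded in your data.

**************** AUC of a pre-computed score (sketch). *******************;
* NOFIT skips model fitting; the ROC statement with PRED= evaluates the supplied probability.;
proc logistic data=average_score;
  model Y(event='yes') = Ave_Score / nofit;
  roc 'Bagging trees' pred=Ave_Score;
run;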

The following figure compares the results of the Bagging trees from HPSPLIT with the results of a Random Forest built in Python. Random Forest is one of the most popular algorithms in machine learning and predictive modeling because of its great performance. It is a tree-based ensemble algorithm that builds a number of single trees, from several to hundreds, and aggregates their outputs as the final result. To maximize predictive power, the similarity among trees must be minimized; that is, the trees need to be as independent of each other as possible, because a more diverse (less correlated) collection of trees gives a better ensemble prediction. To help build distinct trees, bagging is used in Random Forests too. The fundamental difference is that in a Random Forest only a random subset of the features is considered at each node, and the best split is chosen from that subset, whereas in bagging all features are considered when splitting a node.
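If SAS® Enterprise Miner is available, this feature-subsampling idea is exposed in PROC HPFOREST. The sketch below is hedged: the option names reflect our reading of the HPFOREST documentation and should be verified in your release, and the values are illustrative.

**************** Random Forest with PROC HPFOREST (sketch). *******************;
* VARS_TO_TRY= controls how many candidate features are considered at each split.;
proc hpforest data=train_data maxtrees=100 vars_to_try=5 seed=12345;
  target y / level=binary;
  input &c_var. / level=nominal;
  input &n_var. / level=interval;
run;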

Figure 2. Comparison of Bagging Trees from SAS® and Random Forest from SKLearn

 

First of all, let's look at the performance of a single tree with a pruning strategy applied. This is perhaps the simplest tree model that anyone could try, and it delivers pretty decent performance almost immediately.

If we want to push for better performance by building low-bias trees and averaging out the variance with bagging, the strategy already works with a small number of bootstrap samples, such as 5. When we increase the number of bootstrap samples to a moderate number, such as 20 or 50, the performance improves significantly over a single tree.

The remaining methods are various Random Forest configurations, built in a similar fashion to our bagging trees. As we can see, their performance is also very close across the various scenarios.

Therefore, we are confident that our bagging strategy can help you achieve best-in-class predictive performance in the computing environment you are already familiar with, at almost no extra cost.

CONCLUSION 

Now we can see that this method has many advantages: it requires no extra cost; it performs well, both in predictive accuracy and in its ability to handle very large data; and it is easy to implement in real work.

But it still has a disadvantage. With 1 million records, 1,000 variables, and 500 trees, it could take a while for a single core to complete even the first iteration. We could use multiple cores and leverage parallel computing to speed things up, which is one of the fundamental advantages of bagging; for this, we highly recommend SAS® Viya, which takes care of the computational resources for you. We could also use various variable selection approaches to come up with a better model for practical implementations.

For this scenario, what we usually do is compare algorithms built on different numbers of trees and different numbers of features to find the optimal cut-off. For example, at the beginning we use all the features to build several models with different numbers of trees. We then use the top 200 common features from the previous models and repeat the development, then perhaps the top 50 features from that step, and iterate until a significant reduction in performance is observed. Among the algorithms with equivalent performance, we prefer the one with the simpler structure and fewer features, for better model interpretability and easier maintenance going forward. One such feature-selection step is sketched below.
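A hedged sketch of selecting the top features by average importance; it reuses the avg_importance table from the earlier sketch, and the cut-off of 200 and the column name Variable are assumptions.

**************** Keep the top 200 features by average importance (sketch). *******************;
%let topk = 200;
proc sql;
  select Variable into :top_var separated by ' '
  from avg_importance(obs=&topk.);   /* avg_importance is ordered by Avg_Importance descending */
quit;
* &top_var. can now replace the full variable lists in the next round of tree building.;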

The authors implemented a few mainstream ensemble strategies based on decision trees in SAS® and examined the performance of the algorithms. The comparison results suggest:
1. The HPSPLIT procedure is a powerful and easy-to-use one-stop solution for developing decision tree algorithms that is friendly to both data scientists and analysts with less experience;
2. The default pruning strategy implemented in the HPSPLIT procedure offers an amazing capability to increase prediction performance, which is strong evidence that some traditional modeling wisdom deserves more study and practice;
3. Our example shows that Bagging achieves better performance than pruning with as few as 5 bootstrap samples, demonstrating the effectiveness of the strategy for making more accurate and smoother (granular) predictions. Bagging is a great ensemble method when a Random Forest procedure is not available in the reader's computing environment, and cloud computing only makes the implementation easier than ever before;
4. The overall modeling capability and performance of SAS® maintain its leadership in the data science industry.

References

[1] SAS Institute Inc. SAS/STAT® 14.1 User's Guide: The HPSPLIT Procedure. https://support.sas.com/documentation/onlinedoc/stat/141/hpsplit.pdf.

[2] SAS Institute Inc. SAS/STAT® 15.2 User's Guide. https://documentation.sas.com/?docsetId=statug&docsetTarget=titlepage.htm&docsetVersion=15.2&locale=....

[3] Breiman, L. 1996. "Bagging Predictors." Machine Learning, 24: 123–140.

[4] Breiman, L. 2001. "Random Forests." Machine Learning, 45: 5–32.

[5] Hastie, T., Tibshirani, R., and Friedman, J. 2016. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. Springer Series in Statistics. New York: Springer.

[6] Keating, J. 2017. "Random Forests in Base SAS." https://www.linkedin.com/pulse/random-forests-base-sas-jeremy-keating-bsc-fia/.

[7] Moro, S., Cortez, P., and Rita, P. 2014. "A Data-Driven Approach to Predict the Success of Bank Telemarketing." Decision Support Systems, 62: 22–31.

 
