Any data point with an unusual or unexpected value can be considered an outlier. Outliers typically cause problems for statistical analyses. But with the advent of machine learning techniques and larger datasets, do we still need to worry about outliers? The answer depends on the machine learning algorithm being used. In a previous post, I discussed how outliers might affect a neural network model. In this post we will explore the effect outliers have on a tree model. Specifically, we will be looking at a regression tree, meaning that we will use a tree model to predict a continuous target variable.
To see how outliers might affect a regression tree, we will generate some data that follows an s-shaped curve. The data will have some error associated with it, so we would not expect any model to provide a perfect prediction, and there will be only 19 observations. These observations will be group 0. To this base data, one of three outliers will be added. The first outlier follows the same trend as the rest of the data but has a large x-value; it appears as the diamond in the plot of the data and is group 1. The second outlier does not follow the trend of the data and also has a large x-value; it appears as the square in the plot and is group 2. Finally, the third outlier occurs in the middle of the x-space and sits far off the curve of the rest of the data; it appears as the triangle in the plot and is group 3.
The code used to generate this data is below.
/* Creating the Original 19 Observations */
data casuser.curve;
   call streaminit(8675309);
   do i=1 to 19;
      x=rand('uniform')*2;
      error=rand('uniform')*0.025;
      group=0;
      y=0.2 + 0.6*x - 0.86*x**2 + 0.28*x**3 + error;
      output;
   end;
run;
/* Creating the Three Different Outlier Datasets that will be merged with the original data */
data casuser.outlier1; /* Outlier that follows the same trend as the data */
   x=2.1;
   error=rand('uniform')*0.025;
   y=0.2 + 0.6*x - 0.86*x**2 + 0.28*x**3 + error;
   group=1;
run;

data casuser.outlier2; /* Outlier that does NOT follow the trend of the data */
   x=2.1;
   error=rand('uniform')*0.025;
   y=0.1 + error;
   group=2;
run;

data casuser.outlier3; /* Outlier in the middle of the x-space and does not follow data trend */
   x=1;
   error=rand('uniform')*0.025;
   y=0.45;
   group=3;
run;
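The plot described above, with the base data as circles and the three outliers as the diamond, square, and triangle, can be recreated by stacking the four data sets and plotting by group. This is a minimal sketch; the stacked data set name allpoints is my own choice.

* Plotting the Base Data with All Three Outlier Candidates;
data allpoints;
   set casuser.curve casuser.outlier1 casuser.outlier2 casuser.outlier3;
run;

ods graphics / attrpriority=none;
proc sgplot data=allpoints;
   styleattrs datasymbols=(CircleFilled DiamondFilled SquareFilled TriangleFilled);
   scatter X=x Y=y / group=group;
run;
ods graphics / attrpriority=color;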
Now that we have some data, we can start fitting some models.
I will use PROC HPSPLIT with its default settings to build the models, except that I will limit the tree to a depth of 2 levels. Without a validation set, a tree will continue to split until the number of leaves is close to or equal to the number of observations. I want to limit the depth and see how the observations are distributed among a small number of leaves. Because the depth is capped and pruning is turned off, every tree will have a depth of 2 and a total of four leaves.
* Fitting the Model to the Original Data;
proc hpsplit data=casuser.curve minleafsize=1 maxdepth=2;
   id x;
   model y=x;
   prune off;
   output out=hpsplout;
run;

proc print data=hpsplout;
run;

* Plotting the Raw Data and Overlaying Predicted Values;
proc sgplot data=hpsplout;
   scatter X=x Y=y;
   scatter X=x Y=P_y / markerattrs=(symbol=asterisk color=red);
run;
Although I have explicitly specified a minimum leaf size of 1, that is already the default for PROC HPSPLIT. The ID statement ensures that the x values (and, in the later runs, the group number) are stored along with the predictions in the hpsplout data set. Pruning is turned off so that every tree model keeps its two levels and four leaves. Finally, we plot the data along with the predictions, shown as red asterisks, using PROC SGPLOT. The PROC HPSPLIT output indicates that the final tree has 4 leaves and that 19 observations were used.
The tree structure shows the four final leaves. For our purposes here, notice the number of observations in each leaf: 5, 5, 1, and 8. So other than the single-observation leaf, which is node 5, the 19 observations are fairly evenly distributed. This can also be seen in the plot of the data overlaid with the predictions.
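If you would rather not read these counts off the tree diagram, the scored data set created by the OUTPUT statement should also include a _Leaf_ column identifying each observation's terminal node (check your output data set to confirm the column is present in your release), so a quick PROC FREQ can tally the leaves:

* Tallying Observations per Leaf from the Scored Data Set;
proc freq data=hpsplout;
   tables _Leaf_;
run;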
Now that we have seen how a regression tree fits our pattern, we will add an outlier. Let's start by adding the outlier that has a larger x-value but still follows the trend in the data: outlier 1, mentioned earlier.
/* Creating and Analyzing the Data with an Outlier that Follows the Trend */
title1 'Original Data Plus an Outlier that Follows the Trend';
data casuser.followtrend;
   set casuser.curve casuser.outlier1;
run;

proc print data=casuser.followtrend;
run;

ods graphics / attrpriority=none;
* Plotting the Original Data and the Outlier;
proc sgplot data=casuser.followtrend;
   styleattrs datasymbols=(CircleFilled DiamondFilled);
   scatter X=x Y=y / group=group;
run;
ods graphics / attrpriority=color;
* Fitting the Model with Outlier;
proc hpsplit data=casuser.followtrend minleafsize=1 maxdepth=2;
   id group x;
   model y=x;
   prune off;
   output out=hpsplout;
run;

proc print data=hpsplout;
run;

* Plotting Data with Overlaid Predictions;
proc sgplot data=hpsplout;
   scatter X=x Y=y / group=group;
   scatter X=x Y=P_y / markerattrs=(symbol=asterisk color=red);
run;
Only a portion of the output is shown here, but you can see that the model has a leaf with only one observation. Looking at the plot of the data overlaid with the predictions, we can see that the outlier is the point that was split off on its own. Because the outlier is isolated in its own leaf node, it barely affects the rest of the model.
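As a quick confirmation, you can print the outlier's row from the scored data set and compare its _Leaf_ value to the other observations. This sketch assumes the _Leaf_ column noted earlier is present:

* Confirming the Outlier (group=1) Occupies Its Own Leaf;
proc print data=hpsplout;
   where group=1;
   var group x y P_y _Leaf_;
run;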
Perhaps if we use outlier number 2, the one with a large x-value that does NOT follow the trend of the rest of the data, the results will be different. The code for fitting a model to that data, along with selected output, is below.
/* Creating/Analyzing Data with an Outlier that does NOT follow the trend */
title1 'Original Data Plus an Outlier that Does NOT Follow Trend';
data casuser.offtrend;
   set casuser.curve casuser.outlier2;
run;

ods graphics / attrpriority=none;
proc sgplot data=casuser.offtrend;
   styleattrs datasymbols=(CircleFilled SquareFilled);
   scatter X=x Y=y / group=group;
run;
ods graphics / attrpriority=color;
proc hpsplit data=casuser.offtrend minleafsize=1 maxdepth=2;
   id group x;
   model y=x;
   prune off;
   output out=hpsplout;
run;

proc print data=hpsplout;
run;

proc sgplot data=hpsplout;
   scatter X=x Y=y / group=group;
   scatter X=x Y=P_y / markerattrs=(symbol=asterisk color=red);
run;
This model also has a leaf node with only one observation, and as in the fit to the original data, that leaf contains the observation in the middle of the x-space. So this model is very similar to the one built on the original data with no outlier. In fact, the presence of the outlier in this situation does not affect the model much at all either.
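If you want to locate every single-observation leaf without reading the tree diagram, a small PROC SQL query against the scored data set will list them; again, this assumes the _Leaf_ column is available:

* Listing Observations That Sit Alone in a Leaf;
proc sql;
   select group, x, y, P_y, _Leaf_
      from hpsplout
      where _Leaf_ in
         (select _Leaf_ from hpsplout group by _Leaf_ having count(*)=1);
quit;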
One last situation to consider is the outlier 3 case, where the outlier sits in the middle of the x-space. As before, here is the code for this situation and some selected output.
/* Creating and Analyzing Data with an Outlier that is in the Middle of the X Space */
title1 'Original Data Plus an Outlier in Middle of X Space';
data casuser.midx;
   set casuser.curve casuser.outlier3;
run;

ods graphics / attrpriority=none;
proc sgplot data=casuser.midx;
   styleattrs datasymbols=(CircleFilled TriangleFilled);
   scatter X=x Y=y / group=group;
run;
ods graphics / attrpriority=color;
proc hpsplit data=casuser.midx minleafsize=1 maxdepth=2;
   id group x;
   model y=x;
   prune off;
   output out=hpsplout;
run;

proc print data=hpsplout;
run;

proc sgplot data=hpsplout;
   scatter X=x Y=y / group=group;
   scatter X=x Y=P_y / markerattrs=(symbol=asterisk color=red);
run;
For this situation, the model has two leaf nodes that each contain only a single observation. Looking at the plot, we see that the mid-x observation that follows the trend is in a leaf by itself, and the other single-observation leaf holds the outlier. The end result is again quite similar to the model fit to the original data: the outlier is "quarantined" in its own node, which limits its influence on the model.
So ultimately, a tree model is quite robust to outliers. An outlier will often be isolated in a node of its own, which limits its influence on the remainder of the model. Be aware, however, that if you increase the minimum leaf size above 1, outliers become more likely to influence the model because they are forced to share a leaf with non-outlying observations.
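To see this effect for yourself, one quick experiment, offered here as a sketch rather than part of the analysis above, is to refit the mid-x outlier data with MINLEAFSIZE=2. The outlier can no longer be isolated in a leaf of size 1, so it pulls the prediction of whichever leaf it joins toward itself:

* Refitting the Mid-X Outlier Data with a Larger Minimum Leaf Size;
proc hpsplit data=casuser.midx minleafsize=2 maxdepth=2;
   id group x;
   model y=x;
   prune off;
   output out=hpsplout2;
run;

proc sgplot data=hpsplout2;
   scatter X=x Y=y / group=group;
   scatter X=x Y=P_y / markerattrs=(symbol=asterisk color=red);
run;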
Find more articles from SAS Global Enablement and Learning here.