Outliers have always been an issue for data analysis. Much effort has been spent on identifying outliers and remedying their effects on data analysis. But with the advent of machine learning techniques, do we still need to worry about outliers? The purpose of this post is to discuss a few issues regarding outliers and their effects on neural network models.
Let’s start with a small dataset and a known relationship. Doing this ensures that we can create a true outlier. The following code creates a 19-observation dataset with an input variable x drawn uniformly from 0 to 2, an error term from a normal distribution with mean 0 and standard deviation 0.02, and a y variable based on the function y = 0.2 + 0.6*x - 0.86*x**2 + 0.28*x**3 + error.
data curve;
   call streaminit(8675309);
   do i=1 to 19;
      x=rand('uniform')*2;
      error=rand('normal')*0.02;
      y=0.2 + 0.6*x - 0.86*x**2 + 0.28*x**3 + error;
      output;
   end;
run;
title1 'Plot of Original Data';
proc sgplot data=curve;
   scatter x=x y=y;
run;
Because the error term is small, a plot of the data shows the relationship.
This data will be modified with a single outlier. But not all outliers are the same, so we will add that single outlier in three different ways. The first outlier will be at an x value of 2.1, which is higher than any other x value, but its y value will follow the same formula as the original data. Outlier number 2 will also have an x value of 2.1, but its y value will NOT follow the rest of the data trend; it will be much lower than the trend. Finally, outlier number 3 will have an x value of 1.0, right in the middle of the x range, but its y value will be much larger than the formula would give. This code generates the three outliers, each of which will be joined to the original 19 points.
data outlier1; /* Outlier that follows the same trend as the data */
   x=2.1;
   error=rand('uniform')*0.025;
   y=0.2 + 0.6*x - 0.86*x**2 + 0.28*x**3 + error;
   output;
run;

data outlier2; /* Outlier that does NOT follow the trend of the data */
   x=2.1;
   error=rand('uniform')*0.025;
   y=0.1 + error;
run;

data outlier3; /* Outlier in the middle of the x-range that does not follow the data trend */
   x=1;
   y=0.45;
run;
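Each outlier then needs to be joined to the original 19 points, which a simple concatenation step handles. The sketch below uses the dataset name followtrend to match the later PROC DMDB call; the other two names (breaktrend and middle) are assumptions for illustration.

```
/* Append each outlier to the original data; followtrend matches the  */
/* PROC DMDB call below, the other two dataset names are placeholders */
data followtrend;
   set curve outlier1;
run;

data breaktrend;
   set curve outlier2;
run;

data middle;
   set curve outlier3;
run;
```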
Here are the plots showing these outliers with the original data.
How would a neural network handle building a model with each of these scenarios? And would the fact that these are small datasets cause a problem? Let’s try fitting a neural network model to each of these situations to see if the outlier causes the algorithm any problems. This is not a large problem, so I will use PROC NEURAL with 5 hidden units in the hidden layer and many of the default options.
title1 'Neural Network when Outlier Follows the Trend';
proc dmdb data=followtrend dmdbcat=catlog;
   var x y;
run;

proc neural data=followtrend dmdbcat=catlog;
   input x / level=interval id=inputs;
   hidden 5 / id=HU1;
   target y / level=interval id=target;
   connect inputs HU1;
   connect HU1 target;
   train
      maxiter=100
      estiter=1
      outest=weights
      outfit=stats
      out=Predictions;
quit;
I won’t bother looking at the fitting results. Instead, I will just focus on a graph that plots x versus y overlaid with the neural network predictions (in red with filled dots).
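A graph like that can be produced from the OUT=Predictions dataset created by the TRAIN statement. The sketch below assumes the predicted value is stored under PROC NEURAL's default name P_y; adjust the variable name if your output differs.

```
title1 'Data with Neural Network Predictions';
proc sgplot data=Predictions;
   scatter x=x y=y;                                                  /* original data  */
   scatter x=x y=P_y / markerattrs=(symbol=circlefilled color=red);  /* NN predictions */
run;
```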
As expected, this neural network seems to fit the data well, and can predict the outlier quite well. So in this situation, the outlier does not seem to have any impact on the fitting results of this neural network.
Let’s move on to the next situation where the x value is 2.1, but the y value does not follow the rest of the data trend. As before, we will not worry about the actual fitting results and just focus on the graph of the results.
Here we see that the neural network model is pulled off course by the outlier. Neural networks are very flexible and, as a result, can be more sensitive to outliers: they may be reacting to noise. Keep in mind that in this situation the single outlier makes up 5% of the data. Later we will see what happens when the neural network has a larger dataset.
Let’s visit our last scenario now with the outlier in the middle of the x range.
Even with the outlier “surrounded” by data points with similar x values, the neural network model is significantly affected by it. From these scenarios, it seems that a neural network model is sensitive to outliers. But as mentioned earlier, perhaps this is really due to such a small dataset; machine learning models typically work better with more data. Let’s try the same situations, but with 1999 observations in the original data rather than 19, so the outlier is only 0.05% of the total dataset. Following the same approach, here are the three neural network models for these three scenarios.
As expected, when the outlier follows the trend, the neural network follows the signal in the data quite well.
When the outlier does not follow the trend, the neural network model still looks pretty good. It looks like the outlier does not really impact the results at all, but if we zoom in on the area where this outlier is, the story is slightly different.
When you zoom in, you can see that the neural network model is actually still biased because of this one outlier (which is in the lower right on the graph). There are more data points above the neural network prediction line than below it as the x value increases. The results may not be too bad, but even with only one outlier out of the 2000 observations, the model is affected.
You may not be surprised by this result. After all, the x value of 2.1 will have pretty high leverage since it is a larger x value than any other observation. Let’s look at the third scenario to see what happens.
Finally, this looks pretty good. The outlier is at the very top of the graph, but none of the neural network predictions seem to be affected by that point. Let’s zoom in around 0.95 to 1.05 to just make sure there is not a hidden problem.
Even when zoomed in, there does not seem to be a problem with the neural network model.
To summarize, outliers can be a problem for a neural network model, especially when the dataset is small. Even with a larger dataset, if the point is an outlier in the x space AND does not follow the trend of the rest of the data, the neural network model will still likely be affected. If there are other observations in the input space near the outlier, the impact on the results will likely be minimal. Ultimately, even with a neural network model, care should be taken to identify and possibly mitigate outliers. Possible actions include examining the data and using a more robust error function for the neural network. In a future post I will look into the impact outliers have on a decision tree.
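As a sketch of that last suggestion: the TARGET statement in PROC NEURAL accepts an ERROR= option, and a robust error function such as the Huber M-estimate downweights large residuals so that a single wild point pulls the fit less. The snippet below reuses the earlier dataset and catalog names; check your release's PROC NEURAL documentation for the exact list of supported error functions.

```
proc neural data=followtrend dmdbcat=catlog;
   input x / level=interval id=inputs;
   hidden 5 / id=HU1;
   /* ERROR=HUBER requests a robust M-estimation error function */
   target y / level=interval id=target error=huber;
   connect inputs HU1;
   connect HU1 target;
   train maxiter=100;
quit;
```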
Find more articles from SAS Global Enablement and Learning here.