
Outliers and Neural Network Models: Should We Be Concerned?


 

Outliers have always been an issue for data analysis. Much effort has been spent on identifying outliers and remedying their effects. But with the advent of machine learning techniques, do we still need to worry about outliers? The purpose of this post is to discuss a few issues regarding outliers and their effects on neural network models.

 

Let’s start with a small dataset and a known relationship. Doing this ensures that we can create a true outlier. The following code creates a 19-observation dataset with an input variable x drawn uniformly from 0 to 2, an error term from a normal distribution with mean 0 and standard deviation 0.02, and a response y based on the function y = 0.2 + 0.6*x - 0.86*x**2 + 0.28*x**3 + error.

 

data curve;
	call streaminit(8675309);        /* fix the seed so the data can be reproduced */
	do i=1 to 19;
		x=rand('uniform')*2;         /* input x uniformly distributed on [0, 2] */
		error=rand('normal')*0.02;   /* normal error: mean 0, standard deviation 0.02 */
		y=0.2 + 0.6*x - 0.86*x**2 + 0.28*x**3 + error;
		output;
	end;
run;
title1 'Plot of Original Data';
proc sgplot data=curve;
	scatter X=x Y=y;
run;

 

Because the error term is small, a plot of the data clearly shows the underlying relationship.

 

01_daober-June-Figure-1.png


 

This data will be modified by adding a single outlier. But not all outliers are the same, so we will add that single outlier in three different ways. The first outlier will be at an x value of 2.1, which is higher than any other x value, but its y value will follow the same formula as the original data. Outlier number 2 will also have an x value of 2.1, but its y value will NOT follow the rest of the data trend; the y value will be much lower than the trend. Finally, outlier number 3 will have an x value of 1.0, right in the middle of the x range, but its y value will be much larger than the formula would give. This code generates the three outliers, each of which will be appended to the original 19 points (a sketch of that appending step follows the code).

 

data outlier1; /* Outlier that follows the same trend as the data */
	x=2.1;
	error=rand('uniform')*0.025;
	y=0.2 + 0.6*x - 0.86*x**2 + 0.28*x**3 + error;
	output;
run;
data outlier2; /* Outlier that does NOT follow the trend of the data */
	x=2.1;
	error=rand('uniform')*0.025;
	y=0.1 + error;
run;
data outlier3; /* Outlier in the middle of the x-range and does not follow data trend */
	x=1;
	y=0.45;
run;
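

The post later fits PROC NEURAL to a dataset named FOLLOWTREND, but the appending step itself is not shown. A minimal sketch is below; FOLLOWTREND matches the later code, while OFFTREND and MIDDLE are names I am assuming for the other two scenarios.

data followtrend;   /* scenario 1: outlier follows the trend */
	set curve outlier1;
run;
data offtrend;      /* scenario 2: outlier breaks the trend at x=2.1 (assumed name) */
	set curve outlier2;
run;
data middle;        /* scenario 3: outlier in the middle of the x range (assumed name) */
	set curve outlier3;
run;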

 

Here are the plots showing these outliers with the original data.

  

combo234_daober-June-Figure-2.png

 

How would a neural network handle building a model in each of these scenarios? And would the fact that these are small datasets cause a problem? Let’s fit a neural network model to each situation to see whether the outlier causes the algorithm any trouble. These are not large modeling problems, so I will use PROC NEURAL with 5 hidden units in a single hidden layer and mostly default options.

 

title1 'Neural Network when Outlier Follows the Trend';
proc dmdb data=followtrend dmdbcat=catlog;  /* build the metadata catalog that PROC NEURAL requires */
	var x y;
run;
proc neural data=followtrend dmdbcat=catlog;
	input x / level=interval id=inputs;     /* one interval input */
	hidden 5 / id=HU1;                      /* single hidden layer with 5 units */
	target y / level=interval id=target;    /* interval target */
	connect inputs HU1;                     /* input layer to hidden layer */
	connect HU1 target;                     /* hidden layer to output */
	train 
		maxiter=100 
		estiter=1 
		outest=weights
		outfit=stats
		out=Predictions;                    /* scored data containing the predicted values */
run;
quit;

 

I won’t bother looking at the fitting results. Instead, I will just focus on a graph that plots x versus y overlaid with the neural network predictions (in red with filled dots). One way to produce such a plot is sketched below.
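

This sketch assumes the predicted values in the Predictions dataset are stored in a variable named P_y (the usual P_<target> naming); check the scored dataset in your session to confirm.

proc sort data=Predictions;   /* sort so the prediction line draws left to right */
	by x;
run;
title1 'Neural Network Predictions Overlaid on the Data';
proc sgplot data=Predictions;
	scatter x=x y=y;                                   /* observed data */
	series  x=x y=P_y / markers lineattrs=(color=red)  /* predictions in red with filled dots */
		markerattrs=(color=red symbol=circlefilled);
run;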

 

05_daober-June-Figure-5.png

 

As expected, this neural network fits the data well and predicts the outlier accurately. So in this situation, the outlier does not appear to have any impact on the fitting results of this neural network.

 

Let’s move on to the next situation where the x value is 2.1, but the y value does not follow the rest of the data trend. As before, we will not worry about the actual fitting results and just focus on the graph of the results.

 

06_daober-June-Figure-6.png

 

Here we see that the neural network model is pulled away from the true curve by the outlier. Neural networks are very flexible and may therefore be more sensitive to outliers; they can end up reacting to noise. For this situation, remember that this single outlier makes up 5% of the data. Later we will see what happens when the neural network has a larger dataset.

 

Let’s visit our last scenario now with the outlier in the middle of the x range.

 

07_daober-June-Figure-7.png

 

Even with the outlier “surrounded” by data points with similar x values, the neural network model is significantly affected by it. From these scenarios, it seems that a neural network model is sensitive to outliers. But as mentioned earlier, perhaps this is really due to such a small dataset; machine learning models typically work better with more data. Following the same approach, let’s fit the three scenarios again, but this time the original data will have 1999 observations rather than 19, so the outlier makes up only 0.05% of the total dataset. The only change needed is the loop limit in the original data step, as sketched below; the three resulting neural network models follow.
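

A minimal sketch of the larger data step, reusing the seed and formula from above (the dataset name curvebig is my own):

data curvebig;                        /* same relationship, 1999 observations */
	call streaminit(8675309);
	do i=1 to 1999;
		x=rand('uniform')*2;
		error=rand('normal')*0.02;
		y=0.2 + 0.6*x - 0.86*x**2 + 0.28*x**3 + error;
		output;
	end;
run;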

 

08_daober-June-Figure-8.png

 

As expected, when the outlier follows the trend, the neural network follows the signal in the data quite well.

 

09_daober-June-Figure-9.png

 

When the outlier does not follow the trend, the neural network model still looks pretty good. It appears that the outlier does not impact the results at all, but if we zoom in on the area around the outlier, the story is slightly different.

 

10_daober-June-Figure-10.png

 

When you zoom in, you can see that the neural network model is actually still biased because of this one outlier (which is in the lower right on the graph). There are more data points above the neural network prediction line than below it as the x value increases. The results may not be too bad, but even with only one outlier out of the 2000 observations, the model is affected.

 

You may not be surprised by this result. After all, the x value of 2.1 will have pretty high leverage since it is a larger x value than any other observation. Let’s look at the third scenario to see what happens.

 

11_daober-June-Figure-11.png

 

Finally, this looks pretty good. The outlier is at the very top of the graph, but none of the neural network predictions seem to be affected by that point. Let’s zoom in on the x range from about 0.95 to 1.05 to make sure there is not a hidden problem; one way to do this is sketched below.
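

A simple way to zoom is to subset the plot to the x range of interest in PROC SGPLOT. As before, this sketch assumes the scored dataset from this scenario is named Predictions, is sorted by x, and holds the predicted values in P_y.

title1 'Zoomed View Around x=1';
proc sgplot data=Predictions;
	where 0.95 <= x <= 1.05;                           /* restrict to the neighborhood of the outlier */
	scatter x=x y=y;
	series  x=x y=P_y / markers lineattrs=(color=red)
		markerattrs=(color=red symbol=circlefilled);
run;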

 

12_daober-June-Figure-12.png

 

Even when zoomed in, there does not seem to be a problem with the neural network model.

 

To summarize, outliers can be a problem for a neural network model. This is especially true if the dataset is small. Even with a larger dataset, if the point is an outlier in the x space AND it does not follow the trend of the rest of the data, the neural network model will still likely be affected. If there are other observations in the input space near the outlier, the impact on the results will likely be minimal. Ultimately, even with a neural network model, care should be taken to identify and possibly mitigate outliers. Possible actions include examining the data or using a more robust error function for the neural network; a sketch of the latter follows. In a future post I will look at the impact outliers have on a decision tree.
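

PROC NEURAL lets you choose the error function on the TARGET statement. Below is a sketch of the second scenario refit with a robust M-estimation error; it assumes ERROR=HUBER is available in your release (check the PROC NEURAL documentation for the supported error functions) and reuses the OFFTREND name assumed earlier.

proc dmdb data=offtrend dmdbcat=catlog2;                /* rebuild the metadata catalog for this dataset */
	var x y;
run;
proc neural data=offtrend dmdbcat=catlog2;
	input x / level=interval id=inputs;
	hidden 5 / id=HU1;
	target y / level=interval id=target error=huber;    /* robust Huber error instead of the default */
	connect inputs HU1;
	connect HU1 target;
	train maxiter=100 estiter=1 out=Predictions;
run;
quit;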

 

 

Find more articles from SAS Global Enablement and Learning here.
