BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MarkWik
Quartz | Level 8

Hello everybody,

I would appreciate if you guys could give me the SAS logic to remove outliers in any given dataset.

For example, if we have a dataset with columns-

Employeeid , Salary, Bonus 

Scenario here would be a CEO and top management guys would be outliers and the opposites like cleaners and others  the lower end of the spectrum may also fall into outliers category.

I would like a dataset removing all the outliers. Any ideas in sas base? I'd appreciate any other suitable examples with the correct SAS logic.Thanks

1 ACCEPTED SOLUTION

Accepted Solutions
Rick_SAS
SAS Super FREQ

My $0.02.  If you are using the data to form correlation, then this is really a bivariate problem, You want to remove outliers from the JOINT distribution of (salary, bonus).  See the article Detecting outliers in SAS: Part 3: Multivariate location and scatter - The DO Loop for some SAS code to do this.

Now I'm going to take off my statistician's hat and put on my "Git 'er done" hat.  I suggest that you create a scatter plot of salary vs bonus.  The scatter plot will show you if the variables are correlated and whether there are outliers in the data that need to be removed before you actually fire up PROC CORR to compute the magnitude of the correlation.

View solution in original post

7 REPLIES 7
TomKari
Onyx | Level 15

Do you have business logic for what defines an outlier?

Some examples:

Top x and bottom x records;

Records below the 5th and above the 95th quantiles;

More than two or three standard deviations from the mean;

Any of these can be done easily, but a decision needs to be made on which.

Tom

Rick_SAS
SAS Super FREQ

The first step is to use some statistics to identify outliers. For univariate data, PROC UNIVARIATE has several options for identifying outliers.

For details see the article Detecting outliers in SAS: Part 1: Estimating location - The DO Loop

and the follow-up article Detecting outliers in SAS: Part 2: Estimating scale - The DO Loop

(For both, scroll down to the PROC UNIVARIATE section).

After you've identified the outliers, you can use the DATA step to remove them. For example, you can use the ideas in the section  "The connection with outlier detection" (in Part 2) to identify and delete outliers whose robust z-scores are are more than 2.3 or 3.0.

SteveDenham
Jade | Level 19

When working with data like salaries, be wary of the distribution.  Assuming a normal distribution is probably not a good idea, and so setting cutoffs via the standard deviation or z scores can be dangerous.  As much as I dislike saying this, you should probably test for normality first, using PROC UNIVARIATE. We use the Shapiro-Wilk statistic, as our datasets are usually not large.  If the test shows that there is no significant deviation from normality, then what Rick offers are the way to go.  If the data significantly differ from normal, then applying Tom's first two rules makes the most sense.

The question then arises--what are you going to do with the dataset with the outliers removed?  Calculate some measure of location and scale?  If so, there are better ways than removing outliers--Rick's links provide robust measures.  The median is robust to almost all distributional assumptions, and the IQR is nearly as robust as a measure of scale.

As a former instructor once said to me--Pay attention to the outliers.  They are the datapoints where something interesting is happening.

Steve Denham

Doc_Duke
Rhodochrosite | Level 12

Another issue may be identifiability.  Sometime the value of the salary can itself be an identifier.  We have used topcoding to minimize the identifiability yet still retaining much of the information.

Doc Muhlbaier

Duke

MarkWik
Quartz | Level 8

Well, The purpose of removing the outlier is to have clear determination of the correlation between the 2 variables, how the salary and Bonus rise, do they positively or inversely correlate. The outlier doesn;t really help in correlation graph output. Does that make sense?

Rick_SAS
SAS Super FREQ

My $0.02.  If you are using the data to form correlation, then this is really a bivariate problem, You want to remove outliers from the JOINT distribution of (salary, bonus).  See the article Detecting outliers in SAS: Part 3: Multivariate location and scatter - The DO Loop for some SAS code to do this.

Now I'm going to take off my statistician's hat and put on my "Git 'er done" hat.  I suggest that you create a scatter plot of salary vs bonus.  The scatter plot will show you if the variables are correlated and whether there are outliers in the data that need to be removed before you actually fire up PROC CORR to compute the magnitude of the correlation.

Reeza
Super User

I'm with Rick, graph it first and use the graph to determine what your definition of an outlier may be.

Anscombe's quartet - Wikipedia, the free encyclopedia

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 7 replies
  • 32126 views
  • 4 likes
  • 6 in conversation