Programming the statistical procedures from SAS

How to remove OUTLIERS in Base SAS?

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 75
Accepted Solution

How to remove OUTLIERS in Base SAS?

Hello everybody,

I would appreciate if you guys could give me the SAS logic to remove outliers in any given dataset.

For example, if we have a dataset with columns-

Employeeid , Salary, Bonus 

Scenario here would be a CEO and top management guys would be outliers and the opposites like cleaners and others  the lower end of the spectrum may also fall into outliers category.

I would like a dataset removing all the outliers. Any ideas in sas base? I'd appreciate any other suitable examples with the correct SAS logic.Thanks


Accepted Solutions
Solution
‎06-30-2017 01:50 PM
SAS Super FREQ
Posts: 3,547

Re: How to remove OUTLIERS in Base SAS?

My $0.02.  If you are using the data to form correlation, then this is really a bivariate problem, You want to remove outliers from the JOINT distribution of (salary, bonus).  See the article Detecting outliers in SAS: Part 3: Multivariate location and scatter - The DO Loop for some SAS code to do this.

Now I'm going to take off my statistician's hat and put on my "Git 'er done" hat.  I suggest that you create a scatter plot of salary vs bonus.  The scatter plot will show you if the variables are correlated and whether there are outliers in the data that need to be removed before you actually fire up PROC CORR to compute the magnitude of the correlation.

View solution in original post


All Replies
PROC Star
Posts: 1,132

Re: How to remove OUTLIERS in Base SAS?

Do you have business logic for what defines an outlier?

Some examples:

Top x and bottom x records;

Records below the 5th and above the 95th quantiles;

More than two or three standard deviations from the mean;

Any of these can be done easily, but a decision needs to be made on which.

Tom

SAS Super FREQ
Posts: 3,547

Re: How to remove OUTLIERS in Base SAS?

The first step is to use some statistics to identify outliers. For univariate data, PROC UNIVARIATE has several options for identifying outliers.

For details see the article Detecting outliers in SAS: Part 1: Estimating location - The DO Loop

and the follow-up article Detecting outliers in SAS: Part 2: Estimating scale - The DO Loop

(For both, scroll down to the PROC UNIVARIATE section).

After you've identified the outliers, you can use the DATA step to remove them. For example, you can use the ideas in the section  "The connection with outlier detection" (in Part 2) to identify and delete outliers whose robust z-scores are are more than 2.3 or 3.0.

Respected Advisor
Posts: 2,655

Re: How to remove OUTLIERS in Base SAS?

When working with data like salaries, be wary of the distribution.  Assuming a normal distribution is probably not a good idea, and so setting cutoffs via the standard deviation or z scores can be dangerous.  As much as I dislike saying this, you should probably test for normality first, using PROC UNIVARIATE. We use the Shapiro-Wilk statistic, as our datasets are usually not large.  If the test shows that there is no significant deviation from normality, then what Rick offers are the way to go.  If the data significantly differ from normal, then applying Tom's first two rules makes the most sense.

The question then arises--what are you going to do with the dataset with the outliers removed?  Calculate some measure of location and scale?  If so, there are better ways than removing outliers--Rick's links provide robust measures.  The median is robust to almost all distributional assumptions, and the IQR is nearly as robust as a measure of scale.

As a former instructor once said to me--Pay attention to the outliers.  They are the datapoints where something interesting is happening.

Steve Denham

Trusted Advisor
Posts: 2,114

Re: How to remove OUTLIERS in Base SAS?

Another issue may be identifiability.  Sometime the value of the salary can itself be an identifier.  We have used topcoding to minimize the identifiability yet still retaining much of the information.

Doc Muhlbaier

Duke

Frequent Contributor
Posts: 75

Re: How to remove OUTLIERS in Base SAS?

Well, The purpose of removing the outlier is to have clear determination of the correlation between the 2 variables, how the salary and Bonus rise, do they positively or inversely correlate. The outlier doesn;t really help in correlation graph output. Does that make sense?

Solution
‎06-30-2017 01:50 PM
SAS Super FREQ
Posts: 3,547

Re: How to remove OUTLIERS in Base SAS?

My $0.02.  If you are using the data to form correlation, then this is really a bivariate problem, You want to remove outliers from the JOINT distribution of (salary, bonus).  See the article Detecting outliers in SAS: Part 3: Multivariate location and scatter - The DO Loop for some SAS code to do this.

Now I'm going to take off my statistician's hat and put on my "Git 'er done" hat.  I suggest that you create a scatter plot of salary vs bonus.  The scatter plot will show you if the variables are correlated and whether there are outliers in the data that need to be removed before you actually fire up PROC CORR to compute the magnitude of the correlation.

Super User
Posts: 18,549

Re: How to remove OUTLIERS in Base SAS?

I'm with Rick, graph it first and use the graph to determine what your definition of an outlier may be.

Anscombe's quartet - Wikipedia, the free encyclopedia

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 19843 views
  • 3 likes
  • 6 in conversation