Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Analytics
- /
- Stat Procs
- /
- Re: How to remove OUTLIERS in Base SAS?

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

🔒 This topic is **solved** and **locked**.
Need further help from the community? Please
sign in and ask a **new** question.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 10-09-2012 01:02 PM
(32471 views)

Hello everybody,

I would appreciate if you guys could give me the SAS logic to remove outliers in any given dataset.

For example, if we have a dataset with columns-

Employeeid , Salary, Bonus

Scenario here would be a CEO and top management guys would be outliers and the opposites like cleaners and others the lower end of the spectrum may also fall into outliers category.

I would like a dataset removing all the outliers. Any ideas in sas base? I'd appreciate any other suitable examples with the correct SAS logic.Thanks

1 ACCEPTED SOLUTION

Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

My $0.02. If you are using the data to form correlation, then this is really a bivariate problem, You want to remove outliers from the JOINT distribution of (salary, bonus). See the article Detecting outliers in SAS: Part 3: Multivariate location and scatter - The DO Loop for some SAS code to do this.

Now I'm going to take off my statistician's hat and put on my "Git 'er done" hat. I suggest that you create a scatter plot of salary vs bonus. The scatter plot will show you if the variables are correlated and whether there are outliers in the data that need to be removed before you actually fire up PROC CORR to compute the magnitude of the correlation.

7 REPLIES 7

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Do you have business logic for what defines an outlier?

Some examples:

Top x and bottom x records;

Records below the 5th and above the 95th quantiles;

More than two or three standard deviations from the mean;

Any of these can be done easily, but a decision needs to be made on which.

Tom

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

The first step is to use some statistics to identify outliers. For univariate data, PROC UNIVARIATE has several options for identifying outliers.

For details see the article Detecting outliers in SAS: Part 1: Estimating location - The DO Loop

and the follow-up article Detecting outliers in SAS: Part 2: Estimating scale - The DO Loop

(For both, scroll down to the PROC UNIVARIATE section).

After you've identified the outliers, you can use the DATA step to remove them. For example, you can use the ideas in the section "The connection with outlier detection" (in Part 2) to identify and delete outliers whose robust z-scores are are more than 2.3 or 3.0.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

When working with data like salaries, be wary of the distribution. Assuming a normal distribution is probably not a good idea, and so setting cutoffs via the standard deviation or z scores can be dangerous. As much as I dislike saying this, you should probably test for normality first, using PROC UNIVARIATE. We use the Shapiro-Wilk statistic, as our datasets are usually not large. If the test shows that there is no significant deviation from normality, then what Rick offers are the way to go. If the data significantly differ from normal, then applying Tom's first two rules makes the most sense.

The question then arises--what are you going to do with the dataset with the outliers removed? Calculate some measure of location and scale? If so, there are better ways than removing outliers--Rick's links provide robust measures. The median is robust to almost all distributional assumptions, and the IQR is nearly as robust as a measure of scale.

As a former instructor once said to me--Pay attention to the outliers. They are the datapoints where something interesting is happening.

Steve Denham

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Another issue may be identifiability. Sometime the value of the salary can itself be an identifier. We have used topcoding to minimize the identifiability yet still retaining much of the information.

Doc Muhlbaier

Duke

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

My $0.02. If you are using the data to form correlation, then this is really a bivariate problem, You want to remove outliers from the JOINT distribution of (salary, bonus). See the article Detecting outliers in SAS: Part 3: Multivariate location and scatter - The DO Loop for some SAS code to do this.

Now I'm going to take off my statistician's hat and put on my "Git 'er done" hat. I suggest that you create a scatter plot of salary vs bonus. The scatter plot will show you if the variables are correlated and whether there are outliers in the data that need to be removed before you actually fire up PROC CORR to compute the magnitude of the correlation.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. **Registration is now open through August 30th**. Visit the SAS Hackathon homepage.

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.