10-09-2012 01:02 PM
I would appreciate if you guys could give me the SAS logic to remove outliers in any given dataset.
For example, if we have a dataset with columns-
Employeeid , Salary, Bonus
Scenario here would be a CEO and top management guys would be outliers and the opposites like cleaners and others the lower end of the spectrum may also fall into outliers category.
I would like a dataset removing all the outliers. Any ideas in sas base? I'd appreciate any other suitable examples with the correct SAS logic.Thanks
10-09-2012 01:25 PM
Do you have business logic for what defines an outlier?
Top x and bottom x records;
Records below the 5th and above the 95th quantiles;
More than two or three standard deviations from the mean;
Any of these can be done easily, but a decision needs to be made on which.
10-09-2012 01:41 PM
The first step is to use some statistics to identify outliers. For univariate data, PROC UNIVARIATE has several options for identifying outliers.
For details see the article Detecting outliers in SAS: Part 1: Estimating location - The DO Loop
and the follow-up article Detecting outliers in SAS: Part 2: Estimating scale - The DO Loop
(For both, scroll down to the PROC UNIVARIATE section).
After you've identified the outliers, you can use the DATA step to remove them. For example, you can use the ideas in the section "The connection with outlier detection" (in Part 2) to identify and delete outliers whose robust z-scores are are more than 2.3 or 3.0.
10-09-2012 02:41 PM
When working with data like salaries, be wary of the distribution. Assuming a normal distribution is probably not a good idea, and so setting cutoffs via the standard deviation or z scores can be dangerous. As much as I dislike saying this, you should probably test for normality first, using PROC UNIVARIATE. We use the Shapiro-Wilk statistic, as our datasets are usually not large. If the test shows that there is no significant deviation from normality, then what Rick offers are the way to go. If the data significantly differ from normal, then applying Tom's first two rules makes the most sense.
The question then arises--what are you going to do with the dataset with the outliers removed? Calculate some measure of location and scale? If so, there are better ways than removing outliers--Rick's links provide robust measures. The median is robust to almost all distributional assumptions, and the IQR is nearly as robust as a measure of scale.
As a former instructor once said to me--Pay attention to the outliers. They are the datapoints where something interesting is happening.
10-10-2012 07:40 AM
Another issue may be identifiability. Sometime the value of the salary can itself be an identifier. We have used topcoding to minimize the identifiability yet still retaining much of the information.
10-10-2012 12:54 PM
Well, The purpose of removing the outlier is to have clear determination of the correlation between the 2 variables, how the salary and Bonus rise, do they positively or inversely correlate. The outlier doesn;t really help in correlation graph output. Does that make sense?
10-10-2012 01:11 PM
My $0.02. If you are using the data to form correlation, then this is really a bivariate problem, You want to remove outliers from the JOINT distribution of (salary, bonus). See the article Detecting outliers in SAS: Part 3: Multivariate location and scatter - The DO Loop for some SAS code to do this.
Now I'm going to take off my statistician's hat and put on my "Git 'er done" hat. I suggest that you create a scatter plot of salary vs bonus. The scatter plot will show you if the variables are correlated and whether there are outliers in the data that need to be removed before you actually fire up PROC CORR to compute the magnitude of the correlation.