- How to run statistics that exclude outliers/extrem...


11-19-2015 11:33 PM

Several continuous variables in my data have outliers/extreme values. Instead of setting them to missing, I wonder if there's a less invasive approach, i.e., to leave the outliers untouched but run statistics that exclude them. For example, how can I compute the mean of a variable while excluding the 5 observations with the lowest values, and maybe the 5 with the highest values?


11-19-2015 11:52 PM

Look at winsorized means and trimmed means in **proc univariate**.
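For the example in the question (dropping the 5 lowest and 5 highest values), a minimal sketch might look like the following. The data set name `have` and the variable name `x` are placeholders, not names from this thread.

```sas
/* Sketch only: HAVE and X are placeholder names.                      */
/* TRIMMED=5 drops the 5 smallest and 5 largest values of X before     */
/* averaging; WINSORIZED=5 instead replaces them with the nearest      */
/* retained value. A value between 0 and 0.5 is read as a proportion   */
/* per tail rather than a count.                                       */
proc univariate data=have trimmed=5 winsorized=5;
   var x;
run;
```

The input data set is not modified; the trimmed and Winsorized means appear as additional output tables.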

PG


11-20-2015 12:25 AM

Thanks for your hint. I reviewed these options, and trimmed means look like what I need.

There's another related question: it seems that the TRIMMED= option in PROC UNIVARIATE cuts off both tails of the variable equally. Is it possible to exclude only one tail? For example, there are cases where the outliers are skewed toward (or only present on) one end, either left or right. Hence, we sometimes wish to restrict just one tail of the distribution.


11-20-2015 08:19 AM

As far as I know, there are no one-sided versions of trimmed or Winsorized means implemented in SAS. Of course, they can be calculated fairly easily: In the case of one-sided trimmed means a simple WHERE condition can exclude the extreme observations, after a suitable cutoff point has been determined. The question remains (not only in the one-sided case) where to put the cutoff point. The answer will depend on the statistical model used. As you say, your data includes several variables, which makes it more difficult, because it is not straightforward how to define extremeness for multivariate data. Please note that an element of a multivariate sample could be an outlier without having an extreme value in any of its components.
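As a rough sketch of the WHERE approach, assuming a placeholder data set `have`, a placeholder variable `x`, and the 95th percentile as a purely illustrative cutoff (the choice of cutoff is exactly the open question raised above):

```sas
/* Sketch only: HAVE, X, CUTOFFS and UPPER_CUT are placeholder names.  */

/* Step 1: store the 95th percentile of X in a one-row data set.       */
proc means data=have noprint;
   var x;
   output out=cutoffs p95=upper_cut;
run;

/* Step 2: copy the cutoff into a macro variable.                      */
data _null_;
   set cutoffs;
   call symputx('upper_cut', upper_cut);
run;

/* Step 3: mean of X over observations at or below the cutoff.         */
/* The WHERE statement only filters this step; HAVE is left unchanged. */
proc means data=have n mean;
   where x <= &upper_cut;
   var x;
run;
```

Reversing the comparison and using a low percentile (for example `p5=`) would trim the lower tail instead.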

The classic standard reference on the subject, "Outliers in Statistical Data", includes a section on "Accommodation of outliers in gamma (including exponential) samples" (p. 174 ff.) where various proposals for this univariate, asymmetric setting are discussed, including the one-sided trimmed mean. A corresponding section reviews discordancy tests for this class of models (p. 193 ff.). Part III of that book devotes more than 100 pages to multivariate and structured data.


11-20-2015 12:07 PM

Symmetric trimming or winsorizing might seem inefficient. Why sacrifice or alter some of your *good* data because of some other *bad* data? But if the *bad* data are nevertheless on the correct side of the distribution (e.g. very large values really do represent large values), then removing them from one tail only will systematically shift the estimated location of the distribution. That's why it is safer to perform trimming or winsorization symmetrically.

PG


11-20-2015 08:42 AM

The simplest mechanism is to change from using classical estimators (like the mean and standard deviation) to robust estimators (like the median and interquartile range). By using a robust estimator, you avoid making distributional assumptions about your data.

PROC UNIVARIATE and other SAS procedures support robust estimates of location and scale. See:

Detecting outliers in SAS: Part 1: Estimating location

Detecting outliers in SAS: Part 2: Estimating scale
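A minimal sketch of requesting these robust statistics, again assuming a placeholder data set `have` and variable `x`:

```sas
/* Sketch only: HAVE and X are placeholder names.                      */

/* Median and interquartile range alongside the sample size.           */
proc means data=have n median qrange;
   var x;
run;

/* ROBUSTSCALE adds a table of robust scale estimates, including the   */
/* interquartile range and the median absolute deviation (MAD).        */
proc univariate data=have robustscale;
   var x;
run;
```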


11-20-2015 09:07 AM

Be careful. You are in dangerous territory. Your idea of deleting only the largest observations (or the smallest, but not both) can lead to arbitrary results. Rick has a better suggestion: use various robust statistical methods that deal with extreme observations in an appropriate manner. I highly recommend the quantile regression procedure. Although it is primarily used for relationships between variables, it can be used for many other robust statistical estimates. I think Rick has a blog post on this.


11-20-2015 09:44 AM

I think LVM might be referring to this post about how to use the ROBUSTREG procedure to compute robust estimates of univariate quantities.

To reiterate LVM's point: Excluding data is a slippery slope. I like to say that we can use robust methods to IDENTIFY outliers and to construct estimates that are ROBUST to outliers, but I rarely use the phrase "exclude outliers."
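The general idea can be sketched with an intercept-only model, where the estimated intercept serves as a robust location estimate. This is an illustration under assumptions, not the code from the linked post: `have` and `x` are placeholder names, and M estimation is just one of the available methods.

```sas
/* Sketch only: HAVE and X are placeholder names.                      */
/* With no regressors on the right-hand side, the Intercept row of the */
/* parameter estimates is a robust (M) estimate of the location of X.  */
proc robustreg data=have method=m;
   model x = ;
run;
```

No observations are deleted here; extreme values are downweighted by the estimator rather than excluded.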


11-20-2015 06:31 PM

You've gotten some good advice regarding MEANS. Especially the advice to BE CAREFUL.

In your question, though, you used MEANS only as an example. If you want to do regression, I suggest either QUANTREG or ROBUSTREG.
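For orientation, a bare-bones sketch of both procedures might look like this. The data set `have`, the response `y`, and the predictors `x1` and `x2` are all hypothetical names, and the options shown are one reasonable choice among several.

```sas
/* Sketch only: HAVE, Y, X1 and X2 are hypothetical names.             */

/* Median regression: models the conditional median of Y, which is     */
/* less sensitive to extreme responses than the conditional mean.      */
proc quantreg data=have;
   model y = x1 x2 / quantile=0.5;
run;

/* Robust regression using MM estimation.                              */
proc robustreg data=have method=mm;
   model y = x1 x2;
run;
```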

Oddly enough, I wrote a paper on these: "Should more of your PROC REGs be QUANTREGs and ROBUSTREGs?"

In addition, note that the outliers are often the interesting stuff - automatically excluding outliers would mean that we never discovered black holes. :-).