11-19-2015 11:33 PM
Several continuous variables in my data have outliers/extreme values. Instead of setting them to missing, I wonder if there's a less invasive approach, i.e., to leave the outliers untouched but run statistics that exclude them. For example, how can I compute the mean of a variable while excluding the 5 observations with the lowest values, and maybe the 5 with the highest values?
11-20-2015 12:25 AM
Thanks for your hint. I reviewed these options, and trimmed means look like what I need.
There's another related question: It seems that the TRIMMED= option in PROC UNIVARIATE cuts off both tails of the variable equally. Is it possible to exclude only one tail? For example, there are cases where the outliers are skewed toward (or only present on) one end, either left or right. Hence, sometimes we wish to put a restriction on just one tail of the distribution.
11-20-2015 08:19 AM
As far as I know, there are no one-sided versions of trimmed or Winsorized means implemented in SAS. Of course, they can be calculated fairly easily: In the case of one-sided trimmed means a simple WHERE condition can exclude the extreme observations, after a suitable cutoff point has been determined. The question remains (not only in the one-sided case) where to put the cutoff point. The answer will depend on the statistical model used. As you say, your data includes several variables, which makes it more difficult, because it is not straightforward how to define extremeness for multivariate data. Please note that an element of a multivariate sample could be an outlier without having an extreme value in any of its components.
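To illustrate the WHERE approach, here is a minimal sketch rather than a recommended procedure. It assumes the SASHELP.HEART sample data set and its Cholesterol variable, and it uses the 99th percentile as the cutoff purely for illustration; the data set CUT, the prefix CUT_, and the macro variable CUTOFF are made-up names, and choosing the cutoff remains the hard part.

```sas
/* Step 1: determine a cutoff (here, arbitrarily, the 99th percentile). */
proc univariate data=sashelp.heart noprint;
   var Cholesterol;
   output out=cut pctlpts=99 pctlpre=cut_;   /* creates variable CUT_99 */
run;

/* Step 2: store the cutoff in a macro variable. */
data _null_;
   set cut;
   call symputx('cutoff', cut_99);
run;

/* Step 3: compute statistics with the upper tail excluded.      */
/* The original data set is not modified; WHERE only filters the */
/* observations that this procedure reads.                       */
proc means data=sashelp.heart n mean std min max;
   where not missing(Cholesterol) and Cholesterol <= &cutoff;
   var Cholesterol;
run;
```

Reversing the comparison and using a low percentile would trim the lower tail instead.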
The classic standard reference on the subject, "Outliers in Statistical Data", includes a section on "Accommodation of outliers in gamma (including exponential) samples" (p. 174 ff.) where various proposals for this univariate, asymmetric setting are discussed, including the one-sided trimmed mean. A corresponding section reviews discordancy tests for this class of models (p. 193 ff.). Part III of that book devotes more than 100 pages to multivariate and structured data.
11-20-2015 12:07 PM
Symmetric trimming or winsorizing might seem inefficient. Why sacrifice or alter some of your good data because of some other bad data? But if you can assume that the bad data are nevertheless on the correct side of the distribution (e.g. extremely large recorded values really do represent large values), then removing them from one tail only will systematically shift the location of the distribution. That's why it is safer to perform trimming or winsorization symmetrically.
11-20-2015 08:42 AM
The simplest mechanism is to change from using classical estimators (like the mean and standard deviation) to robust estimators (like the median and interquartile range). By using a robust estimator, you avoid making distributional assumptions about your data.
PROC UNIVARIATE and other SAS procedures support robust estimates of location and scale. See:
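As a quick illustration (a sketch, assuming the SASHELP.HEART sample data set and its Cholesterol variable, which are not part of the original question), the following call requests symmetric trimmed and Winsorized means plus robust scale estimates. An integer value for TRIMMED= or WINSORIZED= is the number of observations taken from each tail, so 5 matches the "5 lowest and 5 highest" case in the original question.

```sas
proc univariate data=sashelp.heart
                trimmed=5       /* drop the 5 smallest and 5 largest values       */
                winsorized=5    /* replace them with the nearest retained values  */
                robustscale;    /* robust scale estimates such as IQR and MAD     */
   var Cholesterol;
run;
```

The median and quartiles appear in the default output, so no extra option is needed for those.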
11-20-2015 09:07 AM
Be careful. You are in dangerous territory. Your idea of deleting only the largest observations (or the smallest, but not both) can lead to arbitrary results. Rick has a better suggestion: use various robust statistical methods that deal with extreme observations in an appropriate manner. I highly recommend the quantile regression procedure. Although it is primarily used for relationships between variables, it can be used for many other robust statistical estimates. I think Rick has a blog post on this.
11-20-2015 09:44 AM
I think LVM might be referring to this post about how to use the ROBUSTREG procedure to compute robust estimates of univariate quantities.
To reiterate LVM's point: Excluding data is a slippery slope. I like to say that we can use robust methods to IDENTIFY outliers and to construct estimates that are ROBUST to outliers, but I rarely use the phrase "exclude outliers."
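For readers without access to that post, here is a minimal sketch of the general idea, not necessarily the code from the post: an intercept-only model, whose intercept is a robust estimate of location. It assumes the SASHELP.HEART sample data set and its Cholesterol variable; METHOD=M is just one of the estimation methods the procedure offers.

```sas
/* Intercept-only model: no regressors to the right of the equal sign. */
/* The Intercept row of the parameter estimates table is a robust      */
/* estimate of location; the Scale row is a robust estimate of scale.  */
proc robustreg data=sashelp.heart method=m;
   model Cholesterol = ;
run;
```

By default the procedure also reports how many observations it flags as outliers, which fits the "identify, don't exclude" point above.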
11-20-2015 06:31 PM
You've gotten some good advice regarding MEANS. Especially the advice to BE CAREFUL.
In your question, though, you used MEANS only as an example. If you want to do regression, I suggest either QUANTREG or ROBUSTREG.
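A minimal sketch of both procedures, assuming the SASHELP.CLASS sample data set with Weight as the response and Height as the regressor (chosen only for illustration):

```sas
/* Median (0.5 quantile) regression: less sensitive to extreme */
/* responses than least squares.                               */
proc quantreg data=sashelp.class;
   model Weight = Height / quantile=0.5;
run;

/* Robust regression with MM estimation: downweights outliers  */
/* instead of deleting them.                                   */
proc robustreg data=sashelp.class method=mm;
   model Weight = Height;
run;
```

Neither call removes any observation from the data set.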
Oddly enough, I wrote a paper on these: "Should more of your PROC REGs be QUANTREGs and ROBUSTREGs?"
In addition, note that the outliers are often the interesting stuff: automatically excluding outliers would mean that we never discovered black holes. :-)