NonSleeper
Quartz | Level 8

Several continuous variables in my data have outliers/extreme values. Instead of setting them to missing, I wonder if there's a less invasive approach, i.e., to leave the outliers untouched but run statistics that exclude them. For example, how can I compute the mean of a variable while excluding the 5 observations with the lowest values, and maybe the 5 with the highest values?

8 REPLIES
PGStats
Opal | Level 21

Look at winsorized means and trimmed means in proc univariate.
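To make the two ideas concrete, here is a minimal Python sketch (not SAS code) of the arithmetic behind a trimmed mean and a winsorized mean when k observations are handled in each tail. In PROC UNIVARIATE these are requested with the TRIMMED= and WINSORIZED= options; the function names and the sample below are hypothetical and only illustrate the definitions.

```python
def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest values."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for this sample")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def winsorized_mean(values, k):
    """Mean after pulling the k smallest/largest values in to the nearest kept value."""
    ordered = sorted(values)
    n = len(ordered)
    if 2 * k >= n:
        raise ValueError("k is too large for this sample")
    low, high = ordered[k], ordered[n - k - 1]
    clipped = [min(max(v, low), high) for v in ordered]
    return sum(clipped) / n


# Hypothetical sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

print(sum(sample) / len(sample))   # ordinary mean: 104.5
print(trimmed_mean(sample, 1))     # 5.5
print(winsorized_mean(sample, 1))  # 5.5
```

Neither function modifies the data; the extreme values stay in the list and are only set aside (or pulled in) for the calculation.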

PG
NonSleeper
Quartz | Level 8

Thanks for your hint. I reviewed these options, and trimmed means look like what I need.

There's another related question: It seems that the TRIMMED= option in PROC UNIVARIATE cuts off both tails of the variable equally. Is it possible to trim only one tail? For example, in some cases the outliers are concentrated (or only present) at one end of the distribution, either left or right, so we sometimes want to restrict just one tail.

FreelanceReinh
Jade | Level 19

As far as I know, there are no one-sided versions of trimmed or Winsorized means implemented in SAS. Of course, they can be calculated fairly easily: In the case of one-sided trimmed means a simple WHERE condition can exclude the extreme observations, after a suitable cutoff point has been determined. The question remains (not only in the one-sided case) where to put the cutoff point. The answer will depend on the statistical model used. As you say, your data includes several variables, which makes it more difficult, because it is not straightforward how to define extremeness for multivariate data. Please note that an element of a multivariate sample could be an outlier without having an extreme value in any of its components.
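As a rough illustration of the "cutoff plus WHERE condition" idea, here is a small Python sketch (not SAS code). It assumes the cutoff is chosen so that the k largest values are dropped; how the cutoff should really be chosen is exactly the open question raised above. The names `upper_cutoff` and `mean_at_or_below` are hypothetical.

```python
def upper_cutoff(values, k):
    """Largest value that is kept when the k largest observations are dropped."""
    ordered = sorted(values)
    if k >= len(ordered):
        raise ValueError("k is too large for this sample")
    return ordered[len(ordered) - k - 1]


def mean_at_or_below(values, cutoff):
    """Mean of the observations that pass the filter `value <= cutoff`."""
    kept = [v for v in values if v <= cutoff]
    if not kept:
        raise ValueError("no observations pass the filter")
    return sum(kept) / len(kept)


# Hypothetical sample with one extreme value in the upper tail.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

cutoff = upper_cutoff(sample, 1)         # 9
print(mean_at_or_below(sample, cutoff))  # 5.0
```

Note that with tied values at the cutoff, a filter like this can keep more observations than a rule that drops exactly k of them.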

 

The classic standard reference on the subject, "Outliers in Statistical Data" by Barnett and Lewis, includes a section on "Accommodation of outliers in gamma (including exponential) samples" (p. 174 ff.) where various proposals for this univariate, asymmetric setting are discussed, including the one-sided trimmed mean. A corresponding section reviews discordancy tests for this class of models (p. 193 ff.). Part III of that book devotes more than 100 pages to multivariate and structured data.

PGStats
Opal | Level 21

Symmetric trimming or winsorizing might seem inefficient. Why sacrifice or alter some of your good data because of some other bad data? But if you can assume that the bad data are nevertheless on the correct side of the distribution (e.g., very, very large values actually represent large values), then removing them from one tail only will systematically shift the estimated location of the distribution. That's why it is safer to perform trimming or winsorization symmetrically.
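A tiny worked example of that shift, as a Python sketch on made-up data (not SAS code): the values 1 to 21 are symmetric around 11 and contain no bad data at all, yet dropping only the two largest values moves the mean, while dropping two from each tail does not.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical clean, symmetric sample: 1, 2, ..., 21.
clean = list(range(1, 22))

full_mean = mean(clean)                  # 11.0
upper_only = mean(sorted(clean)[:-2])    # drops 20 and 21, giving 10.0
symmetric = mean(sorted(clean)[2:-2])    # drops 1, 2, 20, 21, giving 11.0

print(full_mean, upper_only, symmetric)
```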

PG
Rick_SAS
SAS Super FREQ

The simplest mechanism is to change from using classical estimators (like the mean and standard deviation) to robust estimators (like the median and interquartile range). By using a robust estimator, you avoid making distributional assumptions about your data.
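To show the difference in behavior, here is a short Python sketch (not SAS code) that compares the classical and robust estimates on a made-up sample with one extreme value. It uses Python's standard `statistics` module; quartile definitions differ between software packages, so the interquartile range computed here need not match SAS output exactly.

```python
import statistics

# Hypothetical sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

classical_location = statistics.mean(sample)
classical_scale = statistics.stdev(sample)

robust_location = statistics.median(sample)
q1, _, q3 = statistics.quantiles(sample, n=4)  # three quartile cut points
robust_scale = q3 - q1                         # interquartile range

print("mean:", classical_location, "std dev:", classical_scale)
print("median:", robust_location, "IQR:", robust_scale)
```

The single value of 1000 dominates the mean and standard deviation, while the median and interquartile range stay close to what the other nine observations suggest.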

 

PROC UNIVARIATE and other SAS procedures support robust estimates of location and scale. See:

Detecting outliers in SAS: Part 1: Estimating location

Detecting outliers in SAS: Part 2: Estimating scale

 

lvm
Rhodochrosite | Level 12

Be careful. You are in dangerous territory. Your idea of deleting only the largest observations (or the smallest, but not both) can lead to arbitrary results. Rick has a better suggestion: use robust statistical methods that deal with extreme observations in an appropriate manner. I highly recommend the quantile regression procedure. Although it is primarily used to model relationships between variables, it can also produce many other robust statistical estimates. I think Rick has a blog post on this.
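One way to see why quantile regression yields robust univariate estimates: with no predictors, it reduces to finding the value that minimizes the so-called check (pinball) loss, and at level 0.5 that minimizer is a median. The Python sketch below (not SAS code, and not how any SAS procedure is implemented) searches only over the observed values, which is a simplifying assumption; the function names are hypothetical.

```python
def check_loss(values, q, tau):
    """Total check (pinball) loss of the candidate value q at quantile level tau."""
    return sum((tau - (v < q)) * (v - q) for v in values)


def quantile_by_check_loss(values, tau):
    """Observed value that minimizes the check loss (intercept-only quantile fit)."""
    candidates = sorted(set(values))
    return min(candidates, key=lambda q: check_loss(values, q, tau))


# Hypothetical samples: identical except for the size of the largest value.
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 1000]
without_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9]

print(quantile_by_check_loss(with_outlier, 0.5))     # 5
print(quantile_by_check_loss(without_outlier, 0.5))  # 5
```

The estimate does not move when the largest value changes from 9 to 1000, which is the robustness being recommended here.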

Rick_SAS
SAS Super FREQ

I think LVM might be referring to this post about how to use the ROBUSTREG procedure to compute robust estimates of univariate quantities.

 

To reiterate LVM's point:  Excluding data is a slippery slope.  I like to say that we can use robust methods to IDENTIFY outliers and to construct estimates that are ROBUST to outliers, but I rarely use the phrase "exclude outliers."
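As an illustration of an estimate that is robust to outliers without excluding anything, here is a Python sketch (not SAS code) of a Huber-type M-estimate of location, computed by iteratively reweighted averaging. This is one common M-estimator and is not claimed to match what ROBUSTREG does by default; the tuning constant 1.345, the fixed MAD-based scale, and the function name are assumptions made for this sketch.

```python
import statistics


def huber_location(values, c=1.345, tol=1e-9, max_iter=100):
    """Huber-type location estimate with the scale fixed at 1.4826 * MAD."""
    values = list(values)
    mu = statistics.median(values)
    mad = statistics.median([abs(v - mu) for v in values])
    scale = 1.4826 * mad
    if scale == 0:
        return mu  # more than half the values are identical
    for _ in range(max_iter):
        weights = []
        for v in values:
            r = abs(v - mu) / scale
            # Full weight near the center, reduced weight for distant values.
            weights.append(1.0 if r <= c else c / r)
        new_mu = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


# Hypothetical sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
estimate = huber_location(sample)
print(estimate)
```

Every observation still contributes to the estimate; the extreme value is simply given a small weight, and observations with small weights are natural candidates to inspect as possible outliers.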

plf515
Lapis Lazuli | Level 10

You've gotten some good advice regarding MEANS, especially the advice to BE CAREFUL.

 

In your question, though, you used MEANS only as an example. If you want to do regression, I suggest either QUANTREG or ROBUSTREG.

 

Oddly enough, I wrote a paper on these: "Should more of your PROC REGs be QUANTREGs and ROBUSTREGs?"

 

In addition, note that the outliers are often the interesting stuff. Automatically excluding outliers would mean that we would never have discovered black holes. :-)
