Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Analytics
- /
- Stat Procs
- /
- Outlier detection with ROBUSTREG

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 08-24-2015 12:08 PM
(2738 views)

Sometimes I would like to have some *near automatic* outlier detection tool. I have in the past trusted ROBUSTREG for that. The following example (run with sas/stat 13.1) makes me wonder if that was so wise...

**data test;**

**input group $ value @@;**

**datalines;**

**A 45.42 A 49.57 A 42.83 A 26.75 A 44.27 A 210.6 A 44.77 A 56.61 A 46.28 A**

**40.34 A 39.98 A 2943 A 48.07 A 37.85 B 39.44 B 33.70 B 29.52 B 48.46 B**

**72.26 B 37.40 B 42.26 B 46.35 B 37.92 B 51.14 B 46.09 B 946.4 B**

**27.96 B 937.3 B 63.94 B 56.43 C 135.2 C 140.5 C 59.95 C 47.26**

**C 83.87 C 38.97 C 416.8 C 48.40 C 58.31 C 33.75 C 60.81**

**C 48.75 C 47.87 C 77.29 C 52.63 C 33.22 D 58.97 D 61.19**

**D 15.81 D 51.29 D 59.18 D 25.59 D 54.24 D 38.08 D 60.52**

**D 44.04 D 41.06 D 89.18 D 39.36 D 35.28 D 38.78 D 28.52**

**E 70.79 E 64.98 E 52.34 E 61.38 E 68.33 E 74.93 E 58.81**

**E 77.25 E 84.37 E 46.38 E 42.92 E 62.02 E 34.14 E 49.33**

**E 77.78 F 212.5 F 54.76 F 73.75 F 46.41 F 60.10 F 57.28**

**F 72.16 F 38.20 F 37.43 F 85.98 F 35.17 F 23.05 F 30.74**

**F 55.95 F 36.64**

**;**

**proc robustreg data=test;**

**class group;**

**model value = group / cutoff=4;**

**output outlier=outlier out=outliers;**

**run;**

**title;**

**ods listing image_dpi=300;**

**ods graphics / height=500 width=600;**

**proc sgplot data=outliers;**

**styleattrs datasymbols=(X Circle) datacontrastcolors=(Red Blue);**

**vbox value / category=group nomean nooutliers;**

**scatter x=group y=value / group=outlier;**

**yaxis type=log;**

**run;**

<insert picture here>

What is your favorite tool/method for "automatic" outlier detection?

PG

PG

13 REPLIES 13

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I have seen this behavior of PROC ROBUSTREG as well, and I don't really know why it happens or what to do about it.

My general answer to my clients is that there is no such thing as a foolproof outlier detection method, but that really doesn't help in obtaining the proper analysis, does it?

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I agree. Somehow, the cost of data analysis has to follow the lowering cost of samplers, multi-analyzers, sensors and recording devices. We need robust statistical screening tools and methods that will identify data distributions and exclude outliers. Otherwise, outlier screening will become an expensive procedure that most will go without.

PG

PG

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

The default METHOD=M is not my favorite algoritm, and I rarely use it. If you use METHOD=LTS (or MM or S) you will get more reasonable results. (I almost always start with LTS.)

I don't know why the M estimation doesn't work well for these data, but I think it is the way that M estimation handles the generalized MCD algorithm, as explained in the section "Low-Dimensional Structure":

I don't understand the procedure well enough to offer any mathematical insights.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thank you Rick for your insights.

Maybe I should do more exploration, despite the recommendation (which I interpreted as a warning) from the documentation :

"For a model that includes classification independent variables or continuous independent variables with a few unequal values, the M method is recommended."

PG

PG

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

PG,

You ought to take a look at IML documentation.

There is a special chapter to talk about outlier.

Chapter 12

Robust Regression Examples

data test; input group $ value @@; datalines; A 45.42 A 49.57 A 42.83 A 26.75 A 44.27 A 210.6 A 44.77 A 56.61 A 46.28 A 40.34 A 39.98 A 2943 A 48.07 A 37.85 B 39.44 B 33.70 B 29.52 B 48.46 B 72.26 B 37.40 B 42.26 B 46.35 B 37.92 B 51.14 B 46.09 B 946.4 B 27.96 B 937.3 B 63.94 B 56.43 C 135.2 C 140.5 C 59.95 C 47.26 C 83.87 C 38.97 C 416.8 C 48.40 C 58.31 C 33.75 C 60.81 C 48.75 C 47.87 C 77.29 C 52.63 C 33.22 D 58.97 D 61.19 D 15.81 D 51.29 D 59.18 D 25.59 D 54.24 D 38.08 D 60.52 D 44.04 D 41.06 D 89.18 D 39.36 D 35.28 D 38.78 D 28.52 E 70.79 E 64.98 E 52.34 E 61.38 E 68.33 E 74.93 E 58.81 E 77.25 E 84.37 E 46.38 E 42.92 E 62.02 E 34.14 E 49.33 E 77.78 F 212.5 F 54.76 F 73.75 F 46.41 F 60.10 F 57.28 F 72.16 F 38.20 F 37.43 F 85.98 F 35.17 F 23.05 F 30.74 F 55.95 F 36.64 ; run; proc iml; use test; read all var {value} into y where(group='A'); optn = j(9,1,.); call lms(scLMS, coefLMS, wgtLMS, optn, y); call lts(scLTS, coefLTS, wgtLTS, optn, y); LMSOutliers = loc(wgtLMS[1,]=0); LTSOutliers = loc(wgtLTS[1,]=0); print 'Group A' ,LMSOutliers, LTSOutliers; quit;

Xia Keshan

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Well that's very interesting, the IML routines give different answers on this data than the ROBUSTREG LTS or LMS method. I assume some default option is different between the two, although I would have previously guessed that they would give the same result.

But whether you do it in IML or PROC ROBUSTREG, the LTS and LMS do not display the problem that the METHOD=M example from PG started this thread with.

Seems like I'm going to have to do some very detailed reading, as I'd sure like to understand what is happening.

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

PG,

Here is what I found in STAT documentation. M is NOT robust method.

*M estimation, which was introduced by Huber (1973), is the simplest approach both computationally*

*and theoretically. Although it is not robust with respect to leverage points, it is still used extensively in*

*analyzing data for which it can be assumed that the contamination is mainly in the response direction.*

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Is it possible that somehow the result provided by PGstats at the start of this thread is because M-estimation thinks that GROUP=A is a high leverage point? I mean, why else would the results from Group A be what they are under M-estimation? I realize that with categorical explanatory variables there shouldn't be high-leverage points, but I can't think of another reason that M-estimation performs the way it does on Group A.

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I had that same thought, which is why I linked to the doc section on "Low-Dimensional Structure", which includes the phrase "Low-dimensional structure is often induced by classification covariates." That whole paragraph seems relevant. Also the section on the "Robust MCD algorithm."

It seems that obs in Group=A is getting classified as off-plane leverage points. When you turn on the LEVERAGE option, you can see that all of Group=A observations are marked as high-leverage points. Because the M estimation method is not robust to high-leverage points, I think that is the problem.

To help you diagnose the problem, you might want to specify LEVERAGE(OPC MCDINFO) on the MODEL option. It gives additional info about the MCD algorithm.

Last remark: If you omit one of the groups, such as by using

WHERE group in ('A' 'B' 'C' 'D' 'E');

then the problem goes away, even with the M estimation. Thus the size of the groups (relative to ...what? Breakdown value? Each other?) seems to be important. This helps point out how PG's example differs from the "Robust ANOVA" example in the doc.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thanks, Rick. I'll have to read that carefully. It sure seems to me that this is either a bug or a poor design in the case of all categorical independent variables.

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

The number of options available with M-estimation testifies to the heuristic nature of the algorithm. Although I would very much like to see such an exercise, it is unlikely that PROC ROBUSTREG's version can be honestly compared with another implementation of M-estimation.

PG

PG

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. **Registration is now open through August 30th**. Visit the SAS Hackathon homepage.

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.