Outlier detection with ROBUSTREG

PGStats · Posted 08-24-2015 12:08 PM

Sometimes I would like to have some near automatic outlier detection tool. I have in the past trusted ROBUSTREG for that. The following example (run with sas/stat 13.1) makes me wonder if that was so wise...

data test;

input group $ value @@;

datalines;

A 45.42 A 49.57 A 42.83 A 26.75 A 44.27 A 210.6 A 44.77 A 56.61 A 46.28 A

40.34 A 39.98 A 2943 A 48.07 A 37.85 B 39.44 B 33.70 B 29.52 B 48.46 B

72.26 B 37.40 B 42.26 B 46.35 B 37.92 B 51.14 B 46.09 B 946.4 B

27.96 B 937.3 B 63.94 B 56.43 C 135.2 C 140.5 C 59.95 C 47.26

C 83.87 C 38.97 C 416.8 C 48.40 C 58.31 C 33.75 C 60.81

C 48.75 C 47.87 C 77.29 C 52.63 C 33.22 D 58.97 D 61.19

D 15.81 D 51.29 D 59.18 D 25.59 D 54.24 D 38.08 D 60.52

D 44.04 D 41.06 D 89.18 D 39.36 D 35.28 D 38.78 D 28.52

E 70.79 E 64.98 E 52.34 E 61.38 E 68.33 E 74.93 E 58.81

E 77.25 E 84.37 E 46.38 E 42.92 E 62.02 E 34.14 E 49.33

E 77.78 F 212.5 F 54.76 F 73.75 F 46.41 F 60.10 F 57.28

F 72.16 F 38.20 F 37.43 F 85.98 F 35.17 F 23.05 F 30.74

F 55.95 F 36.64

;

proc robustreg data=test;

class group;

model value = group / cutoff=4;

output outlier=outlier out=outliers;

run;

title;

ods listing image_dpi=300;

ods graphics / height=500 width=600;

proc sgplot data=outliers;

styleattrs datasymbols=(X Circle) datacontrastcolors=(Red Blue);

vbox value / category=group nomean nooutliers;

scatter x=group y=value / group=outlier;

yaxis type=log;

run;

What is your favorite tool/method for "automatic" outlier detection?

PG

PaigeMiller · Posted 08-24-2015 12:21 PM

I have seen this behavior of PROC ROBUSTREG as well, and I don't really know why it happens or what to do about it.

My general answer to my clients is that there is no such thing as a foolproof outlier detection method, but that really doesn't help in obtaining the proper analysis, does it?

--
Paige Miller

PGStats · Posted 08-24-2015 03:29 PM

I agree. Somehow, the cost of data analysis has to follow the lowering cost of samplers, multi-analyzers, sensors and recording devices. We need robust statistical screening tools and methods that will identify data distributions and exclude outliers. Otherwise, outlier screening will become an expensive procedure that most will go without.

PG

Rick_SAS · Posted 08-24-2015 02:09 PM

The default METHOD=M is not my favorite algoritm, and I rarely use it. If you use METHOD=LTS (or MM or S) you will get more reasonable results. (I almost always start with LTS.)

I don't know why the M estimation doesn't work well for these data, but I think it is the way that M estimation handles the generalized MCD algorithm, as explained in the section "Low-Dimensional Structure":

SAS/STAT(R) 14.1 User's Guide

I don't understand the procedure well enough to offer any mathematical insights.

PGStats · Posted 08-24-2015 03:14 PM

Thank you Rick for your insights.

Maybe I should do more exploration, despite the recommendation (which I interpreted as a warning) from the documentation :

"For a model that includes classification independent variables or continuous independent variables with a few unequal values, the M method is recommended."

PG

Rick_SAS · Posted 08-24-2015 07:28 PM

Hmm, I never saw that. That's a pretty strong recommendation. My preference was formed from a small set of empirical experiences from when I first looked at robust models. Your mileage mayvary.

Ksharp · Posted 08-26-2015 01:45 AM

PG,

You ought to take a look at IML documentation.

There is a special chapter to talk about outlier.

Chapter 12

Robust Regression Examples

data test;
input group $ value @@;
datalines;
A 45.42 A 49.57 A 42.83 A 26.75 A 44.27 A 210.6 A 44.77 A 56.61 A 46.28 A
40.34 A 39.98 A 2943 A 48.07 A 37.85 B 39.44 B 33.70 B 29.52 B 48.46 B
72.26 B 37.40 B 42.26 B 46.35 B 37.92 B 51.14 B 46.09 B 946.4 B
27.96 B 937.3 B 63.94 B 56.43 C 135.2 C 140.5 C 59.95 C 47.26
C 83.87 C 38.97 C 416.8 C 48.40 C 58.31 C 33.75 C 60.81
C 48.75 C 47.87 C 77.29 C 52.63 C 33.22 D 58.97 D 61.19
D 15.81 D 51.29 D 59.18 D 25.59 D 54.24 D 38.08 D 60.52
D 44.04 D 41.06 D 89.18 D 39.36 D 35.28 D 38.78 D 28.52
E 70.79 E 64.98 E 52.34 E 61.38 E 68.33 E 74.93 E 58.81
E 77.25 E 84.37 E 46.38 E 42.92 E 62.02 E 34.14 E 49.33
E 77.78 F 212.5 F 54.76 F 73.75 F 46.41 F 60.10 F 57.28
F 72.16 F 38.20 F 37.43 F 85.98 F 35.17 F 23.05 F 30.74
F 55.95 F 36.64
;
run;



proc iml;
use test;
read all var {value} into y where(group='A');
optn = j(9,1,.);
call lms(scLMS, coefLMS, wgtLMS, optn, y);
call lts(scLTS, coefLTS, wgtLTS, optn, y);
LMSOutliers = loc(wgtLMS[1,]=0);
LTSOutliers = loc(wgtLTS[1,]=0);
print 'Group A' ,LMSOutliers, LTSOutliers;

quit;

Xia Keshan

PaigeMiller · Posted 08-26-2015 08:58 AM

Well that's very interesting, the IML routines give different answers on this data than the ROBUSTREG LTS or LMS method. I assume some default option is different between the two, although I would have previously guessed that they would give the same result.

But whether you do it in IML or PROC ROBUSTREG, the LTS and LMS do not display the problem that the METHOD=M example from PG started this thread with.

Seems like I'm going to have to do some very detailed reading, as I'd sure like to understand what is happening.

--
Paige Miller

Ksharp · Posted 08-27-2015 03:06 AM

PG,

Here is what I found in STAT documentation. M is NOT robust method.

M estimation, which was introduced by Huber (1973), is the simplest approach both computationally

and theoretically. Although it is not robust with respect to leverage points, it is still used extensively in

analyzing data for which it can be assumed that the contamination is mainly in the response direction.

Rick_SAS · Posted 08-27-2015 07:42 AM

M-Estimation is robust with regards to outliers, just not high-leverage points. In the regression context, "outliers" means "uncharacteristic values of the response variable," whereas high-leverage points are uncharacteristic values of the explanatory variables.

PaigeMiller · Posted 08-27-2015 09:37 AM

Is it possible that somehow the result provided by PGstats at the start of this thread is because M-estimation thinks that GROUP=A is a high leverage point? I mean, why else would the results from Group A be what they are under M-estimation? I realize that with categorical explanatory variables there shouldn't be high-leverage points, but I can't think of another reason that M-estimation performs the way it does on Group A.

--
Paige Miller

Rick_SAS · Posted 08-27-2015 05:07 PM

I had that same thought, which is why I linked to the doc section on "Low-Dimensional Structure", which includes the phrase "Low-dimensional structure is often induced by classification covariates." That whole paragraph seems relevant. Also the section on the "Robust MCD algorithm."

It seems that obs in Group=A is getting classified as off-plane leverage points. When you turn on the LEVERAGE option, you can see that all of Group=A observations are marked as high-leverage points. Because the M estimation method is not robust to high-leverage points, I think that is the problem.

To help you diagnose the problem, you might want to specify LEVERAGE(OPC MCDINFO) on the MODEL option. It gives additional info about the MCD algorithm.

Last remark: If you omit one of the groups, such as by using

WHERE group in ('A' 'B' 'C' 'D' 'E');

then the problem goes away, even with the M estimation. Thus the size of the groups (relative to ...what? Breakdown value? Each other?) seems to be important. This helps point out how PG's example differs from the "Robust ANOVA" example in the doc.

PaigeMiller · Posted 08-28-2015 08:32 AM

Thanks, Rick. I'll have to read that carefully. It sure seems to me that this is either a bug or a poor design in the case of all categorical independent variables.

--
Paige Miller

PGStats · Posted 08-28-2015 11:00 AM

The number of options available with M-estimation testifies to the heuristic nature of the algorithm. Although I would very much like to see such an exercise, it is unlikely that PROC ROBUSTREG's version can be honestly compared with another implementation of M-estimation.

PG

Outlier detection with ROBUSTREG

Re: Outlier detection with ROBUSTREG

Re: Outlier detection with ROBUSTREG

Re: Outlier detection with ROBUSTREG

Re: Outlier detection with ROBUSTREG

Re: Outlier detection with ROBUSTREG

Re: Outlier detection with ROBUSTREG

Re: Outlier detection with ROBUSTREG

Re: Outlier detection with ROBUSTREG

Re: Outlier detection with ROBUSTREG

Re: Outlier detection with ROBUSTREG

Re: Outlier detection with ROBUSTREG

Re: Outlier detection with ROBUSTREG

Re: Outlier detection with ROBUSTREG

Catch up on SAS Innovate 2026