Programming the statistical procedures from SAS

Outlier detection with ROBUSTREG

Reply
Respected Advisor
Posts: 4,606

Outlier detection with ROBUSTREG

Sometimes I would like to have some near automatic outlier detection tool. I have in the past trusted ROBUSTREG for that. The following example (run with sas/stat 13.1) makes me wonder if that was so wise...

data test;

input group $ value @@;

datalines;

A 45.42 A 49.57 A 42.83 A 26.75 A 44.27 A 210.6 A 44.77 A 56.61 A 46.28 A

40.34 A 39.98 A 2943 A 48.07 A 37.85 B 39.44 B 33.70 B 29.52 B 48.46 B

72.26 B 37.40 B 42.26 B 46.35 B 37.92 B 51.14 B 46.09 B 946.4 B

27.96 B 937.3 B 63.94 B 56.43 C 135.2 C 140.5 C 59.95 C 47.26

C 83.87 C 38.97 C 416.8 C 48.40 C 58.31 C 33.75 C 60.81

C 48.75 C 47.87 C 77.29 C 52.63 C 33.22 D 58.97 D 61.19

D 15.81 D 51.29 D 59.18 D 25.59 D 54.24 D 38.08 D 60.52

D 44.04 D 41.06 D 89.18 D 39.36 D 35.28 D 38.78 D 28.52

E 70.79 E 64.98 E 52.34 E 61.38 E 68.33 E 74.93 E 58.81

E 77.25 E 84.37 E 46.38 E 42.92 E 62.02 E 34.14 E 49.33

E 77.78 F 212.5 F 54.76 F 73.75 F 46.41 F 60.10 F 57.28

F 72.16 F 38.20 F 37.43 F 85.98 F 35.17 F 23.05 F 30.74

F 55.95 F 36.64

;

proc robustreg data=test;

class group;

model value = group / cutoff=4;

output outlier=outlier out=outliers;

run;

title;

ods listing image_dpi=300;

ods graphics / height=500 width=600;

proc sgplot data=outliers;

styleattrs datasymbols=(X Circle) datacontrastcolors=(Red Blue);

vbox value / category=group nomean nooutliers;

scatter x=group y=value / group=outlier;

yaxis type=log;

run;

<insert picture here>

What is your favorite tool/method for "automatic" outlier detection?

PG

PG
Trusted Advisor
Posts: 1,444

Re: Outlier detection with ROBUSTREG

I have seen this behavior of PROC ROBUSTREG as well, and I don't really know why it happens or what to do about it.

My general answer to my clients is that there is no such thing as a foolproof outlier detection method, but that really doesn't help in obtaining the proper analysis, does it?

Respected Advisor
Posts: 4,606

Re: Outlier detection with ROBUSTREG

I agree. Somehow, the cost of data analysis has to follow the lowering cost of samplers, multi-analyzers, sensors and recording devices. We need robust statistical screening tools and methods that will identify data distributions and exclude outliers. Otherwise, outlier screening will become an expensive procedure that most will go without.

PG

PG
SAS Super FREQ
Posts: 3,319

Re: Outlier detection with ROBUSTREG

The default METHOD=M is not my favorite algoritm, and I rarely use it. If you use METHOD=LTS (or MM or S) you will get more reasonable results.  (I almost always start with LTS.)

I don't know why the M estimation doesn't work well for these data, but I think it is the way that M estimation handles the generalized MCD algorithm, as explained in the section "Low-Dimensional Structure":

SAS/STAT(R) 14.1 User's Guide

I don't understand the procedure well enough to offer any mathematical insights.

Respected Advisor
Posts: 4,606

Re: Outlier detection with ROBUSTREG

Thank you Rick for your insights.

Maybe I should do more exploration, despite the recommendation (which I interpreted as a warning) from the documentation :

"For a model that includes classification independent variables or continuous independent variables with a few unequal values, the M method is recommended."

PG

PG
SAS Super FREQ
Posts: 3,319

Re: Outlier detection with ROBUSTREG

Hmm, I never saw that. That's a pretty strong recommendation. My preference was formed from a small set of empirical experiences from when I first looked at robust models. Your mileage mayvary.

Grand Advisor
Posts: 9,475

Re: Outlier detection with ROBUSTREG

PG,

You ought to take a look at IML documentation.

There is a special chapter to talk about outlier.

Chapter 12

Robust Regression Examples

data test;
input group $ value @@;
datalines;
A 45.42 A 49.57 A 42.83 A 26.75 A 44.27 A 210.6 A 44.77 A 56.61 A 46.28 A
40.34 A 39.98 A 2943 A 48.07 A 37.85 B 39.44 B 33.70 B 29.52 B 48.46 B
72.26 B 37.40 B 42.26 B 46.35 B 37.92 B 51.14 B 46.09 B 946.4 B
27.96 B 937.3 B 63.94 B 56.43 C 135.2 C 140.5 C 59.95 C 47.26
C 83.87 C 38.97 C 416.8 C 48.40 C 58.31 C 33.75 C 60.81
C 48.75 C 47.87 C 77.29 C 52.63 C 33.22 D 58.97 D 61.19
D 15.81 D 51.29 D 59.18 D 25.59 D 54.24 D 38.08 D 60.52
D 44.04 D 41.06 D 89.18 D 39.36 D 35.28 D 38.78 D 28.52
E 70.79 E 64.98 E 52.34 E 61.38 E 68.33 E 74.93 E 58.81
E 77.25 E 84.37 E 46.38 E 42.92 E 62.02 E 34.14 E 49.33
E 77.78 F 212.5 F 54.76 F 73.75 F 46.41 F 60.10 F 57.28
F 72.16 F 38.20 F 37.43 F 85.98 F 35.17 F 23.05 F 30.74
F 55.95 F 36.64
;
run;



proc iml;
use test;
read all var {value} into y where(group='A');
optn = j(9,1,.);
call lms(scLMS, coefLMS, wgtLMS, optn, y);
call lts(scLTS, coefLTS, wgtLTS, optn, y);
LMSOutliers = loc(wgtLMS[1,]=0);
LTSOutliers = loc(wgtLTS[1,]=0);
print 'Group A' ,LMSOutliers, LTSOutliers;

quit;


x.png

Xia Keshan

Trusted Advisor
Posts: 1,444

Re: Outlier detection with ROBUSTREG

Well that's very interesting, the IML routines give different answers on this data than the ROBUSTREG LTS or LMS method. I assume some default option is different between the two, although I would have previously guessed that they would give the same result.

But whether you do it in IML or PROC ROBUSTREG, the LTS and LMS do not display the problem that the METHOD=M example from PG started this thread with.

Seems like I'm going to have to do some very detailed reading, as I'd sure like to understand what is happening.

Grand Advisor
Posts: 9,475

Re: Outlier detection with ROBUSTREG

PG,

Here is what I found in STAT documentation. M is NOT robust method.

M estimation, which was introduced by Huber (1973), is the simplest approach both computationally

and theoretically. Although it is not robust with respect to leverage points, it is still used extensively in

analyzing data for which it can be assumed that the contamination is mainly in the response direction.

SAS Super FREQ
Posts: 3,319

Re: Outlier detection with ROBUSTREG

M-Estimation is robust with regards to outliers, just not high-leverage points. In the regression context, "outliers" means "uncharacteristic values of the response variable," whereas high-leverage points are uncharacteristic values of the explanatory variables.

Trusted Advisor
Posts: 1,444

Re: Outlier detection with ROBUSTREG

Is it possible that somehow the result provided by PGstats at the start of this thread is because M-estimation thinks that GROUP=A is a high leverage point? I mean, why else would the results from Group A be what they are under M-estimation? I realize that with categorical explanatory variables there shouldn't be high-leverage points, but I can't think of another reason that M-estimation performs the way it does on Group A.

SAS Super FREQ
Posts: 3,319

Re: Outlier detection with ROBUSTREG

I had that same thought, which is why I linked to the doc section on "Low-Dimensional Structure", which includes the phrase "Low-dimensional structure is often induced by classification covariates."  That whole paragraph seems relevant. Also the section on the "Robust MCD algorithm."

It seems that obs in Group=A is getting classified as off-plane leverage points. When you turn on the LEVERAGE option, you can see that all of Group=A observations are marked as high-leverage points. Because the M estimation method is not robust to high-leverage points, I think that is the problem.

To help you diagnose the problem, you might want to specify LEVERAGE(OPC MCDINFO) on the MODEL option.  It gives additional info about the MCD algorithm.

Last remark:  If you omit one of the groups, such as by using

WHERE group in ('A' 'B' 'C' 'D' 'E');

then the problem goes away, even with the M estimation.  Thus the size of the groups (relative to ...what? Breakdown value? Each other?) seems to be important.  This helps point out how PG's example differs from the "Robust ANOVA" example in the doc.

Trusted Advisor
Posts: 1,444

Re: Outlier detection with ROBUSTREG

Thanks, Rick. I'll have to read that carefully. It sure seems to me that this is either a bug or a poor design in the case of all categorical independent variables.

Respected Advisor
Posts: 4,606

Re: Outlier detection with ROBUSTREG

The number of options available with M-estimation testifies to the heuristic nature of the algorithm. Although I would very much like to see such an exercise, it is unlikely that PROC ROBUSTREG's version can be honestly compared with another implementation of M-estimation.

PG

PG
Ask a Question
Discussion stats
  • 13 replies
  • 528 views
  • 3 likes
  • 4 in conversation