PGStats
Opal | Level 21

Sometimes I would like to have a near-automatic outlier detection tool. In the past I have trusted PROC ROBUSTREG for that. The following example (run with SAS/STAT 13.1) makes me wonder whether that was so wise...

data test;
input group $ value @@;
datalines;
A 45.42 A 49.57 A 42.83 A 26.75 A 44.27 A 210.6 A 44.77 A 56.61 A 46.28 A
40.34 A 39.98 A 2943 A 48.07 A 37.85 B 39.44 B 33.70 B 29.52 B 48.46 B
72.26 B 37.40 B 42.26 B 46.35 B 37.92 B 51.14 B 46.09 B 946.4 B
27.96 B 937.3 B 63.94 B 56.43 C 135.2 C 140.5 C 59.95 C 47.26
C 83.87 C 38.97 C 416.8 C 48.40 C 58.31 C 33.75 C 60.81
C 48.75 C 47.87 C 77.29 C 52.63 C 33.22 D 58.97 D 61.19
D 15.81 D 51.29 D 59.18 D 25.59 D 54.24 D 38.08 D 60.52
D 44.04 D 41.06 D 89.18 D 39.36 D 35.28 D 38.78 D 28.52
E 70.79 E 64.98 E 52.34 E 61.38 E 68.33 E 74.93 E 58.81
E 77.25 E 84.37 E 46.38 E 42.92 E 62.02 E 34.14 E 49.33
E 77.78 F 212.5 F 54.76 F 73.75 F 46.41 F 60.10 F 57.28
F 72.16 F 38.20 F 37.43 F 85.98 F 35.17 F 23.05 F 30.74
F 55.95 F 36.64
;

proc robustreg data=test;
class group;
model value = group / cutoff=4;
output outlier=outlier out=outliers;
run;

title;
ods listing image_dpi=300;
ods graphics / height=500 width=600;

proc sgplot data=outliers;
styleattrs datasymbols=(X Circle) datacontrastcolors=(Red Blue);
vbox value / category=group nomean nooutliers;
scatter x=group y=value / group=outlier;
yaxis type=log;
run;

<insert picture here>

What is your favorite tool/method for "automatic" outlier detection?

PG
13 REPLIES
PaigeMiller
Diamond | Level 26

I have seen this behavior of PROC ROBUSTREG as well, and I don't really know why it happens or what to do about it.

My general answer to my clients is that there is no such thing as a foolproof outlier detection method, but that really doesn't help in obtaining the proper analysis, does it?

--
Paige Miller
PGStats
Opal | Level 21

I agree. Somehow, the cost of data analysis has to keep pace with the falling cost of samplers, multi-analyzers, sensors, and recording devices. We need robust statistical screening tools and methods that will identify data distributions and exclude outliers. Otherwise, outlier screening will become an expensive procedure that most will go without.

PG
Rick_SAS
SAS Super FREQ

The default METHOD=M is not my favorite algorithm, and I rarely use it. If you use METHOD=LTS (or MM or S) you will get more reasonable results. (I almost always start with LTS.)

I don't know why the M estimation doesn't work well for these data, but I think it is the way that M estimation handles the generalized MCD algorithm, as explained in the section "Low-Dimensional Structure":

SAS/STAT(R) 14.1 User's Guide

I don't understand the procedure well enough to offer any mathematical insights.
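
For concreteness, here is a sketch of what I mean; it just swaps the method in PG's original call (the output data set name is arbitrary):

proc robustreg data=test method=lts;
   class group;
   model value = group / cutoff=4;    /* same model and cutoff as PG's code */
   output out=outliers_lts outlier=outlier;
run;

METHOD=MM or METHOD=S can be substituted on the PROC statement in the same way.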

PGStats
Opal | Level 21

Thank you Rick for your insights.

Maybe I should do more exploration, despite the recommendation (which I interpreted as a warning) from the documentation:

"For a model that includes classification independent variables or continuous independent variables with a few unequal values, the M method is recommended."

PG
Rick_SAS
SAS Super FREQ

Hmm, I never saw that. That's a pretty strong recommendation. My preference was formed from a small set of empirical experiences from when I first looked at robust models. Your mileage may vary.

Ksharp
Super User

PG,

You ought to take a look at the IML documentation. There is a chapter devoted specifically to outliers:

Chapter 12: Robust Regression Examples

data test;
input group $ value @@;
datalines;
A 45.42 A 49.57 A 42.83 A 26.75 A 44.27 A 210.6 A 44.77 A 56.61 A 46.28 A
40.34 A 39.98 A 2943 A 48.07 A 37.85 B 39.44 B 33.70 B 29.52 B 48.46 B
72.26 B 37.40 B 42.26 B 46.35 B 37.92 B 51.14 B 46.09 B 946.4 B
27.96 B 937.3 B 63.94 B 56.43 C 135.2 C 140.5 C 59.95 C 47.26
C 83.87 C 38.97 C 416.8 C 48.40 C 58.31 C 33.75 C 60.81
C 48.75 C 47.87 C 77.29 C 52.63 C 33.22 D 58.97 D 61.19
D 15.81 D 51.29 D 59.18 D 25.59 D 54.24 D 38.08 D 60.52
D 44.04 D 41.06 D 89.18 D 39.36 D 35.28 D 38.78 D 28.52
E 70.79 E 64.98 E 52.34 E 61.38 E 68.33 E 74.93 E 58.81
E 77.25 E 84.37 E 46.38 E 42.92 E 62.02 E 34.14 E 49.33
E 77.78 F 212.5 F 54.76 F 73.75 F 46.41 F 60.10 F 57.28
F 72.16 F 38.20 F 37.43 F 85.98 F 35.17 F 23.05 F 30.74
F 55.95 F 36.64
;
run;



proc iml;
use test;
read all var {value} into y where(group='A');
optn = j(9,1,.);
call lms(scLMS, coefLMS, wgtLMS, optn, y);
call lts(scLTS, coefLTS, wgtLTS, optn, y);
LMSOutliers = loc(wgtLMS[1,]=0);
LTSOutliers = loc(wgtLTS[1,]=0);
print 'Group A' ,LMSOutliers, LTSOutliers;

quit;
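
If you want to screen every group at once, the same call can be looped over the group labels. This is only a sketch along the same lines, using the same default optn vector:

proc iml;
use test;
read all var {group value};
close test;
optn = j(9,1,.);                            /* default LTS options, as above */
levels = unique(group);                     /* the group labels A through F  */
do i = 1 to ncol(levels);
   y = value[loc(group = levels[i]), ];     /* responses for this group      */
   call lts(sc, coef, wgt, optn, y);
   idx = loc(wgt[1,] = 0);                  /* weight 0 marks an LTS outlier */
   if ncol(idx) > 0 then print (levels[i]) idx;
end;
quit;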



Xia Keshan

PaigeMiller
Diamond | Level 26

Well, that's very interesting: the IML routines give different answers on these data than the ROBUSTREG LTS or LMS methods. I assume some default option differs between the two, although I would previously have guessed that they would give the same result.

But whether you do it in IML or PROC ROBUSTREG, LTS and LMS do not display the problem shown in the METHOD=M example with which PG started this thread.

Seems like I'm going to have to do some very detailed reading, as I'd sure like to understand what is happening.
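
One thing I want to keep in mind while reading: the IML call above fits Group A by itself, while the ROBUSTREG model uses all six groups, so the two fits are not strictly comparable even before any default options come into play. A sketch of the side-by-side look I have in mind (data set names are arbitrary):

proc robustreg data=test method=lts;
   class group;
   model value = group;
   output out=lts_check outlier=outlier;
run;

/* which Group A observations does ROBUSTREG's LTS flag? */
proc print data=lts_check;
   where group = 'A' and outlier = 1;
   var group value;
run;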

--
Paige Miller
Ksharp
Super User

PG,

Here is what I found in the SAS/STAT documentation. M is NOT a robust method:

"M estimation, which was introduced by Huber (1973), is the simplest approach both computationally and theoretically. Although it is not robust with respect to leverage points, it is still used extensively in analyzing data for which it can be assumed that the contamination is mainly in the response direction."

Rick_SAS
SAS Super FREQ

M estimation is robust with regard to outliers, just not to high-leverage points. In the regression context, "outliers" means "uncharacteristic values of the response variable," whereas high-leverage points are uncharacteristic values of the explanatory variables.

PaigeMiller
Diamond | Level 26

Is it possible that the result PGStats reported at the start of this thread arises because M estimation treats the GROUP=A observations as high-leverage points? I mean, why else would the results for Group A be what they are under M estimation? I realize that with categorical explanatory variables there shouldn't be high-leverage points, but I can't think of another reason that M estimation performs the way it does on Group A.

--
Paige Miller
Rick_SAS
SAS Super FREQ

I had that same thought, which is why I linked to the doc section on "Low-Dimensional Structure", which includes the phrase "Low-dimensional structure is often induced by classification covariates."  That whole paragraph seems relevant. Also the section on the "Robust MCD algorithm."

It seems that the observations in Group=A are getting classified as off-plane leverage points. When you turn on the LEVERAGE option, you can see that all of the Group=A observations are marked as high-leverage points. Because the M estimation method is not robust to high-leverage points, I think that is the problem.

To help you diagnose the problem, you might want to specify LEVERAGE(OPC MCDINFO) on the MODEL statement. It gives additional information about the MCD algorithm.
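
In PG's code that would look something like this (a sketch; the LEVERAGE= keyword on the OUTPUT statement also writes the leverage flag to the output data set):

proc robustreg data=test;
   class group;
   model value = group / cutoff=4 leverage(opc mcdinfo);
   output out=diag outlier=outlier leverage=leverage;
run;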

Last remark:  If you omit one of the groups, such as by using

WHERE group in ('A' 'B' 'C' 'D' 'E');

then the problem goes away, even with the M estimation.  Thus the size of the groups (relative to ...what? Breakdown value? Each other?) seems to be important.  This helps point out how PG's example differs from the "Robust ANOVA" example in the doc.
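
A sketch of that rerun, where only the WHERE statement is new relative to PG's original code:

proc robustreg data=test;
   where group in ('A' 'B' 'C' 'D' 'E');   /* omit group F */
   class group;
   model value = group / cutoff=4;
   output out=outliers_sub outlier=outlier;
run;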

PaigeMiller
Diamond | Level 26

Thanks, Rick. I'll have to read that carefully. It sure seems to me that this is either a bug or poor design for the case where all independent variables are categorical.

--
Paige Miller
PGStats
Opal | Level 21

The number of options available with M-estimation testifies to the heuristic nature of the algorithm. Although I would very much like to see such an exercise, it is unlikely that PROC ROBUSTREG's version can be honestly compared with another implementation of M-estimation.

PG

