Hi,
How can I identify outliers and remove them from my database? I used the command below to check the homoscedasticity of variance and normality of errors, as suggested by @SteveDenham but I don't know how to proceed after that.
proc glm;
class cast*drug;
model WBC = cast*drug;
means cast*drug / hovtest = levene(type=ABS) welch;
output out = resids r=residual; run;
proc univariate;
var residual;
qq plot = residual;
run;
1) proc robustreg 2) Check proc glm's OUTPUT statement , especially the COOKD H RESIDUAL .... keywords.
For more on outliers, winsorization, and trimmed means, see also:
1. Define how you're categorizing data as an outlier
2. Define rules for an outlier
3. Extract data as required
Which step are you on?
My question is:
After checking the normality of ERRORS, and since PROC UNIVARIATE identifies the normality of residuals...how do I remove the outliers from the residuals?
/* T0099710 Using Robustreg, Grubb macro, Outlier macro and Ricks tools to identify outliers from sashelp.bweight
Using Robustreg to identify outliers from sashelp.bweight
Ricks Blogs
/* T700509 Detecting outliers in SAS: Part 1: Estimating location
https://goo.gl/EGbdso
http://blogs.sas.com/content/iml/2012/01/20/detecting-outliers-in-sas-part-1-estimating-location.html
/* T700506 Detecting outliers in SAS: Part 2: Estimating scale
https://goo.gl/PhGnOU
http://blogs.sas.com/content/iml/2012/01/27/detecting-outliers-in-sas-part-2-estimating-scale.html
/* T700504 Detecting outliers in SAS: Part 3: Multivariate location and scatter
https://goo.gl/1pKqab
http://blogs.sas.com/content/iml/2012/02/02/detecting-outliers-in-sas-part-3-multivariate-location-and-scatter.html
R package outliers
https://cran.r-project.org/web/packages/outliers/outliers.pdf
My verification and validation tool does more extensive outlier detection
https://www.dropbox.com/s/po3ahepe7r7dnm7/oto_voodoo.sas?dl=0
see doc on end
HAVE a list of 50,000 infant births weights (sashelp.bweight)
=============================================================
Up to 10 obs from sashelp.bweight total obs=50,000
Obs WEIGHT
1 4111
2 3997
3 3572
4 1956
5 3515
6 3757
7 2977
8 3884
9 3629
10 3062
WANT 30 Worst outliers based on a stadardized normal using robustreg
(may want to add the 30 with the highest leverage)
====================================================================
Outlier analsys up to the 30 outliers
Robust Regression with %syfunc(int()) * sigma cuttoff and removal of expected outliers?
Obs OBS SIGMAS WEIGHT
1 45253 6.1760 240
2 4319 6.0902 284
3 17687 6.0902 284
4 18398 6.0355 312
5 6789 6.0160 322
6 7881 6.0004 330
7 1754 5.9809 340
8 44302 5.9809 340
9 11413 5.9614 350
10 5948 5.9244 369
....
25 29420 5.7742 446
26 19603 5.7586 454
27 48553 5.7586 454
28 43037 5.7586 454
29 39248 5.7527 457
30 31299 5.7469 460
GRUBS MACRO
=============
Up to 40 obs from sashelp.bweight total obs=50,000
MIN_ MAX_ MEAN_ STD_
Obs GRBTEST GRBALPHA GRBOBS GRBDROP GRBVALS GRBVALS GRBVALS GRBVALS GRBCALC GRBCRIT GRBPSTAT
1 Max 0.05 50000 34693 240 6350 3370.76 566.385 5.26010 4.75291 0.0036
2 Min 0.05 49999 45253 240 5970 3370.70 566.234 5.52898 4.75291 0.0008
3 Min 0.05 49998 4319 284 5970 3370.76 566.273 5.45101 4.75291 0.0012
4 Min 0.05 49997 17687 284 5970 3370.82 566.317 5.45070 4.75291 0.0012
5 Min 0.05 49996 18398 312 5970 3370.88 566.360 5.40095 4.75291 0.0017
6 Min 0.05 49995 6789 322 5970 3370.94 566.407 5.38296 4.75291 0.0018
7 Min 0.05 49994 7881 330 5970 3371.01 566.455 5.36849 4.75291 0.0020
8 Min 0.05 49993 1754 340 5970 3371.07 566.504 5.35048 4.75291 0.0022
9 Min 0.05 49992 44302 340 5970 3371.13 566.553 5.35012 4.75291 0.0022
10 Min 0.05 49991 11413 350 5970 3371.19 566.603 5.33211 4.75291 0.0024
11 Min 0.05 49990 16672 369 5970 3371.25 566.654 5.29820 4.75291 0.0029
12 Min 0.05 49989 15094 369 5970 3371.31 566.707 5.29782 4.75291 0.0029
13 Min 0.05 49988 5948 369 5970 3371.37 566.760 5.29743 4.75291 0.0029
14 Min 0.05 49987 38068 369 5970 3371.43 566.812 5.29704 4.75291 0.0029
15 Min 0.05 49986 26610 378 5970 3371.49 566.865 5.28078 4.75291 0.0032
WORKING SOLUTION
================
%_vdo_outlyr(
lib =sashelp
,data=bweight
,var=weight
);
%utl_Grubbs(work.bweight,weight,alpha=.05);
FULL SOLUTION
=============
* might be interesting to look at a subset;
data bweight;
set sashelp.bweight(where=(1000 < weight ));
run;quit;
%_vdo_outlyr(
lib =work
,data=bweight
,var=weight
);
30 worst outlier analsys up to the 30 out 481 outliers identified) outliers
Robust Regression with 3.2886758223 * sigma cuttoff and removal of expected outliers?
Obs OBS SIGMAS WEIGHT
1 34539 5.8076 6350
2 25551 5.0578 5970
3 14897 4.7291 1010
4 23172 4.7291 1010
5 45693 4.7094 1020
6 31525 4.7094 1020
7 17803 4.7074 1021
8 21606 4.7074 1021
9 11799 4.7074 1021
%macro _vdo_outlyr(
lib =sashelp
,data=bweight
,var=weight
);
/*
%let lib =sashelp;
%let data=bweight;
%let var=weight;
*/
%local nobs cutoff expected_outliers dotherest;
* grubs test is better but it is too computationally expensive
It recomputes thowing ot the worst outlier;
* I do have a couple of grub test macros in R and SAS;
* arguments have already been checked by the call driver;
* estimate the best cuttoff in terms of the sigma on standardized values;
* number of obs in source data;
proc sql noprint; select count(*) into :nobs from &lib..&data;quit;
%put &=nobs;
* get the best cutoff in terms of N * sigma and the expected outliers;
data _null;
* (e^x)^2) -> sqrt(log(x));
cutoff=sqrt(log(&nobs));
call symputx('cutoff',cutoff);
put cutoff=;
expected_outliers=2 * &nobs *(1-cdf('normal',cutoff));
call symputx('expected_outliers',expected_outliers);
put expected_outliers=;
run;quit;
* robust regression - only interested in outliers - leverage values might also be interesting;
ods exclude all;
ods output diagnostics=__vvdag;
proc robustreg data=&lib..&data method=MM;
model &var = /diagnostics cutoff=&cutoff /* stadardized sigma &cutoff * sigma */;
run;
ods select all;
ods listing;
* number of potential outliers;
proc sql noprint;select count(*) into :obs from __vvdag;quit;
* do we have more than the expected outliers;
%let dotherest=%sysfunc(ifc(&obs > &expected_outliers,1,0));
%put &=dotherest;
* do we have outliers;
%if &dotherest %then %do; * YES? ;
* set up for sort by abs value;
data _vvdagabs;
set __vvdag;
rresidual=abs(rresidual);
run;quit;
/* sort by abs value so we can remove the lower expected_outliers */
proc sort data=_vvdagabs out=_vvdagsrt noequals;
by rresidual;
run;quit;
/* drop the lower values */
data _vvdagsel;
set _vvdagsrt(firstobs=%sysfunc(int(&expected_outliers))); * remove expected outliers;
run;quit;
* go back to full data and get orginal value;
* get bad values;
data _vvdagget;
do until (dne);
set _vvdagsel(drop=outlier) end=dne;
rec=obs;
set &lib..&data(keep=&var) point=rec;
output;
end;
stop;
run;quit;
proc sort data=_vvdagget out=_vvdagfin noequals;
by descending rresidual;
run;quit;
title1 ' ';title2 ' ';title3 ' ' ;
TITLE4 "30 worst outlier analsys up to the 30 out &obs outliers identified) outliers";
TITLE5 "Robust Regression with &cutoff * sigma cuttoff and removal of expected outliers?";
proc print data=_vvdagfin(obs=30 rename=rresidual=sigmas) width=min;
run;quit;
title;
%end;
%else %do;
data _null_;
file print;
put "Outlier analsys up to the 30 outliers(&obs outliers identified)";
put "Robust Regression with &cutoff * sigma cuttoff and removal of expected outliers?";
put "***********************************";
put " No outliers using robustreg";
put "***********************************";
run;quit;
%end;
proc datasets lib=work;
delete _vvdag:;
run;quit;
%mend _vdo_outlyr;
%_vdo_outlyr;
/* T002680 GRUBBS TEST FOR OUTLIERS USING SAS */
%macro utl_Grubbs(dsn,var,alpha=.1)
/des ="Alternating removal of Max and Min Outliers using Grubbs test";
/* output dataset is the input dataset with outlier flag */
/* for testing without macro
data shoes;set sashelp.shoes(keep=sales);run;
%let dsn=work.shoes;
%let var=sales;
%let alpha=.1;
*/
%local utl_grbnobs;
%put %sysfunc(ifc(%sysevalf(%superq(dsn)=,boolean),**** Please Input dataset ****,));
%put %sysfunc(ifc(%sysevalf(%superq(var)=,boolean),**** Please Input Variable ****,));
proc means data=&dsn;
var &var;
output out=__out__ mean=mean std=std max=max min=min n=n;
run;
/*------------------------------------------------------------*\
| Add a unique key so that Outliers can be removed |
\*------------------------------------------------------------*/
Data Utl_Grubbs00 / View = Utl_Grubbs00;
Set &dsn(Keep=&var);
Utl_GrbKey=_n_;
Run;
/*------------------------------------------------------------*\
| Sort so that we can trim both ends easily |
\*------------------------------------------------------------*/
Proc Sort Data=Utl_Grubbs00
Out =Utl_Grubbs01
Noequals
Force;
By &var;
run;
/*------------------------------------------------------------*\
| Need to get number of obds and Type and Length for Key |
\*------------------------------------------------------------*/
Proc Sql Noprint;
Select Put(Nobs,12.) into :Utl_GrbNobs
From SASHELP.VTable
Where Upcase("WORK") = Upcase(LibName) AND
Upcase("Utl_Grubbs01") = Upcase(MemName) AND
Upcase("Data") = Upcase(MemType);
quit;
run;
/*------------------------------------------------------------*\
| Compute the Grubbs Statistics |
\*------------------------------------------------------------*/
Data
Utl_Grubbs02
(
Keep=
GrbTest
GrbAlpha
GrbObs
GrbDrop
Min_GrbVals
Max_GrbVals
Mean_GrbVals
Std_GrbVals
GrbCalc
GrbCrit
GrbPStat
);
Retain GrbNobs &Utl_GrbNobs GrbAlpha Α
/*------------------------------------------------------------*\
| Allocate arrays to hold all data |
\*------------------------------------------------------------*/
Array GrbVals {&Utl_GrbNobs} _Temporary_;
Array GrbKeys {&Utl_GrbNobs} _Temporary_;
/*------------------------------------------------------------*\
| Load all Data into Temp Arrays - Temp Arrays can have |
| millions of elements? |
\*------------------------------------------------------------*/
GrbN=0;
Do Until ( Done );
Set Utl_Grubbs01 End=Done;
GrbN+1;
GrbKeys{GrbN} = Utl_GrbKey;
GrbVals{GrbN} = &var;
End;
/*------------------------------------------------------------*\
| Cannot use more elegant Mean(of Array), Std(Of Array) |
| SAS does not support these funtions on Temporary Arrays |
\*------------------------------------------------------------*/
GrbObs=GrbNobs;
GrbLo=1;
GrbHi=GrbObs;
Do Until ( GrbFlag = 1 ) ;
GrbFlag=0;
Link BasicStats;
* Pass=GrbLo,GrbHi,Array GrbVals
Return=Mean,Max,Min,Std;
* Put "Max Basic Stats ==> " /
GrbObs = /
GrbHi = /
GrbLo = /
Min_GrbVals = /
Max_GrbVals = /
Mean_GrbVals = /
Std_GrbVals = ;
/*------------------------------------------------------------*\
| Maximum Outlier Case |
\*------------------------------------------------------------*/
GrbTest='Max';
GrbCalc =( Max_GrbVals - Mean_GrbVals ) / Std_GrbVals ;
GrbTStat =TInv(1-&Alpha/GrbNobs,GrbNobs-2);
Link GrbCompute;
* Pass=GrbCalc,GrbTStat,GrbObs
Return=GrbCrit GrbStat GrbPStat;
* put // "Max Grubbs Stats ==> " /
GrbCrit= /
GrbStat= /
GrbDenom= /
GrbPStat=;
If GrbPStat < GrbAlpha Then
GrbHi = GrbHi - 1;
Else GrbFlag = GrbFlag + .5;
* put // GrbFlag= //;
/*------------------------------------------------------------*\
| Recompute Basic Stats without Outlier |
\*------------------------------------------------------------*/
Link BasicStats; * (GrbLo,GrbHi);
* Put // "Min Basic Stats ==> " /
GrbObs = /
GrbObs = /
GrbHi = /
Min_GrbVals = /
Max_GrbVals = /
Mean_GrbVals = /
Std_GrbVals = ;
/*------------------------------------------------------------*\
| Minimum Outlier Case |
\*------------------------------------------------------------*/
GrbTest='Min';
GrbCalc =( Mean_GrbVals - Min_GrbVals) / Std_GrbVals ;
GrbTStat =TInv(&alpha/GrbNobs,GrbNobs-2);
Link GrbCompute;
* Put // "Min Grubbs Stats ==> " /
GrbCrit= /
GrbStat= /
GrbDenom= /
GrbPStat=;
If GrbPStat < GrbAlpha Then
GrbLo = GrbLo + 1;
Else GrbFlag = GrbFlag + .5;
* put // GrbFlag= // I=;
End;
Stop;
ENDIT: Put "Cannot Have missing values for Obsevations";
Stop;
/*------------------------------------------------------------*\
| Compute Grubbs Statistics |
\*------------------------------------------------------------*/
GrbCompute:
GrbCrit =((GrbObs-1)/sqrt(GrbObs))*sqrt(GrbTStat**2/(GrbObs-2+GrbTStat**2));
GrbDenom =(GrbCalc**2*GrbObs-(GrbObs-1)**2);
Select;
When ( GrbDenom =. ) GoTo ENDIT;
When ( GrbDenom < 0 ) GrbStat=sqrt(-(GrbCalc**2*GrbObs*(GrbObs-2)/GrbDenom));
When ( GrbDenom = 0 ) GrbStat=sqrt(GrbCalc**2*GrbObs*(GrbObs-2)/GrbDenom);
OtherWise ;
End;
If GrbStat =. then GoTo ENDIT;
GrbPStat=Min(GrbObs*(1-ProbT(GrbStat,GrbObs-2)),1);
If GrbPStat < GrbAlpha Then Do;
If GrbTest='Min' Then GrbDrop=GrbKeys{GrbLo};
Else GrbDrop=GrbKeys{GrbHi};
Output;
End;
Return;
/*------------------------------------------------------------*\
| Compute MeanStandard Deviation |
\*------------------------------------------------------------*/
BasicStats:
GrbObs=( GrbHi - GrbLo + 1 );
SumSq_GrbVals=0;
Sum_GrbVals=0;
Do I = GrbLo To GrbHi;
SumSq_GrbVals + GrbVals{I}**2;
Sum_GrbVals + GrbVals{I};
End;
Std_GrbVals=Sqrt(
( GrbHi*SumSq_GrbVals - Sum_GrbVals**2 ) /
( GrbObs* ( GrbObs- 1 ) )
);
Max_GrbVals=GrbVals{GrbHi};
Min_GrbVals=GrbVals{GrbLo};
Mean_GrbVals= Sum_GrbVals / GrbObs;
Return;
Run;
Proc print Data=utl_Grubbs02 Width=Min;
Format GrbPStat 6.4;
Var
GrbTest
GrbAlpha
GrbObs
GrbDrop
Min_GrbVals
Max_GrbVals
Mean_GrbVals
Std_GrbVals
GrbCalc
GrbCrit
GrbPStat;
Run;
/*------------------------------------------------------------*\
| Put in same order as original input |
\*------------------------------------------------------------*/
Proc Sort
Data = Utl_Grubbs01
Out = Utl_Grubbs01(Index= ( Utl_GrbKey / Unique ));
By Utl_GrbKey;
Run;
/*------------------------------------------------------------*\
| Tag the Outliers ( Use SQL just to keep skills up ) |
\*------------------------------------------------------------*/
Proc Sql;
Alter Table Utl_Grubbs01
Modify _Outlier_ num;
Update Utl_Grubbs01
Set _Outlier_ =
(
Select 1
From utl_Grubbs02
Where GrbDrop = Utl_GrbKey
);
Quit;
Run;
/*------------------------------------------------------------*\
| Concatenate _Outlier_ onto original raw data |
\*------------------------------------------------------------*/
options mergenoby=nowarn;
Data &dsn;
Merge &dsn
Utl_Grubbs01(Keep=_Outlier_);
/*------------------------------------------------------------*\
| No By Statement ( Data in Exactly the same order ) |
| By Variable not on Raw Input Table |
\*------------------------------------------------------------*/
If _Outlier_ = . then _Outlier_ = 0;
Run;
options mergenoby=warn;
Proc Print
Data=&dsn(where=(_Outlier_=1));
run;
%mend Utl_Grubbs;
data bweight;set sashelp.bweight(keep=weight);run;
%utl_Grubbs(work.bweight,weight,alpha=.05);
/* T1001580 Latest VooDoo oto_voodoo check out dataset
I run the macro oto_voodoo on any new data sent to me
see for more documentation
https://drive.google.com/file/d/0ByX2ii2B0Rq9Y003ZXZFT1pBV1U/view?usp=sharing
I run this on any new data sent to me
I don't think this will run under server EG, best with 'old text editor?'
WHAT THE MACRO DOES
1. Dataset level summary -- ie number of obs, variable types, static data
2. Cardinality page (primary keys, codes/decodes number of unique
values for every variable - EG fails here)
3. Complete frequency for all variables numeric and character with less
than 200 levels
4. For variables with over 200 levels top 100 most frequent and bottom
100 least frequent
Least frequent are the more interesting cases.
5. Proc means on all numeric variables
6. Proc univariate on all numeric variables
7. Special datetime analysis
9. Histograms on all variables with less than 200 levels
10. Proc contents
11. Frequency of all numeric variables Missing, negative, zero and positive
12. Duplicates on single or compound key. Output printed vertically for
easy comparison
13. Cross tabs of every variable with every other variable top 16 levels
(if selectd)
14. You can also select one variable to cross tab with all other variables
max top 16 levels
16. Maximum and minimum lengths to hold all numeric and character variables
exactly (optimize)
17. Correlation of all pairs of numeric variables sorted by largest
correlation to lowest.
18. Nice display of max and mins for numeric and character in one table
19. List of identical columns ie date and date1 have equal values on all
observations
19 One to Many, Many to One, One to Many and Many to Many
20 Cochran-Mantel-Haenszel Statistics
21 Finds missing patterns (missing pattern frequncies)
22 Printout of first 20, middle 20 and last 20 observations.
23. Correlation of all pairs of character variables (CMH)
.
/* for easy editing here are the locations macros un oto_voodoo
prefix area helps
%macro utlnopts 56
%macro _vdo_macnam 93
%macro utlfkil 109
%macro nobs 158
%macro nvar 199
%macro _vdo_cdedec 274
%macro _vv_annxtb 306
%macro _vdo_basic 418
%macro _vdo_optlen 3051
%macro _vdo_getmaxmin 3144
%macro _vdo_getmaxmin001 3169
%macro _vdo_begmidend 3237
%macro _vdo_clean 3331
%macro _vdo_chartx 3402
%macro _vdo_mispop 3606
%macro _vdo_keyunq 3653
%macro _vdo_dupcol 3734
%macro _vdo_cor 3819
%macro _vdo_mnymny 3882
%macro _vdo_relhow 3897
%macro _vdo_cmh 4020
%macro _vdo_tabone 4116
%macro _vdo_taball 4181
%macro _vdo_unqtwo 4261
%macro utl_getstm 4463
%macro DirExist 4476
%macro utlvdoc 4488
%macro _vdo_unichr 4495
I run the macro 'oto_voodoo' on any new data sent to me
see for more documentation
https://drive.google.com/file/d/0ByX2ii2B0Rq9Y003ZXZFT1pBV1U/view?usp=sharing
I run this on any new data sent to me
I don't think this will run under server EG, best with 'old text editor?'
Compute and memory intensive, runs ok on a power workstation
WHAT THE MACRO DOES
1. Dataset level summary -- ie number of obs, variable types, static data
2. Cardinality page (primary keys, codes/decodes number of unique
values for every variable - EG often fails here)
3. Complete frequency for all variables numeric and character with less
than 200 levels
4. For variables with over 200 levels top 100 most frequent and bottom
100 least frequent
Least frequent are the more interesting cases.
5. Proc means on all numeric variables
6. Proc univariate on all numeric variables
7. Special datetime analysis
9. Histograms on all variables with less than 200 levels
10. Proc contents
11. Frequency of all numeric variables Missing, negative, zero and positive
12. Duplicates on single or compound key. Output printed vertically for
easy comparison
13. Cross tabs of every variable with every other variable top 16 levels
(if selectd)
14. You can also select one variable to cross tab with all other variables
max top 16 levels
16. Maximum and minimum lengths to hold all numeric and character variables
exactly (optimize)
17. Correlation of all pairs of numeric variables sorted by largest
correlation to lowest.
18. Nice display of max and mins for numeric and character in one table
19. List of identical columns ie date and date1 have equal values on all
observations
19 One to Many, Many to One, One to Many and Many to Many
20 Cochran-Mantel-Haenszel Statistics
21 Finds missing patterns (missing pattern frequncies)
22 Printout of first 20, middle 20 and last 20 observations.
23. Correlation of all pairs of character variables (CMH)
.
1) proc robustreg 2) Check proc glm's OUTPUT statement , especially the COOKD H RESIDUAL .... keywords.
For more on outliers, winsorization, and trimmed means, see also:
I'm a novice user to statistics. Could you please clarify me with one simple example to remove outliers with proc robustreg and proc glm?
Thank you for any inputs you offer me.
For proc robustreg, check the example in documentation. For proc glm, check COOKD, if it is very large compared to other obs,then this obs should be outlier.
SAS is headed back to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team.
Interested in speaking? Content from our attendees is one of the reasons that makes SAS Innovate such a special event!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.