BookmarkSubscribeRSS Feed
pearson101
Calcite | Level 5

Hello,

             My data is a time series with multiple variables by districts. (District |date| Variable1)  

 

            I  looked online to fine a solution to dealing with outliers. I found one saying to calculate the IQR (interquantile range Q3 minus Q1) then multiply by 1.5 and adding the amount to Q3 and substracting that from Q1 (lower limit). But I am not sure how to actually code it to produce a output data without outliers. 

 

I also found one suggesting the following code:

proc univariate data=" " robustscale plot;
var  varname;
run; 

 

I tried it and produces results like this among other graphs for this variable.

Quantiles (Definition 5)

Level

Quantile

100% Max

1714.8982

99%

300.1324

95%

117.2804

90%

75.9922

75% Q3

35.1522

50% Median

13.0514

25% Q1

0.0000

10%

0.0000

5%

0.0000

1%

0.0000

0% Min

0.0000


Now that I have this information, how can I code for it to remove or treat outliers in my dataset with a code? I can't seem to find that code.

5 REPLIES 5
Astounding
PROC Star

Many people on this board could do the programming.  But you have to do the hard part.  What makes a value an outlier?  There are many ways to answer the question, and an answer is required before programming can begin.

 

The most flexible way I have used defines outliers as more than X standard deviations above the mean, or less than X standard deviations below the mean.  "X" can actually be flexible and can be a parameter fed to a macro.  But there are many other plausible definitions and it is up to you to pick one if you want help with the programming.

pearson101
Calcite | Level 5

If I want to define outlier as more or less than 3 standard deviation from the mean. Could you help me with the programming for this definition?

Astounding
PROC Star

Here is what I hope is working code (it's untested).  It assumes X is the name of the variable you want to cap, and HAVE is the name of the data set that contains X.

 

proc summary data=have;
   var x;
   output out=stats (keep=mean std) mean=mean std=std;
run;

data want;
   if _n_=1 then do;
      set stats;
      upper_limit = mean + 3*std;
      lower_limit = mean - 3*std;
      retain upper_limit lower_limit;
   end;
   set have;
   if x > upper_limit then capped_x = upper_limit;
   else if . < x < lower_limit then capped_x = lower_limit;
   else capped_x = x;
run;

This will at least give you something to look at and consider.  If you want to expand this to process many variables, there is a lot of work to be done.  There is one variable MEAN and one variable STD.  With many variables, you need many names to hold these statistics.

Ksharp
Super User

You can't apply this outlier detecting way on a TIME SERIES data.

You need PROC ARIMMA .Check its documentation and its Example 8.7: Iterative Outlier Detection :

 

 

/*-- Outlier Detection --*/
proc arima data=airline;
identify var=logair( 1, 12 ) noprint;
estimate q= (1)(12) noint method= ml;
outlier maxnum=3 alpha=0.01;
run;
SurajSaini
Obsidian | Level 7
Here is SAS Macro that I created for outlier detection as per my requirements, more information visit here: https://seleritysas.com/blog/2020/12/10/sas-custom-macros-that-make-feature-engineering-easy-for-dat...
--------------------------------------- Macro Definition----------------------------------------------------------
%macro outliers(dat,var);
options nonotes;
proc univariate data=&dat normal noprint;
var &var;
output out=ttest normaltest=Test probn=P_Value;
run;
Data _Null_;
set ttest;
%if P_value > 0.05 %Then %do;
option notes;
%put NOTE: &var is normally distributed hence it select STD method to find Outliers.;
%put NOTE: You can check statistics and pvalue in work.ttest table;
options nonotes;
Proc SQL noprint;
Select Mean(&var)
into: me
from &dat;
select std(&var)
into:sd
from &dat;
quit;
run;
Data outlier;
set &dat;
%Let Min_cutoff= %sysevalf(&me - (3* &sd));
%Let Max_cutoff= %sysevalf(&me + (3* &sd));
where &var < &Min_cutoff or &var > &Max_cutoff;
run;
%end;
%else %do;
options notes;
%put NOTE: &var is not normally distributed hence it select percentile method to find Outliers.;
%put NOTE: You can check statistics and pvalue in work.ttest table & percentile values in work.ranges table;
options nonotes;
proc means data=&dat stackods n qrange p1 p99 ;
var &var;
ods output summary=ranges;
run;
proc sql noprint;
select P1 into:Min
from ranges;
select P99 into : Max
from Ranges;
quit;
run;
Data outliers;
set &dat;
Where &var < &Min or &var > &Max;
run;
%end;
options notes;
%mend;

------------------------------------------- Macro Testing ---------------------------------------------------------
options nomprint nomlogic nosymbolgen;
%outliers(Lib.dataset_name, Variable_Name)


SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 2162 views
  • 1 like
  • 4 in conversation