Hello,
My data is a time series with multiple variables by district (District | Date | Variable1).
I looked online to find a solution for dealing with outliers. One suggestion was to calculate the IQR (interquartile range, Q3 minus Q1), multiply it by 1.5, then add that amount to Q3 (upper limit) and subtract it from Q1 (lower limit). But I am not sure how to actually code it to produce an output data set without outliers.
I also found a suggestion to use the following code:
proc univariate data=" " robustscale plot;
var varname;
run;
I tried it, and it produces results like this (among other tables and graphs) for this variable.
Quantiles (Definition 5) | |
Level | Quantile |
100% Max | 1714.8982 |
99% | 300.1324 |
95% | 117.2804 |
90% | 75.9922 |
75% Q3 | 35.1522 |
50% Median | 13.0514 |
25% Q1 | 0.0000 |
10% | 0.0000 |
5% | 0.0000 |
1% | 0.0000 |
0% Min | 0.0000 |
Now that I have this information, how can I write code to remove or treat the outliers in my data set? I can't seem to find an example.
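For reference, the IQR rule described above can be sketched in a few steps. This is only a minimal sketch: the data set name HAVE and the variable name X are placeholders (assumptions, not names from the actual data), and the 1.5 multiplier is the one quoted in the question. With the quantiles shown above (Q1 = 0, Q3 = 35.1522), the IQR is 35.1522, so the upper limit would be 35.1522 + 1.5*35.1522 = 87.8805 and the lower limit would be -52.7283.

```sas
/* Step 1: get Q1 and Q3 for X (HAVE and X are placeholder names) */
proc means data=have noprint;
   var x;
   output out=stats (keep=q1 q3) q1=q1 q3=q3;
run;

/* Step 2: compute the fences and flag each observation */
data flagged;
   if _n_ = 1 then set stats;        /* Q1 and Q3 are retained for all rows */
   set have;
   iqr         = q3 - q1;
   lower_fence = q1 - 1.5*iqr;
   upper_fence = q3 + 1.5*iqr;
   /* missing X is never flagged */
   is_outlier  = (. < x < lower_fence) or (x > upper_fence);
   drop q1 q3 iqr;
run;

/* Step 3: keep only the observations inside the fences */
data want;
   set flagged;
   where is_outlier = 0;
run;
```

Keeping the flag in a separate data set makes it easy to look at what would be dropped before actually dropping it. Given the quantiles above, the 95th percentile (117.2804) is already past the upper limit, so this rule would flag somewhere between 5% and 10% of the observations.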
Many people on this board could do the programming. But you have to do the hard part. What makes a value an outlier? There are many ways to answer the question, and an answer is required before programming can begin.
The most flexible way I have used defines outliers as more than X standard deviations above the mean, or less than X standard deviations below the mean. "X" can actually be flexible and can be a parameter fed to a macro. But there are many other plausible definitions and it is up to you to pick one if you want help with the programming.
Suppose I define an outlier as a value more than 3 standard deviations above or below the mean. Could you help me with the programming for this definition?
Here is what I hope is working code (it's untested). It assumes X is the name of the variable you want to cap, and HAVE is the name of the data set that contains X.
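proc summary data=have;
   var x;
   /* one-row data set holding the overall mean and standard deviation of X */
   output out=stats (keep=mean std) mean=mean std=std;
run;
data want;
   /* read the statistics once; the limits are retained for every row of HAVE */
   if _n_=1 then do;
      set stats;
      upper_limit = mean + 3*std;
      lower_limit = mean - 3*std;
      retain upper_limit lower_limit;
   end;
   set have;
   /* cap values beyond the limits; a missing X stays missing */
   if x > upper_limit then capped_x = upper_limit;
   else if . < x < lower_limit then capped_x = lower_limit;
   else capped_x = x;
run;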
This will at least give you something to look at and consider. If you want to expand this to process many variables, there is a lot of work to be done. There is one variable MEAN and one variable STD. With many variables, you need many names to hold these statistics.
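One way to avoid juggling many statistic names is to wrap the same capping logic in a macro and call it once per variable. The sketch below is untested in the same spirit as the code above, and everything in it is an assumption for illustration: the macro name CAP_OUTLIERS, its parameters, the work data sets _SORTED and _STATS, and the choice to compute the mean and standard deviation separately within each district.

```sas
%macro cap_outliers(data=have, out=want, var=x, by=district, k=3);
   /* sort a copy so BY-group processing works */
   proc sort data=&data out=_sorted;
      by &by;
   run;

   /* one row of statistics per BY group */
   proc summary data=_sorted;
      by &by;
      var &var;
      output out=_stats (keep=&by _mean _std) mean=_mean std=_std;
   run;

   /* attach each group's statistics to its rows and cap at K std devs */
   data &out;
      merge _sorted _stats;
      by &by;
      _upper = _mean + &k*_std;
      _lower = _mean - &k*_std;
      if missing(_std) then capped_&var = &var;   /* group too small for a std */
      else if &var > _upper then capped_&var = _upper;
      else if . < &var < _lower then capped_&var = _lower;
      else capped_&var = &var;
      drop _mean _std _upper _lower;
   run;
%mend cap_outliers;

/* example call: cap Variable1 at 3 standard deviations within each district */
%cap_outliers(data=have, out=want, var=variable1, by=district, k=3);
```

To process a second variable, call the macro again with the previous output as the new input, for example DATA=WANT and OUT=WANT2. The K parameter plays the role of the "X standard deviations" mentioned earlier in the thread.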
You can't apply this outlier detection method to TIME SERIES data.
You need PROC ARIMA. Check its documentation, in particular Example 8.7: Iterative Outlier Detection:
/*-- Outlier Detection --*/
proc arima data=airline;
identify var=logair( 1, 12 ) noprint;
estimate q= (1)(12) noint method= ml;
outlier maxnum=3 alpha=0.01;
run;
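To adapt that example to data laid out as District | Date | Variable1, something along these lines might be a starting point. It is a rough sketch only: the data set name HAVE is a placeholder, and the model (first difference with a single MA term) is an arbitrary assumption chosen to keep the example short. The model has to be identified for your own series before the outlier results mean anything.

```sas
/* observations must be in time order within each district */
proc sort data=have out=have_sorted;
   by district date;
run;

/* fit the same (placeholder) model to each district and search for outliers */
proc arima data=have_sorted;
   by district;
   identify var=variable1(1) noprint;
   estimate q=1 noint method=ml;
   outlier maxnum=3 alpha=0.01;
run;
quit;
```

The OUTLIER statement reports the detected observations for each BY group; it does not remove or change them, so treating them is still a separate step.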