- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I have a large dataset and I want to remove outliers. Before i proceed further, I've done some research online and below is my code:
proc sql noprint;
create table ppussu_sum as
select substationname
, substationcode
, voltagetype
, substationtype
, datetime
, sum(power_calculated) as sum_power
from ncpdm.ncp_loadpf_&srcnm._&period
where (substationtype = 'PPU' and feedertype = 'TXF')
or (substationtype = 'SSU' and feedertype = 'OF')
group by substationname, substationcode, voltagetype, substationtype, datetime
;
quit;
proc univariate data = ppussu_sum;
var sum_power;
output out=boxStats median=median qrange = iqr;
run;
data _null_;
set boxStats;
call symput ('median',median);
call symput ('iqr', iqr);
run;
%put &median;
%put &iqr;
data trimmed;
set ppussu_sum;
if (sum_power le &median + 1.5 * &iqr) and (sum_power ge &median - 1.5 * &iqr);
run;
As you all can see, first i use proc univariate and then remove based on the formula. Just wondering if this is actually the correct way of doing it or is there another more accurate/appropriate way.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
What do you want to do with the data set without outliers? If you want compute some statistic or run some regression, the appropriate SAS procedure usually has options to robustly deal with outliers directly
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
without outliers, data wont have sudden spike when displaying in visual dashboard. This is the only reason why i want to remove it.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Note that there is an alternative approach ... capping the outliers instead of removing them. For example:
if (sum_power > &median + 1.5 * &iqr) then sum_power = &median + 1.5 * &iqr);