Hi,
I have a large dataset and I want to remove outliers. Before i proceed further, I've done some research online and below is my code:
proc sql noprint;
create table ppussu_sum as
select substationname
, substationcode
, voltagetype
, substationtype
, datetime
, sum(power_calculated) as sum_power
from ncpdm.ncp_loadpf_&srcnm._&period
where (substationtype = 'PPU' and feedertype = 'TXF')
or (substationtype = 'SSU' and feedertype = 'OF')
group by substationname, substationcode, voltagetype, substationtype, datetime
;
quit;
proc univariate data = ppussu_sum;
var sum_power;
output out=boxStats median=median qrange = iqr;
run;
data _null_;
set boxStats;
call symput ('median',median);
call symput ('iqr', iqr);
run;
%put &median;
%put &iqr;
data trimmed;
set ppussu_sum;
if (sum_power le &median + 1.5 * &iqr) and (sum_power ge &median - 1.5 * &iqr);
run;
As you all can see, first i use proc univariate and then remove based on the formula. Just wondering if this is actually the correct way of doing it or is there another more accurate/appropriate way.
What do you want to do with the data set without outliers? If you want compute some statistic or run some regression, the appropriate SAS procedure usually has options to robustly deal with outliers directly
without outliers, data wont have sudden spike when displaying in visual dashboard. This is the only reason why i want to remove it.
Note that there is an alternative approach ... capping the outliers instead of removing them. For example:
if (sum_power > &median + 1.5 * &iqr) then sum_power = &median + 1.5 * &iqr);
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.