removing outliers

Mscarboncopy · Posted 09-09-2021 08:31 AM

I am wondering if I can use this code (please see below) for removing outliers from my data set. I am not familiar with macro at all. I tried this and it looks like it would work if I made modifications. Explaining the variables I used as a test:

variable a: freq 6-41 (only outlier is 112)

var b freq is 3-47 (outliers are 0 135 340 and 90750)

The code below removes everything that is >16 for var a (var a freq is 6-16 in my outfile) and for var b it removes everything that is >47 and it actually also removes anything that is < 21 (my outfile shows freq 21-47 for var b)

Probably has to do with the P25 and P75 below? How can I change the code to remove any outliers like 0 and anything that is higher than the max value for each var (which will vary considerably per var. I also have a lot more than only 2 vars).

For example if var a max value is 41 anything higher than that would be removed. If the max for var b is 47 anything higher than that would be removed.

Var c and d, if added to the code below, sas tells me it does not work: variable c freq 26.4 - 104, it has a lot of values with decimal places like 26.4 but also numbers without decimals. I would removed anything higher than the 104 though but I need the values with decimals to stay.

Var d freq 11.7 - 400 but the decimals here are outliers...there should be no decimals in this var - freq should be only in the 100's and I have one random 11.7 that would need to be removed. The code below does not "like" the values with decimals.

Is there a way to make changes in this code for what I need or should I do this in a completely different way? Thank you !

The code that I found and tried is below:

%macro outliers(input=, var=, output= );

%let Q1=;
%let Q3=;
%let varL=;
%let varH=;

%let n=%sysfunc(countw(&var));
%do i= 1 %to &n;
%let val = %scan(&var,&i);
%let Q1 = &Q1 &val._P25;
%let Q3 = &Q3 &val._P75;
%let varL = &varL &val.L;
%let varH = &varH &val.H;
%end;

/* Calculate the quartiles and inter-quartile range using proc univariate */
proc means data=&input nway noprint;
var &var;
output out=temp P25= P75= / autoname;
run;

/* Extract the upper and lower limits into macro variables */
data temp;
set temp;
ID = 1;
array varb(&n) &Q1;
array varc(&n) &Q3;
array lower(&n) &varL;
array upper(&n) &varH;
do i = 1 to dim(varb);
lower(i) = varb(i) - 3 * (varc(i) - varb(i));
upper(i) = varc(i) + 3 * (varc(i) - varb(i));
end;
drop i _type_ _freq_;
run;

data temp1;
set &input;
ID = 1;
run;

data &output;
merge temp1 temp;
by ID;
array var(&n) &var;
array lower(&n) &varL;
array upper(&n) &varH;
do i = 1 to dim(var);
if not missing(var(i)) then do;
if var(i) >= lower(i) and var(i) <= upper(i);
end;
end;
drop &Q1 &Q3 &varL &varH ID i;
run;
%mend;

%outliers(input=abfile, var= a b, output= outabfile);

PaigeMiller · Posted 09-09-2021 08:39 AM

Var c and d, if added to the code below, sas tells me it does not work: variable c freq 26.4 - 104, it has a lot of values with decimal places like 26.4 but also numbers without decimals. I would removed anything higher than the 104 though but I need the values with decimals to stay.

Either I am misunderstanding, or this doesn't make sense. You want 105 to be removed, but 105.1 to not be removed?

Var d freq 11.7 - 400 but the decimals here are outliers...there should be no decimals in this var - freq should be only in the 100's and I have one random 11.7 that would need to be removed. The code below does not "like" the values with decimals.

Again, I don't understand. And I don't know what you mean by "code below does not 'like' the values with decimals". The word "like" means nothing in a mathematical/statistical/programming sense.

We can't see what you are seeing, perhaps showing us some of the data and some of the results might help.

--
Paige Miller