- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I am wondering if I can use this code (please see below) for removing outliers from my data set. I am not familiar with macro at all. I tried this and it looks like it would work if I made modifications. Explaining the variables I used as a test:
variable a: freq 6-41 (only outlier is 112)
var b freq is 3-47 (outliers are 0 135 340 and 90750)
The code below removes everything that is >16 for var a (var a freq is 6-16 in my outfile) and for var b it removes everything that is >47 and it actually also removes anything that is < 21 (my outfile shows freq 21-47 for var b)
Probably has to do with the P25 and P75 below? How can I change the code to remove any outliers like 0 and anything that is higher than the max value for each var (which will vary considerably per var. I also have a lot more than only 2 vars).
For example if var a max value is 41 anything higher than that would be removed. If the max for var b is 47 anything higher than that would be removed.
Var c and d, if added to the code below, sas tells me it does not work: variable c freq 26.4 - 104, it has a lot of values with decimal places like 26.4 but also numbers without decimals. I would removed anything higher than the 104 though but I need the values with decimals to stay.
Var d freq 11.7 - 400 but the decimals here are outliers...there should be no decimals in this var - freq should be only in the 100's and I have one random 11.7 that would need to be removed. The code below does not "like" the values with decimals.
Is there a way to make changes in this code for what I need or should I do this in a completely different way? Thank you !
The code that I found and tried is below:
%macro outliers(input=, var=, output= );
%let Q1=;
%let Q3=;
%let varL=;
%let varH=;
%let n=%sysfunc(countw(&var));
%do i= 1 %to &n;
%let val = %scan(&var,&i);
%let Q1 = &Q1 &val._P25;
%let Q3 = &Q3 &val._P75;
%let varL = &varL &val.L;
%let varH = &varH &val.H;
%end;
/* Calculate the quartiles and inter-quartile range using proc univariate */
proc means data=&input nway noprint;
var &var;
output out=temp P25= P75= / autoname;
run;
/* Extract the upper and lower limits into macro variables */
data temp;
set temp;
ID = 1;
array varb(&n) &Q1;
array varc(&n) &Q3;
array lower(&n) &varL;
array upper(&n) &varH;
do i = 1 to dim(varb);
lower(i) = varb(i) - 3 * (varc(i) - varb(i));
upper(i) = varc(i) + 3 * (varc(i) - varb(i));
end;
drop i _type_ _freq_;
run;
data temp1;
set &input;
ID = 1;
run;
data &output;
merge temp1 temp;
by ID;
array var(&n) &var;
array lower(&n) &varL;
array upper(&n) &varH;
do i = 1 to dim(var);
if not missing(var(i)) then do;
if var(i) >= lower(i) and var(i) <= upper(i);
end;
end;
drop &Q1 &Q3 &varL &varH ID i;
run;
%mend;
%outliers(input=abfile, var= a b, output= outabfile);
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Var c and d, if added to the code below, sas tells me it does not work: variable c freq 26.4 - 104, it has a lot of values with decimal places like 26.4 but also numbers without decimals. I would removed anything higher than the 104 though but I need the values with decimals to stay.
Either I am misunderstanding, or this doesn't make sense. You want 105 to be removed, but 105.1 to not be removed?
Var d freq 11.7 - 400 but the decimals here are outliers...there should be no decimals in this var - freq should be only in the 100's and I have one random 11.7 that would need to be removed. The code below does not "like" the values with decimals.
Again, I don't understand. And I don't know what you mean by "code below does not 'like' the values with decimals". The word "like" means nothing in a mathematical/statistical/programming sense.
We can't see what you are seeing, perhaps showing us some of the data and some of the results might help.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I'm also confused because the macro says to use PROC UNIVARIATE, but proceeds to use PROC MEANS to calculate the 25th and 75th percentiles.
Is there a reason you're not just using PROC UNIVARIATE? It can identify these values for you, but I may be misunderstanding something completely.
Here are some resources:
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello @Mscarboncopy
Please have a look at this paper https://www.lexjansen.com/phuse/2011/cc/CC01.pdf .
This may have a solution for you.