BookmarkSubscribeRSS Feed
Mscarboncopy
Pyrite | Level 9

I am wondering if I can use this code (please see below) for removing outliers from my data set. I am not familiar with macro at all. I tried this and it looks like it would work if I made modifications. Explaining the variables I used as a test:

 

 variable a: freq 6-41 (only outlier is  112)

var b freq is 3-47 (outliers are 0 135 340 and 90750)

The code below  removes everything that is >16 for var a (var a  freq is 6-16 in my outfile) and for var b it removes everything that is >47 and it actually also removes anything that is < 21 (my outfile shows freq 21-47 for var b)

Probably has to do with the P25 and P75 below? How can I change the code  to remove any outliers like 0 and anything that is higher than the max value for each var (which will vary considerably per var. I also have a lot more than only 2 vars).

 

For example if var a max value is 41 anything higher than that would be removed.  If the max for var b is 47 anything higher than that would be removed.

 

Var c and d,  if added to the code below, sas tells me it does not work:  variable c freq 26.4 - 104, it has a lot of values with decimal places like 26.4 but also numbers without decimals. I would removed anything higher than the 104 though but I need the values with decimals to stay.

 

Var d freq 11.7 - 400 but the decimals here are outliers...there should be no decimals in this var -  freq should be only in the 100's and I have one random 11.7 that would need to be removed. The code below does not "like" the values with decimals.

Is there a way to make changes in this code for what I need or should I do this in a completely different way? Thank you !

 

The code that I found and tried is below:


%macro outliers(input=, var=, output= );

%let Q1=;
%let Q3=;
%let varL=;
%let varH=;

%let n=%sysfunc(countw(&var));
%do i= 1 %to &n;
%let val = %scan(&var,&i);
%let Q1 = &Q1 &val._P25;
%let Q3 = &Q3 &val._P75;
%let varL = &varL &val.L;
%let varH = &varH &val.H;
%end;

/* Calculate the quartiles and inter-quartile range using proc univariate */
proc means data=&input nway noprint;
var &var;
output out=temp P25= P75= / autoname;
run;

/* Extract the upper and lower limits into macro variables */
data temp;
set temp;
ID = 1;
array varb(&n) &Q1;
array varc(&n) &Q3;
array lower(&n) &varL;
array upper(&n) &varH;
do i = 1 to dim(varb);
lower(i) = varb(i) - 3 * (varc(i) - varb(i));
upper(i) = varc(i) + 3 * (varc(i) - varb(i));
end;
drop i _type_ _freq_;
run;

data temp1;
set &input;
ID = 1;
run;

data &output;
merge temp1 temp;
by ID;
array var(&n) &var;
array lower(&n) &varL;
array upper(&n) &varH;
do i = 1 to dim(var);
if not missing(var(i)) then do;
if var(i) >= lower(i) and var(i) <= upper(i);
end;
end;
drop &Q1 &Q3 &varL &varH ID i;
run;
%mend;

%outliers(input=abfile, var= a  b, output= outabfile);

3 REPLIES 3
PaigeMiller
Diamond | Level 26

Var c and d,  if added to the code below, sas tells me it does not work:  variable c freq 26.4 - 104, it has a lot of values with decimal places like 26.4 but also numbers without decimals. I would removed anything higher than the 104 though but I need the values with decimals to stay.

Either I am misunderstanding, or this doesn't make sense. You want 105 to be removed, but 105.1 to not be removed?

 

Var d freq 11.7 - 400 but the decimals here are outliers...there should be no decimals in this var -  freq should be only in the 100's and I have one random 11.7 that would need to be removed. The code below does not "like" the values with decimals.

Again, I don't understand. And I don't know what you mean by "code below does not 'like' the values with decimals". The word "like" means nothing in a mathematical/statistical/programming sense.

 

We can't see what you are seeing, perhaps showing us some of the data and some of the results might help.

 

 

--
Paige Miller
maguiremq
SAS Super FREQ

I'm also confused because the macro says to use PROC UNIVARIATE, but proceeds to use PROC MEANS to calculate the 25th and 75th percentiles.

 

Is there a reason you're not just using PROC UNIVARIATE? It can identify these values for you, but I may be misunderstanding something completely.

 

Here are some resources:

Identifying Extreme Observations and Extreme Values 

Data cleaning and spotting outliers with UNIVARIATE 

Sajid01
Meteorite | Level 14

Hello @Mscarboncopy 
Please have a look at this paper https://www.lexjansen.com/phuse/2011/cc/CC01.pdf .
This may have a solution for you.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 698 views
  • 0 likes
  • 4 in conversation