I am trying to compare observations to each other so that I can find outliers in my dataset.
My dataset looks somewhat like the following:
var1 var2 var3 var4
obs1 1 2 3
obs2 3 4 5
obs3 4 5 6
llimit -1 -2 -3
ulimit 5 6 7
Now my actual dataset has many more observations and many more variables. I am wondering though if there is anyway that I can compare the numbers that fall under the variables to see if they are outliers and to output them by variable. Thanks for the help.
Not sure exactly what format you want, but this gives you a list of outliers (I modified your example data to include two outliers) :
data have;
input var1 $ var2-var4;
datalines;
obs1 1 2 3
obs2 3 -4 5
obs3 6 5 6
llimit -1 -2 -3
ulimit 5 6 7
;
proc transpose data=have out=havelist;
by var1 notsorted;
var _numeric_; /* or var2-var4, if you prefer hard coded names */
run;
proc sql;
create table want as
select a._name_ as Variable label="Variable", a.var1, a.col1 as value,
case when a.col1 > u.col1 then "High"
when a.col1 < l.col1 then "Low"
else "" end as warning
from haveList as a, havelist as u, havelist as l
where a._name_=u._name_ and a._name_=l._name_ and
u.var1="ulimit" and l.var1="llimit" and
calculated warning is not missing
order by variable, a.var1;
select * from want;
quit;
PG
hi ... how do you define "outlier"
#1 beyond +/- two standard deviations from the mean
or ...
#2 the John Tukey box plot idea ... beyond +/- 1.5 x inter-quartile range
or ...
#3 something else
John Tukey Box plot idea
Are the outliers defined by the llimit and ulimit values for each variable? - PG
Yes, they are defined by ulimit and llimit.
I would use PROC SQL to "self-join" the file. Here's one solution, for var2. If this is what you are after, a macro could be written to loop though the variables you want to compare.
proc sql;
create table lower_outliers_var2 as
select
a.var1,
a.var2
from
original_file as a,
(select var1, var2 from original_file where var1="llimit") as b
where
a.var1 not in ("llimit", "ulimit") and
a.var2 < b.var2;
create table upper_outliers_var2 as
select
a.var1,
a.var2
from
original_file as a,
(select var1, var2 from original_file where var1="ulimit") as b
where
a.var1 not in ("llimit", "ulimit") and
a.var2 > b.var2;
quit;
Not sure exactly what format you want, but this gives you a list of outliers (I modified your example data to include two outliers) :
data have;
input var1 $ var2-var4;
datalines;
obs1 1 2 3
obs2 3 -4 5
obs3 6 5 6
llimit -1 -2 -3
ulimit 5 6 7
;
proc transpose data=have out=havelist;
by var1 notsorted;
var _numeric_; /* or var2-var4, if you prefer hard coded names */
run;
proc sql;
create table want as
select a._name_ as Variable label="Variable", a.var1, a.col1 as value,
case when a.col1 > u.col1 then "High"
when a.col1 < l.col1 then "Low"
else "" end as warning
from haveList as a, havelist as u, havelist as l
where a._name_=u._name_ and a._name_=l._name_ and
u.var1="ulimit" and l.var1="llimit" and
calculated warning is not missing
order by variable, a.var1;
select * from want;
quit;
PG
I am not very familiar with proc sql, meaning basically that I have never used it. I am sure that this will be the answer that will get me to where I am trying to go if maybe you could help me out a little bit. I am not sure if this is what you were going for but I removed part of the proc transpose sequence because it was only giving me one column of data when there are about 100 in the actual data set. As far as using the "a.var1" " a.col1" int the proc sql sequence I am a little lost how I should convert them so that they work for me?
Perhaps I could help you further if you suggested some format/organisation for the output you expect. What my code is producing should look like :
var1 obs3 6 High
var3 obs2 -4 Low
for my example data, which identifies which variable is Low/High for which observation. Low/High being defined by the llimit/ulimit values for each variable.
PG
The idea that you have as far as output is concerned is fine, I am just trying to figure out how I can get it to work considering that I am using variables with names other than var1 - var4. I am mainly confused about the section below:
select a._name_ as Variable label="Variable", a.var1, a.col1 as value,
case when a.col1 > u.col1 then "High"
when a.col1 < l.col1 then "Low"
else "" end as warning
I am confused about what "a." is supposed to be and what "a.col1" is.
EDIT:
Ok so I think that I have figured out where my problem is, I am only getting 2 columns when I am supposed to be getting three. I am assuming that is because I have already transposed it so that I could get the llimit and ulimit, therefore already having the variable "_name_".
You need to back up a bit, you said you modified the proc transpose and that's what's causing a big part of the issue. You're only supposed to end up with one column after proc transpose.
The method proposed switches the format of the data to give you a list of outliers and their values. If your variables are beside each other in the table specify the variable list as first_variable -- last_variable instead of var1-var4
a. is called an alias and refers to the table from the following line:
from haveList as a, havelist as u, havelist as l
so a.col1 is col1 from table a, or havelist. It's a shorthand notation that keeps you from having to type out the full name.
Reeza you were right I figured that out awhile ago but I hadn't posted it yet, and I figured out that I had misspelled something which is why it wasn't working. Thank you for all of the help. My last question is, is there anyway that I could get it to omit missing values?
Missing values can be omitted early:
replace out=havelist with out=havelist(where=(col1 ne .)) in the transpose statement
or late :
add and a.col1 is not missing after calculated warning is not missing.
PG
How detailed do you want to get?
PROC UNIVARIATE with the ID parameter will print top and bottom ten values with the associated id.
proc univariate data=sashelp.class;
id name;
var height;
run;
The UNIVARIATE Procedure
Variable: Height
Extreme Observations
----------Lowest--------- ---------Highest---------
Value Name Obs Value Name Obs
51.3 Joyce 11 66.5 Mary 14
56.3 Louise 13 66.5 William 19
56.5 Alice 2 67.0 Ronald 17
57.3 James 6 69.0 Alfred 1
57.5 Thomas 18 72.0 Philip 15
I am trying to get specifically outliers. But also with proc univariate you can put nextrobs = (some number) after data and it will give you however many you put there to get that many on your lowest and highest. So if you put "nextrobs = 10" you will get the 10 highest and lowest..
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.