Solved: Re: comparing observations

Hawkeye · Posted 07-19-2012 01:33 PM

I am trying to compare observations to each other so that I can find outliers in my dataset.

My dataset looks somewhat like the following:

var1 var2 var3 var4

obs1 1 2 3

obs2 3 4 5

obs3 4 5 6

llimit -1 -2 -3

ulimit 5 6 7

Now my actual dataset has many more observations and many more variables. I am wondering though if there is anyway that I can compare the numbers that fall under the variables to see if they are outliers and to output them by variable. Thanks for the help.

PGStats · Posted 07-19-2012 05:18 PM

Not sure exactly what format you want, but this gives you a list of outliers (I modified your example data to include two outliers) :

data have;
input var1 $ var2-var4;
datalines;
obs1     1        2          3
obs2     3        -4          5
obs3      6       5         6
llimit     -1       -2         -3
ulimit     5     6          7
;

proc transpose data=have out=havelist;
by var1 notsorted;
var _numeric_; /* or var2-var4, if you prefer hard coded names */
run;

proc sql;
create table want as
select a._name_ as Variable label="Variable", a.var1, a.col1 as value,
     case when a.col1 > u.col1 then "High"
          when a.col1 < l.col1 then "Low"
          else "" end as warning
from haveList as a, havelist as u, havelist as l
where a._name_=u._name_ and a._name_=l._name_ and
     u.var1="ulimit" and l.var1="llimit" and
     calculated warning is not missing
order by variable, a.var1;
select * from want;
quit;

PG

View solution in original post

MikeZdeb · Posted 07-19-2012 02:46 PM

hi ... how do you define "outlier"

#1 beyond +/- two standard deviations from the mean

or ...

#2 the John Tukey box plot idea ... beyond +/- 1.5 x inter-quartile range

or ...

#3 something else

Hawkeye · Posted 07-19-2012 04:47 PM

John Tukey Box plot idea

PGStats · Posted 07-19-2012 04:13 PM

Are the outliers defined by the llimit and ulimit values for each variable? - PG

PG

Hawkeye · Posted 07-19-2012 04:48 PM

Yes, they are defined by ulimit and llimit.

tish · Posted 07-19-2012 04:26 PM

I would use PROC SQL to "self-join" the file. Here's one solution, for var2. If this is what you are after, a macro could be written to loop though the variables you want to compare.

proc sql;

create table lower_outliers_var2 as

select

a.var1,

a.var2

from

original_file as a,

(select var1, var2 from original_file where var1="llimit") as b

where

a.var1 not in ("llimit", "ulimit") and

a.var2 < b.var2;

create table upper_outliers_var2 as

select

a.var1,

a.var2

from

original_file as a,

(select var1, var2 from original_file where var1="ulimit") as b

where

a.var1 not in ("llimit", "ulimit") and

a.var2 > b.var2;

quit;

PGStats · Posted 07-19-2012 05:18 PM

Not sure exactly what format you want, but this gives you a list of outliers (I modified your example data to include two outliers) :

data have;
input var1 $ var2-var4;
datalines;
obs1     1        2          3
obs2     3        -4          5
obs3      6       5         6
llimit     -1       -2         -3
ulimit     5     6          7
;

proc transpose data=have out=havelist;
by var1 notsorted;
var _numeric_; /* or var2-var4, if you prefer hard coded names */
run;

proc sql;
create table want as
select a._name_ as Variable label="Variable", a.var1, a.col1 as value,
     case when a.col1 > u.col1 then "High"
          when a.col1 < l.col1 then "Low"
          else "" end as warning
from haveList as a, havelist as u, havelist as l
where a._name_=u._name_ and a._name_=l._name_ and
     u.var1="ulimit" and l.var1="llimit" and
     calculated warning is not missing
order by variable, a.var1;
select * from want;
quit;

PG

Hawkeye · Posted 07-23-2012 03:02 PM

I am not very familiar with proc sql, meaning basically that I have never used it. I am sure that this will be the answer that will get me to where I am trying to go if maybe you could help me out a little bit. I am not sure if this is what you were going for but I removed part of the proc transpose sequence because it was only giving me one column of data when there are about 100 in the actual data set. As far as using the "a.var1" " a.col1" int the proc sql sequence I am a little lost how I should convert them so that they work for me?

PGStats · Posted 07-23-2012 03:54 PM

Perhaps I could help you further if you suggested some format/organisation for the output you expect. What my code is producing should look like :

var1 obs3 6 High

var3 obs2 -4 Low

for my example data, which identifies which variable is Low/High for which observation. Low/High being defined by the llimit/ulimit values for each variable.

PG

Hawkeye · Posted 07-23-2012 04:04 PM

The idea that you have as far as output is concerned is fine, I am just trying to figure out how I can get it to work considering that I am using variables with names other than var1 - var4. I am mainly confused about the section below:

select a._name_ as Variable label="Variable", a.var1, a.col1 as value,

case when a.col1 > u.col1 then "High"

when a.col1 < l.col1 then "Low"

else "" end as warning

I am confused about what "a." is supposed to be and what "a.col1" is.

EDIT:

Ok so I think that I have figured out where my problem is, I am only getting 2 columns when I am supposed to be getting three. I am assuming that is because I have already transposed it so that I could get the llimit and ulimit, therefore already having the variable "_name_".

Reeza · Posted 07-23-2012 04:19 PM

You need to back up a bit, you said you modified the proc transpose and that's what's causing a big part of the issue. You're only supposed to end up with one column after proc transpose.

The method proposed switches the format of the data to give you a list of outliers and their values. If your variables are beside each other in the table specify the variable list as first_variable -- last_variable instead of var1-var4

a. is called an alias and refers to the table from the following line:

from haveList as a, havelist as u, havelist as l

so a.col1 is col1 from table a, or havelist. It's a shorthand notation that keeps you from having to type out the full name.

Hawkeye · Posted 07-23-2012 04:31 PM

Reeza you were right I figured that out awhile ago but I hadn't posted it yet, and I figured out that I had misspelled something which is why it wasn't working. Thank you for all of the help. My last question is, is there anyway that I could get it to omit missing values?

PGStats · Posted 07-23-2012 04:42 PM

Missing values can be omitted early:

replace out=havelist with out=havelist(where=(col1 ne .)) in the transpose statement

or late :

add and a.col1 is not missing after calculated warning is not missing.

PG

Tom · Posted 07-23-2012 03:09 PM

How detailed do you want to get?

PROC UNIVARIATE with the ID parameter will print top and bottom ten values with the associated id.

proc univariate data=sashelp.class;

id name;

var height;

run;

The UNIVARIATE Procedure

Variable: Height

Extreme Observations

----------Lowest--------- ---------Highest---------

Value Name Obs Value Name Obs

51.3 Joyce 11 66.5 Mary 14

56.3 Louise 13 66.5 William 19

56.5 Alice 2 67.0 Ronald 17

57.3 James 6 69.0 Alfred 1

57.5 Thomas 18 72.0 Philip 15

Hawkeye · Posted 07-23-2012 03:17 PM

I am trying to get specifically outliers. But also with proc univariate you can put nextrobs = (some number) after data and it will give you however many you put there to get that many on your lowest and highest. So if you put "nextrobs = 10" you will get the 10 highest and lowest..

Registration is open

SAS Training: Just a Click Away