## comparing observations

Solved
Occasional Contributor
Posts: 17

# comparing observations

I am trying to compare observations to each other so that I can find outliers in my dataset.

My dataset looks somewhat like the following:

var1     var2     var3      var4

obs1     1        2          3

obs2     3        4          5

obs3      4       5         6

llimit     -1       -2         -3

ulimit     5     6          7

Now my actual dataset has many more observations and many more variables.  I am wondering though if there is anyway that I can compare the numbers that fall under the variables to see if they are outliers and to output them by variable. Thanks for the help.

Accepted Solutions
Solution
‎07-19-2012 05:18 PM
Posts: 5,535

## Re: comparing observations

Not sure exactly what format you want, but this gives you a list of outliers (I modified your example data to include two outliers) :

data have;
input var1 \$ var2-var4;
datalines;
obs1     1        2          3
obs2     3        -4          5
obs3      6       5         6
llimit     -1       -2         -3
ulimit     5     6          7
;

proc transpose data=have out=havelist;
by var1 notsorted;
var _numeric_; /* or var2-var4, if you prefer hard coded names */
run;

proc sql;
create table want as
select a._name_ as Variable label="Variable", a.var1, a.col1 as value,
case when a.col1 > u.col1 then "High"
when a.col1 < l.col1 then "Low"
else "" end as warning
from haveList as a, havelist as u, havelist as l
where a._name_=u._name_ and a._name_=l._name_ and
u.var1="ulimit" and l.var1="llimit" and
calculated warning is not missing
order by variable, a.var1;
select * from want;
quit;

PG

PG

All Replies
Valued Guide
Posts: 765

## Re: comparing observations

hi ... how do you define "outlier"

#1  beyond +/- two standard deviations from the mean

or ...

#2  the John Tukey box plot idea ... beyond +/- 1.5 x inter-quartile range

or ...

#3  something else

Occasional Contributor
Posts: 17

## Re: comparing observations

John Tukey Box plot idea

Posts: 5,535

## Re: comparing observations

Are the outliers defined by the llimit and ulimit values for each variable? - PG

PG
Occasional Contributor
Posts: 17

## Re: comparing observations

Yes, they are defined by ulimit and llimit.

Contributor
Posts: 52

## Re: comparing observations

I would use PROC SQL to "self-join" the file. Here's one solution, for var2. If this is what you are after, a macro could be written to loop though the variables you want to compare.

proc sql;

create table lower_outliers_var2 as

select

a.var1,

a.var2

from

original_file as a,

(select var1, var2 from original_file where var1="llimit") as b

where

a.var1 not in ("llimit", "ulimit") and

a.var2 < b.var2;

create table upper_outliers_var2 as

select

a.var1,

a.var2

from

original_file as a,

(select var1, var2 from original_file where var1="ulimit") as b

where

a.var1 not in ("llimit", "ulimit") and

a.var2 > b.var2;

quit;

Solution
‎07-19-2012 05:18 PM
Posts: 5,535

## Re: comparing observations

Not sure exactly what format you want, but this gives you a list of outliers (I modified your example data to include two outliers) :

data have;
input var1 \$ var2-var4;
datalines;
obs1     1        2          3
obs2     3        -4          5
obs3      6       5         6
llimit     -1       -2         -3
ulimit     5     6          7
;

proc transpose data=have out=havelist;
by var1 notsorted;
var _numeric_; /* or var2-var4, if you prefer hard coded names */
run;

proc sql;
create table want as
select a._name_ as Variable label="Variable", a.var1, a.col1 as value,
case when a.col1 > u.col1 then "High"
when a.col1 < l.col1 then "Low"
else "" end as warning
from haveList as a, havelist as u, havelist as l
where a._name_=u._name_ and a._name_=l._name_ and
u.var1="ulimit" and l.var1="llimit" and
calculated warning is not missing
order by variable, a.var1;
select * from want;
quit;

PG

PG
Occasional Contributor
Posts: 17

## Re: comparing observations

I am not very familiar with proc sql, meaning basically that I have never used it.  I am sure that this will be the answer that will get me to where I am trying to go if maybe you could help me out a little bit.  I am not sure if this is what you were going for but I removed part of the proc transpose sequence because it was only giving me one column of data when there are about 100 in the actual data set. As far as using the "a.var1" " a.col1" int the proc sql sequence I am a little lost how I should convert them so that they work for me?

Posts: 5,535

## Re: comparing observations

Perhaps I could help you further if you suggested some format/organisation for the output you expect. What my code is producing should look like :

var1 obs3 6 High

var3 obs2 -4 Low

for my example data, which identifies which variable is Low/High for which observation. Low/High being defined by the llimit/ulimit values for each variable.

PG

PG
Occasional Contributor
Posts: 17

## Re: comparing observations

The idea that you have as far as output is concerned is fine, I am just trying to figure out how I can get it to work considering that I am using variables with names other than var1 - var4.  I am mainly confused about the section below:

```select a._name_ as Variable label="Variable", a.var1, a.col1 as value,
case when a.col1 > u.col1 then "High"
when a.col1 < l.col1 then "Low"
else "" end as warning

```

I am confused about what "a." is supposed to be and what "a.col1" is.

EDIT:

Ok so I think that I have figured out where my problem is, I am only getting 2 columns when I am supposed to be getting three.  I am assuming that is because I have already transposed it so that I could get the llimit and ulimit, therefore already having the variable "_name_".

Super User
Posts: 23,754

## Re: comparing observations

You need to back up a bit, you said you modified the proc transpose and that's what's causing a big part of the issue. You're only supposed to end up with one column after proc transpose.

The method proposed switches the format of the data to give you a list of outliers and their values. If your variables are beside each other in the table specify the variable list as first_variable -- last_variable instead of var1-var4

a. is called an alias and refers to the table from the following line:

from haveList as a, havelist as u, havelist as l

so a.col1 is col1 from table a, or havelist. It's a shorthand notation that keeps you from having to type out the full name.

Occasional Contributor
Posts: 17

## Re: comparing observations

Reeza you were right I figured that out awhile ago but I hadn't posted it yet, and I figured out that I had misspelled something which is why it wasn't working. Thank you for all of the help.  My last question is, is there anyway that I could get it to omit missing values?

Posts: 5,535

## Re: comparing observations

Missing values can be omitted early:

replace out=havelist with out=havelist(where=(col1 ne .)) in the transpose statement

or late :

add and a.col1 is not missing after calculated warning is not missing.

PG

PG
Super User
Posts: 8,115

## Re: comparing observations

How detailed do you want to get?

PROC UNIVARIATE with the ID parameter will print top and bottom ten values with the associated id.

proc univariate data=sashelp.class;

id name;

var height;

run;

The UNIVARIATE Procedure

Variable:  Height

Extreme Observations

----------Lowest---------        ---------Highest---------

Value   Name          Obs        Value   Name          Obs

51.3   Joyce          11         66.5   Mary           14

56.3   Louise         13         66.5   William        19

56.5   Alice           2         67.0   Ronald         17

57.3   James           6         69.0   Alfred          1

57.5   Thomas         18         72.0   Philip         15

Occasional Contributor
Posts: 17

## Re: comparing observations

I am trying to get specifically outliers.  But also with proc univariate you can put nextrobs = (some number) after data and it will give you however many you put there to get that many on your lowest and highest.  So if you put "nextrobs = 10" you will get the 10 highest and lowest..

🔒 This topic is solved and locked.

Discussion stats
• 14 replies
• 482 views
• 3 likes
• 6 in conversation