BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Hawkeye
Calcite | Level 5

I am trying to compare observations to each other so that I can find outliers in my dataset.

My dataset looks somewhat like the following:

var1     var2     var3      var4

obs1     1        2          3

obs2     3        4          5

obs3      4       5         6

llimit     -1       -2         -3

ulimit     5     6          7

Now my actual dataset has many more observations and many more variables.  I am wondering though if there is anyway that I can compare the numbers that fall under the variables to see if they are outliers and to output them by variable. Thanks for the help.

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

Not sure exactly what format you want, but this gives you a list of outliers (I modified your example data to include two outliers) :

data have;
input var1 $ var2-var4;
datalines;
obs1     1        2          3
obs2     3        -4          5
obs3      6       5         6
llimit     -1       -2         -3
ulimit     5     6          7
;

proc transpose data=have out=havelist;
by var1 notsorted;
var _numeric_; /* or var2-var4, if you prefer hard coded names */
run;

proc sql;
create table want as
select a._name_ as Variable label="Variable", a.var1, a.col1 as value,
     case when a.col1 > u.col1 then "High"
          when a.col1 < l.col1 then "Low"
          else "" end as warning
from haveList as a, havelist as u, havelist as l
where a._name_=u._name_ and a._name_=l._name_ and
     u.var1="ulimit" and l.var1="llimit" and
     calculated warning is not missing
order by variable, a.var1;
select * from want;
quit;

PG

PG

View solution in original post

14 REPLIES 14
MikeZdeb
Rhodochrosite | Level 12

hi ... how do you define "outlier"

#1  beyond +/- two standard deviations from the mean

or ...

#2  the John Tukey box plot idea ... beyond +/- 1.5 x inter-quartile range

or ...

#3  something else

Hawkeye
Calcite | Level 5

John Tukey Box plot idea

PGStats
Opal | Level 21

Are the outliers defined by the llimit and ulimit values for each variable? - PG

PG
Hawkeye
Calcite | Level 5

Yes, they are defined by ulimit and llimit.

tish
Calcite | Level 5

I would use PROC SQL to "self-join" the file. Here's one solution, for var2. If this is what you are after, a macro could be written to loop though the variables you want to compare.

proc sql;

   create table lower_outliers_var2 as

      select

         a.var1,

         a.var2

      from

         original_file as a,

         (select var1, var2 from original_file where var1="llimit") as b

      where

         a.var1 not in ("llimit", "ulimit") and

         a.var2 < b.var2;

        

   create table upper_outliers_var2 as

      select

         a.var1,

         a.var2

      from

         original_file as a,

         (select var1, var2 from original_file where var1="ulimit") as b

      where

         a.var1 not in ("llimit", "ulimit") and

         a.var2 > b.var2;        

quit;

PGStats
Opal | Level 21

Not sure exactly what format you want, but this gives you a list of outliers (I modified your example data to include two outliers) :

data have;
input var1 $ var2-var4;
datalines;
obs1     1        2          3
obs2     3        -4          5
obs3      6       5         6
llimit     -1       -2         -3
ulimit     5     6          7
;

proc transpose data=have out=havelist;
by var1 notsorted;
var _numeric_; /* or var2-var4, if you prefer hard coded names */
run;

proc sql;
create table want as
select a._name_ as Variable label="Variable", a.var1, a.col1 as value,
     case when a.col1 > u.col1 then "High"
          when a.col1 < l.col1 then "Low"
          else "" end as warning
from haveList as a, havelist as u, havelist as l
where a._name_=u._name_ and a._name_=l._name_ and
     u.var1="ulimit" and l.var1="llimit" and
     calculated warning is not missing
order by variable, a.var1;
select * from want;
quit;

PG

PG
Hawkeye
Calcite | Level 5

I am not very familiar with proc sql, meaning basically that I have never used it.  I am sure that this will be the answer that will get me to where I am trying to go if maybe you could help me out a little bit.  I am not sure if this is what you were going for but I removed part of the proc transpose sequence because it was only giving me one column of data when there are about 100 in the actual data set. As far as using the "a.var1" " a.col1" int the proc sql sequence I am a little lost how I should convert them so that they work for me?

PGStats
Opal | Level 21

Perhaps I could help you further if you suggested some format/organisation for the output you expect. What my code is producing should look like :

var1 obs3 6 High

var3 obs2 -4 Low

for my example data, which identifies which variable is Low/High for which observation. Low/High being defined by the llimit/ulimit values for each variable.

 

PG


PG
Hawkeye
Calcite | Level 5

The idea that you have as far as output is concerned is fine, I am just trying to figure out how I can get it to work considering that I am using variables with names other than var1 - var4.  I am mainly confused about the section below:

select a._name_ as Variable label="Variable", a.var1, a.col1 as value,

     case when a.col1 > u.col1 then "High"

          when a.col1 < l.col1 then "Low"

          else "" end as warning

I am confused about what "a." is supposed to be and what "a.col1" is.

EDIT:

Ok so I think that I have figured out where my problem is, I am only getting 2 columns when I am supposed to be getting three.  I am assuming that is because I have already transposed it so that I could get the llimit and ulimit, therefore already having the variable "_name_". 

Reeza
Super User

You need to back up a bit, you said you modified the proc transpose and that's what's causing a big part of the issue. You're only supposed to end up with one column after proc transpose.

The method proposed switches the format of the data to give you a list of outliers and their values. If your variables are beside each other in the table specify the variable list as first_variable -- last_variable instead of var1-var4

a. is called an alias and refers to the table from the following line:

from haveList as a, havelist as u, havelist as l


so a.col1 is col1 from table a, or havelist. It's a shorthand notation that keeps you from having to type out the full name.




Hawkeye
Calcite | Level 5

Reeza you were right I figured that out awhile ago but I hadn't posted it yet, and I figured out that I had misspelled something which is why it wasn't working. Thank you for all of the help.  My last question is, is there anyway that I could get it to omit missing values?

PGStats
Opal | Level 21

Missing values can be omitted early:

replace out=havelist with out=havelist(where=(col1 ne .)) in the transpose statement

or late :

add and a.col1 is not missing after calculated warning is not missing.

PG

PG
Tom
Super User Tom
Super User

How detailed do you want to get?

PROC UNIVARIATE with the ID parameter will print top and bottom ten values with the associated id.

proc univariate data=sashelp.class;

  id name;

  var height;

run;

The UNIVARIATE Procedure

Variable:  Height

                   Extreme Observations

----------Lowest---------        ---------Highest---------

Value   Name          Obs        Value   Name          Obs

51.3   Joyce          11         66.5   Mary           14

56.3   Louise         13         66.5   William        19

56.5   Alice           2         67.0   Ronald         17

57.3   James           6         69.0   Alfred          1

57.5   Thomas         18         72.0   Philip         15

Hawkeye
Calcite | Level 5

I am trying to get specifically outliers.  But also with proc univariate you can put nextrobs = (some number) after data and it will give you however many you put there to get that many on your lowest and highest.  So if you put "nextrobs = 10" you will get the 10 highest and lowest..

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 1498 views
  • 3 likes
  • 6 in conversation