topic Re: Getting rid of outlier data in SAS Enterprise Guide

Getting rid of outlier data

abyss — Tue, 06 Mar 2007 15:33:05 GMT

Let's say I have a dataset with 10,000 observations consisting of two variables. "Occupation" and "income". There are 5 different occupations in the dataset.

I would like to get rid of highest 10% of observed income levels for each occupation.

Can anybody help me on how to crack this with SAS or send me a simple code?

THanks in advance

Re: Getting rid of outlier data

Olivier — Wed, 07 Mar 2007 15:47:45 GMT

First count how many observations you have for each occupation, then read data skipping the last observations.

Anyway I'm not sure this is the most statistically-reliable way to filter outlier ; you'd better consider the values distribution to find a "gap" where to put a cutoff for each occupation, rather than deciding that some x % of data is "false". But this would require a lot more time than running something like :

PROC SQL ;
CREATE TABLE work.incomes AS
SELECT *,
COUNT(*) AS obsNb
FROM yourDataSet
GROUP BY occupation
ORDER BY occupation, income
;
QUIT ;
DATA work.incomes ;
SET work.incomes ;
BY occupation ;
IF FIRST.occupation THEN currNb = 0 ;
currNb + 1 ;
IF currNb <= obsNb*.99 THEN OUTPUT ;
RUN ;

Re: Getting rid of outlier data

Doc_Duke — Wed, 14 Mar 2007 23:57:10 GMT

In EG 4.1, you can use the rank task to generate decile ranks by income group and then use the filter task to remove the largest value.

Re: Getting rid of outlier data

deleted_user — Thu, 15 Mar 2007 17:34:30 GMT

automating the process could use code like below
( I used test data =sashelp.class
grouping by sex
filtering off top 10% of weight )
[prE]
* fast p90 filter;
%let data= sashelp.class ;
%let bygrp = sex ;
%let filtr = weight ;
* collect 90% level in each by group /class;
proc means data= &data noprint nway ;
var &filtr ;
class &bygrp ;
output out= work.P90S p90= p90 ;
quit;
* prepare lookup table that will return the P90
for a value of the bygroup / class var ;
data cntl;
retain fmtname 'p90s' type 'I';
set;
run;
* build the informat: class var is "range" and the
statistic is the "label" to be returned ;
proc format cntlin= cntl( rename=( &bygrp= start p90= label )); run;
* split the data on the classvar-based P90 values;
data reduced unwanted;
set &data ;
if &filtr >= input( &bygrp, p90s. ) then output unwanted;
else output reduced ;
run;
/* alternate solution using just a where filter
where &filtr < input( &bygrp, p90s. ) ;
********/
[/prE]

peterC

Re: Getting rid of outlier data

deleted_user — Thu, 19 Jun 2008 10:21:26 GMT

Hi,

How could I get rid of extreme rows ,let's say the top and the bottom 2 rows, of a table, according to the fact that their are 200 variables?

Thanks in advance,

Anais