Desktop productivity for business analysts and programmers

Getting rid of outlier data

Reply
N/A
Posts: 1

Getting rid of outlier data

Let's say I have a dataset with 10,000 observations consisting of two variables. "Occupation" and "income". There are 5 different occupations in the dataset.

I would like to get rid of highest 10% of observed income levels for each occupation.

Can anybody help me on how to crack this with SAS or send me a simple code?

THanks in advance
Super Contributor
Posts: 260

Re: Getting rid of outlier data

First count how many observations you have for each occupation, then read data skipping the last observations.

Anyway I'm not sure this is the most statistically-reliable way to filter outlier ; you'd better consider the values distribution to find a "gap" where to put a cutoff for each occupation, rather than deciding that some x % of data is "false". But this would require a lot more time than running something like :

PROC SQL ;
CREATE TABLE work.incomes AS
SELECT *,
COUNT(*) AS obsNb
FROM yourDataSet
GROUP BY occupation
ORDER BY occupation, income
;
QUIT ;
DATA work.incomes ;
SET work.incomes ;
BY occupation ;
IF FIRST.occupation THEN currNb = 0 ;
currNb + 1 ;
IF currNb <= obsNb*.99 THEN OUTPUT ;
RUN ;
Trusted Advisor
Posts: 2,114

Re: Getting rid of outlier data

In EG 4.1, you can use the rank task to generate decile ranks by income group and then use the filter task to remove the largest value.
N/A
Posts: 0

Re: Getting rid of outlier data

automating the process could use code like below
( I used test data =sashelp.class
grouping by sex
filtering off top 10% of weight )
[prE]
* fast p90 filter;
%let data= sashelp.class ;
%let bygrp = sex ;
%let filtr = weight ;
* collect 90% level in each by group /class;
proc means data= &data noprint nway ;
var &filtr ;
class &bygrp ;
output out= work.P90S p90= p90 ;
quit;
* prepare lookup table that will return the P90
for a value of the bygroup / class var ;
data cntl;
retain fmtname 'p90s' type 'I';
set;
run;
* build the informat: class var is "range" and the
statistic is the "label" to be returned ;
proc format cntlin= cntl( rename=( &bygrp= start p90= label )); run;
* split the data on the classvar-based P90 values;
data reduced unwanted;
set &data ;
if &filtr >= input( &bygrp, p90s. ) then output unwanted;
else output reduced ;
run;
/* alternate solution using just a where filter
where &filtr < input( &bygrp, p90s. ) ;
********/
[/prE]

peterC
N/A
Posts: 0

Re: Getting rid of outlier data

Hi,

How could I get rid of extreme rows ,let's say the top and the bottom 2 rows, of a table, according to the fact that their are 200 variables?

Thanks in advance,

Anais
Ask a Question
Discussion stats
  • 4 replies
  • 657 views
  • 0 likes
  • 4 in conversation