turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- BI
- /
- Enterprise Guide
- /
- Getting rid of outlier data

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

03-06-2007 10:33 AM

Let's say I have a dataset with 10,000 observations consisting of two variables. "Occupation" and "income". There are 5 different occupations in the dataset.

I would like to get rid of highest 10% of observed income levels for each occupation.

Can anybody help me on how to crack this with SAS or send me a simple code?

THanks in advance

I would like to get rid of highest 10% of observed income levels for each occupation.

Can anybody help me on how to crack this with SAS or send me a simple code?

THanks in advance

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

03-07-2007 10:47 AM

First count how many observations you have for each occupation, then read data skipping the last observations.

Anyway I'm not sure this is the most statistically-reliable way to filter outlier ; you'd better consider the values distribution to find a "gap" where to put a cutoff for each occupation, rather than deciding that some x % of data is "false". But this would require a lot more time than running something like :

PROC SQL ;

CREATE TABLE work.incomes AS

SELECT *,

COUNT(*) AS obsNb

FROM yourDataSet

GROUP BY occupation

ORDER BY occupation, income

;

QUIT ;

DATA work.incomes ;

SET work.incomes ;

BY occupation ;

IF FIRST.occupation THEN currNb = 0 ;

currNb + 1 ;

IF currNb <= obsNb*.99 THEN OUTPUT ;

RUN ;

Anyway I'm not sure this is the most statistically-reliable way to filter outlier ; you'd better consider the values distribution to find a "gap" where to put a cutoff for each occupation, rather than deciding that some x % of data is "false". But this would require a lot more time than running something like :

PROC SQL ;

CREATE TABLE work.incomes AS

SELECT *,

COUNT(*) AS obsNb

FROM yourDataSet

GROUP BY occupation

ORDER BY occupation, income

;

QUIT ;

DATA work.incomes ;

SET work.incomes ;

BY occupation ;

IF FIRST.occupation THEN currNb = 0 ;

currNb + 1 ;

IF currNb <= obsNb*.99 THEN OUTPUT ;

RUN ;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

03-14-2007 07:57 PM

In EG 4.1, you can use the rank task to generate decile ranks by income group and then use the filter task to remove the largest value.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

03-15-2007 01:34 PM

automating the process could use code like below

( I used test data =sashelp.class

grouping by sex

filtering off top 10% of weight )

[prE]

* fast p90 filter;

%let data= sashelp.class ;

%let bygrp = sex ;

%let filtr = weight ;

* collect 90% level in each by group /class;

proc means data= &data noprint nway ;

var &filtr ;

class &bygrp ;

output out= work.P90S p90= p90 ;

quit;

* prepare lookup table that will return the P90

for a value of the bygroup / class var ;

data cntl;

retain fmtname 'p90s' type 'I';

set;

run;

* build the informat: class var is "range" and the

statistic is the "label" to be returned ;

proc format cntlin= cntl( rename=( &bygrp= start p90= label )); run;

* split the data on the classvar-based P90 values;

data reduced unwanted;

set &data ;

if &filtr >= input( &bygrp, p90s. ) then output unwanted;

else output reduced ;

run;

/* alternate solution using just a where filter

where &filtr < input( &bygrp, p90s. ) ;

********/

[/prE]

peterC

( I used test data =sashelp.class

grouping by sex

filtering off top 10% of weight )

[prE]

* fast p90 filter;

%let data= sashelp.class ;

%let bygrp = sex ;

%let filtr = weight ;

* collect 90% level in each by group /class;

proc means data= &data noprint nway ;

var &filtr ;

class &bygrp ;

output out= work.P90S p90= p90 ;

quit;

* prepare lookup table that will return the P90

for a value of the bygroup / class var ;

data cntl;

retain fmtname 'p90s' type 'I';

set;

run;

* build the informat: class var is "range" and the

statistic is the "label" to be returned ;

proc format cntlin= cntl( rename=( &bygrp= start p90= label )); run;

* split the data on the classvar-based P90 values;

data reduced unwanted;

set &data ;

if &filtr >= input( &bygrp, p90s. ) then output unwanted;

else output reduced ;

run;

/* alternate solution using just a where filter

where &filtr < input( &bygrp, p90s. ) ;

********/

[/prE]

peterC

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

06-19-2008 06:21 AM

Hi,

How could I get rid of extreme rows ,let's say the top and the bottom 2 rows, of a table, according to the fact that their are 200 variables?

Thanks in advance,

Anais

How could I get rid of extreme rows ,let's say the top and the bottom 2 rows, of a table, according to the fact that their are 200 variables?

Thanks in advance,

Anais