BookmarkSubscribeRSS Feed
abyss
Calcite | Level 5
Let's say I have a dataset with 10,000 observations consisting of two variables. "Occupation" and "income". There are 5 different occupations in the dataset.

I would like to get rid of highest 10% of observed income levels for each occupation.

Can anybody help me on how to crack this with SAS or send me a simple code?

THanks in advance
4 REPLIES 4
Olivier
Pyrite | Level 9
First count how many observations you have for each occupation, then read data skipping the last observations.

Anyway I'm not sure this is the most statistically-reliable way to filter outlier ; you'd better consider the values distribution to find a "gap" where to put a cutoff for each occupation, rather than deciding that some x % of data is "false". But this would require a lot more time than running something like :

PROC SQL ;
CREATE TABLE work.incomes AS
SELECT *,
COUNT(*) AS obsNb
FROM yourDataSet
GROUP BY occupation
ORDER BY occupation, income
;
QUIT ;
DATA work.incomes ;
SET work.incomes ;
BY occupation ;
IF FIRST.occupation THEN currNb = 0 ;
currNb + 1 ;
IF currNb <= obsNb*.99 THEN OUTPUT ;
RUN ;
Doc_Duke
Rhodochrosite | Level 12
In EG 4.1, you can use the rank task to generate decile ranks by income group and then use the filter task to remove the largest value.
deleted_user
Not applicable
automating the process could use code like below
( I used test data =sashelp.class
grouping by sex
filtering off top 10% of weight )
[prE]
* fast p90 filter;
%let data= sashelp.class ;
%let bygrp = sex ;
%let filtr = weight ;
* collect 90% level in each by group /class;
proc means data= &data noprint nway ;
var &filtr ;
class &bygrp ;
output out= work.P90S p90= p90 ;
quit;
* prepare lookup table that will return the P90
for a value of the bygroup / class var ;
data cntl;
retain fmtname 'p90s' type 'I';
set;
run;
* build the informat: class var is "range" and the
statistic is the "label" to be returned ;
proc format cntlin= cntl( rename=( &bygrp= start p90= label )); run;
* split the data on the classvar-based P90 values;
data reduced unwanted;
set &data ;
if &filtr >= input( &bygrp, p90s. ) then output unwanted;
else output reduced ;
run;
/* alternate solution using just a where filter
where &filtr < input( &bygrp, p90s. ) ;
********/
[/prE]

peterC
deleted_user
Not applicable
Hi,

How could I get rid of extreme rows ,let's say the top and the bottom 2 rows, of a table, according to the fact that their are 200 variables?

Thanks in advance,

Anais

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 2182 views
  • 0 likes
  • 4 in conversation