Trimming the top and bottom 1% of data in PROC SQL

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 101
Accepted Solution

Trimming the top and bottom 1% of data in PROC SQL

Hello,

Thank you all so much for your help lately, forum members.  I really appreciate it.

I have a large set of data that is sorted by a variable, and I would like to exclude the top 1% and bottom 1% of the data based on this ranking. 

proc sql;

     create table rank as

     select var

     order by var;

quit;

What do I need in there to get rid of the top 1% and bottom 1% of that ordering?

Thanks!


Accepted Solutions
Solution
‎01-25-2014 01:29 PM
Occasional Contributor
Posts: 17

Re: Trimming the top and bottom 1% of data in PROC SQL

I recommend proc rank for the following 3 reasons:


1. deals with ties appropriately

2. does not require a preceding proc sort

3. very easy to maintain or incorporate with other filter criteria

The essence is that proc rank allows you to gain full control over your filter process without modifying the original dataset. Also consider the following situation:

* If your advisor or referee asks you to WINSORIZE your data instead of deleting them, you will have no choice but using proc rank.

* If you have filter criteria other than percentiles(and I believe this is the usual case), you can easily output a relatively small rank-only dataset to incorporate with them.

* If you have many variables that require filtering, you can rank them all at once. Repeating proc sort and data deletion many times will be extremely time-wasting and always leads to unwanted information loss.

As I've mentioned earlier, trimming by Nobs (either by dataset option or macro) serves well for one single purpose. However, it is really not a good choice regarding possible future work.  I am also dealing with a ~10M obs CRSP dataset, writing a now 1800-lined code and I do not like leaving any known pain in my own ass. Hope my personal experience helps.


Bill

View solution in original post


All Replies
Super User
Posts: 5,254

Re: Trimming the top and bottom 1% of data in PROC SQL

You can create a step that figures out the no of observations, which will let you know how many rows 1% are. Hint, you may use the ATTRN function.

Then use FIRSTOBS= and OBS= data step options to filter your data.

Data never sleeps
Super Contributor
Posts: 376

Re: Trimming the top and bottom 1% of data in PROC SQL

I agree with Linus.  If by "top and bottom 1%" you mean by data volume rather than data distribution, then just derive the first and last obs for that percentage.  I'd think this would perform much faster than PROC RANK for large datasets.

Something like this (code untested, so just use it as a guide):

data _null_;

     if 0 then set sashelp.zipcode nobs=nobs;

     call symputx("first",int(1/nobs));

     call symputx("last",int( (1/nobs) * 99));

run;

data trimmed;

     set sashelp.zipcode firstobs=&first obs=&last;

run;

If you're pedantic, use round() instead of int().

Occasional Contributor
Posts: 17

Re: Trimming the top and bottom 1% of data in PROC SQL

Hello,

proc rank would simply fit your need.

There is my own code template:

/*** get percentile ***/

/* groups = 100 for percentile (new percentile variable is 0-99)*/

/* for INSERTING a rank variable to the original table, just let <dataset_in> = <dataset_out> */

proc rank data=<dataset_in> groups=100 out=<dataset_out>;

     var <var1 var2 ...>;

     by <grouping_vars>;

     ranks <percentile1 percentile2 ...>; /*variable names of percentile variables*/

run;


In case of frequent ties or varying filter values, consider the following more general code:

/* fractional rank */

* Note:

* (1) fraction: output fractional rank divided by NObs

*     nplus1: fractional rank divided by NObs+1

* (2) ties=mean: output average rank value for ties;

proc rank data=<in_data> Nplus1 ties=mean out=<out_data>;

     var <x y ...>;

     by <Grouping_Vars>;

     ranks <x_rnk y_rnk ...>;

run;


After ranking, you can easily delete the unwanted value in later data steps.

(e.g. if x_rnk>0.99 or x_rnk<0.01 then deleteSmiley Wink


Hope this helps.


Bill

PROC Star
Posts: 7,356

Re: Trimming the top and bottom 1% of data in PROC SQL

If your variable is normally distributed, and you have a sufficiently large enough dataset, you could simply use SQL to select those records whose values for that variable falls within -2.06 and +2.06 standard deviations from the mean.

Occasional Contributor
Posts: 17

Re: Trimming the top and bottom 1% of data in PROC SQL

The most concise way to trim a sorted dataset is probably utilizing set option Nobs= .

However, I do not recommend doing this for 2 reasons:

(1) In many cases, especially for finance researches, ranking or grouping is still needed in further analyses.

(2) For nominal or ordinal ranking variables, ties must be treated carefully. Proc rank provides all the fancy options.

For reference, there is my working test code. You can see that it does trim 1000 records to 980.

/* generate a dataset with 1000 records */

data a;

     do ID=1 to 10;

           do x=1 to 100;

                output;

           end;

     end;

run;

%let trim=0.01; /* trim ratio */

/* conditional output using set option Nobs= */

data b;

     set a Nobs=NN;

     if &trim*NN < _N_ <= (1-&trim)*NN then output;

run; *expect 980 records in b;

Super Contributor
Posts: 307

Re: Trimming the top and bottom 1% of data in PROC SQL

Just to demonstrate that you can also use PROC SQL to rank and trim data, although it may not be very efficient for large datasets. Also, as others have pointed out, the following approach would not deal with ties very well.

/* create some fake data */

data test;

     do idvar=1 to 1000;

       x=ranuni(1234);

     output;

       end;

run;

/* trim value */

%let trim=0.01;

%put &trim;

/* Largest values ranked 1. Use b.x <= a.x for reverse ranking. */

proc sql;

      create table rankd_trimd as

      select a.*

            , ( select count ( distinct b.x )

                  from test b

                  where b.x >= a.x

              ) as rankvar

            , count ( * ) as obs

      from test a

      having obs*&trim < rankvar <= obs*(1-&trim)

      order by rankvar

;

quit;

Frequent Contributor
Posts: 101

Re: Trimming the top and bottom 1% of data in PROC SQL

Hello all,

Thank you very much for your replies.  I think that I have a good place to start.

I did want to add that, as some have pointed out, I am working with finance related issues, and my datasets are tens of millions of rows.

I don't know if that matters with the code, but I am just looking to rank all of the rows based on a given variable, and then figure out how to get rid of the top and the bottom 1% of data, based on the rankings made by the variables.

Said another way, I would like to throw out the variables which lie in the top 1% and bottom 1% of the distribution for that variable. 

I will give 's version a try, but if any of you have more suggestions, that would be great.

Thanks,

John

Solution
‎01-25-2014 01:29 PM
Occasional Contributor
Posts: 17

Re: Trimming the top and bottom 1% of data in PROC SQL

I recommend proc rank for the following 3 reasons:


1. deals with ties appropriately

2. does not require a preceding proc sort

3. very easy to maintain or incorporate with other filter criteria

The essence is that proc rank allows you to gain full control over your filter process without modifying the original dataset. Also consider the following situation:

* If your advisor or referee asks you to WINSORIZE your data instead of deleting them, you will have no choice but using proc rank.

* If you have filter criteria other than percentiles(and I believe this is the usual case), you can easily output a relatively small rank-only dataset to incorporate with them.

* If you have many variables that require filtering, you can rank them all at once. Repeating proc sort and data deletion many times will be extremely time-wasting and always leads to unwanted information loss.

As I've mentioned earlier, trimming by Nobs (either by dataset option or macro) serves well for one single purpose. However, it is really not a good choice regarding possible future work.  I am also dealing with a ~10M obs CRSP dataset, writing a now 1800-lined code and I do not like leaving any known pain in my own ass. Hope my personal experience helps.


Bill

Frequent Contributor
Posts: 101

Re: Trimming the top and bottom 1% of data in PROC SQL

Thank you very much for your help.  I am working within a CRSP dataset as well, so I think that your help is especially poignant. 

I will do this using PROC RANK.

If I have any more questions, I will be sure to as you.

Thanks again,

John

Super Contributor
Posts: 307

Re: Trimming the top and bottom 1% of data in PROC SQL

I concur with all of the reasons that others have given, re: using PROC RANK, etc. PROC RANK is preferable hands down. However, I just want to point out that your original question was how to use PROC SQL to trim variables based on their ranking.

Frequent Contributor
Posts: 101

Re: Trimming the top and bottom 1% of data in PROC SQL

Hey ,

I am working in the CRSP and TR databases right now and I have some questions for someone that knows their way around these.  I think that we may be able to send direct messages to each other on here, but if not and you're willing to help for a bit maybe you could shoot me an email at mahler.ji@gmail.com.

I am looking for a way to filter out mutual funds that do not report holdings in consecutive quarters, and I'm not sure how to work with the dates.

Thanks for your help,

John

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 11 replies
  • 4699 views
  • 7 likes
  • 6 in conversation