Hello,
Thank you all so much for your help lately, forum members. I really appreciate it.
I have a large set of data that is sorted by a variable, and I would like to exclude the top 1% and bottom 1% of the data based on this ranking.
proc sql;
create table rank as
select var
from have /* "have" = your source dataset */
order by var;
quit;
What do I need in there to get rid of the top 1% and bottom 1% of that ordering?
Thanks!
Accepted Solutions
I recommend proc rank for the following 3 reasons:
1. It deals with ties appropriately.
2. It does not require a preceding proc sort.
3. It is very easy to maintain or to combine with other filter criteria.
The essence is that proc rank gives you full control over your filtering process without modifying the original dataset. Also consider the following situations:
* If your advisor or referee asks you to WINSORIZE your data instead of deleting observations, you will have no choice but to use proc rank.
* If you have filter criteria other than percentiles (and I believe this is the usual case), you can easily output a relatively small rank-only dataset and combine it with them.
* If you have many variables that require filtering, you can rank them all at once. Repeating proc sort and data deletion many times is extremely time-consuming and often leads to unwanted information loss.
As I've mentioned earlier, trimming by Nobs (either as a dataset option or via a macro) serves a single purpose well. However, it is not a good choice with possible future work in mind. I am also working with a ~10M-observation CRSP dataset in a program that is now 1,800 lines long, and I do not like leaving known pain points in my own code. I hope my personal experience helps.
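To illustrate the winsorizing point, here is a minimal sketch; the dataset name `have` and variable `x` are placeholders, and the 1%/99% cutoffs are taken from a fractional rank:

```sas
/* Sketch: winsorize x at the (approximate) 1st/99th percentiles.
   "have" and "x" are hypothetical names. */
proc rank data=have fraction ties=mean out=ranked;
   var x;
   ranks x_rnk;
run;

/* Grab the cutoff values from the retained middle 98% */
proc sql noprint;
   select min(x), max(x) into :lo, :hi
   from ranked
   where x_rnk between 0.01 and 0.99;
quit;

/* Cap extreme values instead of deleting them */
data winsorized;
   set ranked;
   x = max(min(x, &hi), &lo);
run;
```

The only change versus trimming is the final data step: extreme observations are pulled to the cutoff values rather than deleted.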
Bill
You can create a step that figures out the number of observations, which tells you how many rows make up 1%. Hint: you may use the ATTRN function.
Then use the FIRSTOBS= and OBS= dataset options to filter your data.
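A sketch of that two-step approach, assuming the data are already sorted by the ranking variable (the dataset name `have` is a placeholder):

```sas
/* Count the observations via OPEN/ATTRN, then compute the 1%/99% bounds */
data _null_;
   dsid = open("have");
   n = attrn(dsid, "nlobs");   /* number of logical (non-deleted) observations */
   rc = close(dsid);
   call symputx("firstobs", int(n*0.01) + 1);
   call symputx("obs", int(n*0.99));
run;

/* Keep only the middle 98% of the sorted dataset */
data trimmed;
   set have(firstobs=&firstobs obs=&obs);
run;
```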
I agree with Linus. If by "top and bottom 1%" you mean by data volume rather than by the data's distribution, then just derive the first and last observation numbers for that percentage. I'd expect this to perform much faster than PROC RANK on large datasets.
Something like this (code untested, so use it only as a guide):
data _null_;
if 0 then set sashelp.zipcode nobs=nobs;
call symputx("first", int(nobs*0.01) + 1);
call symputx("last", int(nobs*0.99));
run;
data trimmed;
set sashelp.zipcode firstobs=&first obs=&last;
run;
If you're pedantic, use round() instead of int().
Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
Hello,
proc rank fits your need nicely.
Here is my own code template:
/*** get percentile ***/
/* groups = 100 for percentile (new percentile variable is 0-99)*/
/* for INSERTING a rank variable to the original table, just let <dataset_in> = <dataset_out> */
proc rank data=<dataset_in> groups=100 out=<dataset_out>;
var <var1 var2 ...>;
by <grouping_vars>;
ranks <percentile1 percentile2 ...>; /*variable names of percentile variables*/
run;
In case of frequent ties or varying filter values, consider the following more general code:
/* fractional rank */
/* Notes:
   (1) fraction: output the fractional rank divided by NObs;
       nplus1:   the fractional rank divided by NObs+1
   (2) ties=mean: output the average rank value for ties */
proc rank data=<in_data> Nplus1 ties=mean out=<out_data>;
var <x y ...>;
by <Grouping_Vars>;
ranks <x_rnk y_rnk ...>;
run;
After ranking, you can easily delete the unwanted values in a later data step.
(e.g., if x_rnk > 0.99 or x_rnk < 0.01 then delete;)
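A concrete, runnable version of that pattern on made-up data (the names `have`, `x`, and `x_rnk` are illustrative):

```sas
/* Generate some fake data */
data have;
   do i = 1 to 1000;
      x = rannor(123);
      output;
   end;
run;

/* Fractional rank in (0,1) using NObs+1 as the divisor */
proc rank data=have nplus1 ties=mean out=ranked;
   var x;
   ranks x_rnk;
run;

/* Drop the top and bottom 1% by rank */
data trimmed;
   set ranked;
   if x_rnk > 0.99 or x_rnk < 0.01 then delete;
run;
```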
Hope this helps.
Bill
If your variable is normally distributed and you have a sufficiently large dataset, you could simply use SQL to select those records whose values for that variable fall within about 2.33 standard deviations of the mean (2.33 is roughly the 99th-percentile z-score, so this trims about 1% in each tail).
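A sketch of that standard-deviation filter in PROC SQL; the dataset name `have` and variable `var` are placeholders:

```sas
/* Join each row against a one-row table of summary statistics,
   then keep values within mean +/- 2.33 standard deviations */
proc sql;
   create table trimmed as
   select h.*
   from have h,
        (select mean(var) as m, std(var) as s from have) stats
   where h.var between stats.m - 2.33*stats.s
                   and stats.m + 2.33*stats.s;
quit;
```

Note this only approximates a 1% trim in each tail, and only when the normality assumption actually holds.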
The most concise way to trim a sorted dataset is probably the SET statement option Nobs=.
However, I do not recommend doing this, for 2 reasons:
(1) In many cases, especially in finance research, ranking or grouping is still needed in further analyses.
(2) For nominal or ordinal ranking variables, ties must be treated carefully. Proc rank provides all the fancy options.
For reference, here is my working test code. You can see that it trims 1000 records down to 980.
/* generate a dataset with 1000 records */
data a;
do ID=1 to 10;
do x=1 to 100;
output;
end;
end;
run;
%let trim=0.01; /* trim ratio */
/* conditional output using set option Nobs= */
data b;
set a Nobs=NN;
if &trim*NN < _N_ <= (1-&trim)*NN then output;
run; *expect 980 records in b;
This is just to demonstrate that you can also use PROC SQL to rank and trim data, although it may not be very efficient for large datasets. Also, as others have pointed out, the following approach does not handle ties very well.
/* create some fake data */
data test;
do idvar=1 to 1000;
x=ranuni(1234);
output;
end;
run;
/* trim value */
%let trim=0.01;
%put &trim;
/* Largest values ranked 1. Use b.x <= a.x for reverse ranking. */
proc sql;
create table rankd_trimd as
select a.*
, ( select count ( distinct b.x )
from test b
where b.x >= a.x
) as rankvar
, count ( * ) as obs
from test a
having obs*&trim < rankvar <= obs*(1-&trim)
order by rankvar
;
quit;
Hello all,
Thank you very much for your replies. I think I have a good place to start.
I did want to add that, as some have pointed out, I am working on finance-related problems, and my datasets are tens of millions of rows.
I don't know if that matters for the code, but I am just looking to rank all of the rows on a given variable and then figure out how to remove the top and bottom 1% of the data based on those rankings.
Said another way, I would like to throw out the observations whose values lie in the top 1% or bottom 1% of the distribution for that variable.
I will give that version a try, but if any of you have more suggestions, that would be great.
Thanks,
John
I concur with all of the reasons that others have given for using PROC RANK; it is preferable hands down. However, I just want to point out that your original question was how to use PROC SQL to trim observations based on their ranking.
I am working in the CRSP and TR databases right now and I have some questions for someone who knows their way around them. I think we may be able to send direct messages to each other on here, but if not, and you're willing to help a bit, maybe you could shoot me an email at mahler.ji@gmail.com.
I am looking for a way to filter out mutual funds that do not report holdings in consecutive quarters, and I'm not sure how to work with the dates.
Thanks for your help,
John