mahler_ji
Obsidian | Level 7

Hello,

Thank you all so much for your help lately, forum members.  I really appreciate it.

I have a large set of data that is sorted by a variable, and I would like to exclude the top 1% and bottom 1% of the data based on this ranking. 

proc sql;
     create table rank as
     select var
     from <your_table>
     order by var;
quit;

What do I need in there to get rid of the top 1% and bottom 1% of that ordering?

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
bill0101
Calcite | Level 5

I recommend proc rank for the following 3 reasons:


1. deals with ties appropriately

2. does not require a preceding proc sort

3. very easy to maintain or incorporate with other filter criteria

The essence is that proc rank allows you to gain full control over your filter process without modifying the original dataset. Also consider the following situation:

* If your advisor or referee asks you to WINSORIZE your data instead of deleting observations, you will have no choice but to use proc rank.

* If you have filter criteria other than percentiles (and I believe this is the usual case), you can easily output a relatively small rank-only dataset to combine with them.

* If you have many variables that require filtering, you can rank them all at once. Repeating proc sort and data deletion many times wastes a great deal of time and can lead to unwanted information loss.

As I've mentioned earlier, trimming by Nobs (either by dataset option or macro) serves one single purpose well. However, it is not a good choice with possible future work in mind. I am also working with a ~10M-observation CRSP dataset in a program that is now about 1,800 lines long, and I do not like leaving known trouble spots behind. Hope my personal experience helps.
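For reference, one common SAS pattern for the winsorizing case mentioned above is a sketch along these lines (dataset `have` and variable `x` are placeholder names, and this particular sketch gets the cutoff values from proc univariate rather than proc rank):

```sas
/* Winsorize at the 1st/99th percentiles instead of deleting:        */
/* compute the cutoffs, then cap the extreme values at them.         */
proc univariate data=have noprint;
     var x;
     output out=cuts pctlpts=1 99 pctlpre=p;  /* creates p1 and p99 */
run;

data winsorized;
     if _n_=1 then set cuts;        /* p1 and p99 are retained onto every row */
     set have;
     if x < p1 then x = p1;         /* raise bottom 1% to the 1st percentile  */
     else if x > p99 then x = p99;  /* lower top 1% to the 99th percentile    */
run;
```

The same cutoffs work for trimming as well: replace the capping assignments with `if x < p1 or x > p99 then delete;`.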


Bill


11 REPLIES
LinusH
Tourmaline | Level 20

You can create a step that figures out the number of observations, which tells you how many rows 1% is. Hint: you can use the ATTRN function.

Then use FIRSTOBS= and OBS= data step options to filter your data.
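A minimal sketch of that approach (dataset names `work.sorted` and `trimmed` are placeholders; it assumes the data are already sorted by the variable of interest):

```sas
/* Open the sorted dataset, read its observation count with ATTRN, */
/* then compute FIRSTOBS=/OBS= bounds that drop 1% in each tail.   */
data _null_;
     dsid = open("work.sorted");
     n    = attrn(dsid, "nlobs");          /* number of logical observations */
     rc   = close(dsid);
     call symputx("first", ceil(n*0.01) + 1);  /* first row to keep */
     call symputx("last",  floor(n*0.99));     /* last row to keep  */
run;

data trimmed;
     set work.sorted firstobs=&first obs=&last;
run;
```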

Data never sleeps
ScottBass
Rhodochrosite | Level 12

I agree with Linus.  If by "top and bottom 1%" you mean by data volume rather than data distribution, then just derive the first and last obs for that percentage.  I'd think this would perform much faster than PROC RANK for large datasets.

Something like this (code untested, so just use it as a guide):

data _null_;
     if 0 then set sashelp.zipcode nobs=nobs;
     call symputx("first",int(nobs*0.01));
     call symputx("last",int(nobs*0.99));
run;

data trimmed;
     set sashelp.zipcode firstobs=&first obs=&last;
run;

If you're pedantic, use round() instead of int().


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
bill0101
Calcite | Level 5

Hello,

proc rank would simply fit your need.

Here is my own code template:

/*** get percentile ***/
/* groups=100 for percentiles (the new percentile variable runs 0-99) */
/* to INSERT a rank variable into the original table, just let <dataset_in> = <dataset_out> */
proc rank data=<dataset_in> groups=100 out=<dataset_out>;
     var <var1 var2 ...>;
     by <grouping_vars>;
     ranks <percentile1 percentile2 ...>; /* names of the new percentile variables */
run;


In case of frequent ties or varying filter values, consider the following more general code:

/* fractional rank */
/* Note:
   (1) fraction: output the fractional rank divided by NObs
       nplus1:   fractional rank divided by NObs+1
   (2) ties=mean: output the average rank value for ties */
proc rank data=<in_data> Nplus1 ties=mean out=<out_data>;
     var <x y ...>;
     by <Grouping_Vars>;
     ranks <x_rnk y_rnk ...>;
run;


After ranking, you can easily delete the unwanted values in a later data step.

(e.g. if x_rnk > 0.99 or x_rnk < 0.01 then delete;)


Hope this helps.


Bill

art297
Opal | Level 21

If your variable is normally distributed and your dataset is sufficiently large, you could simply use SQL to select those records whose values fall within about 2.33 standard deviations of the mean (the z-score that cuts roughly 1% in each tail of a normal distribution).
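A hedged sketch of that idea (`have` and `x` are placeholder names; 2.326 is the normal z-score for the 1st/99th percentiles, and PROC SQL remerges the summary statistics back onto every row):

```sas
/* Keep only rows within ~2.33 standard deviations of the mean, */
/* i.e. roughly the middle 98% under a normality assumption.    */
proc sql;
     create table trimmed as
     select *
     from have
     having x between mean(x) - 2.326*std(x)
                  and mean(x) + 2.326*std(x);
quit;
```

Note this trims by the fitted normal tails, not by the empirical percentiles, so the counts will differ from an exact 1% trim when the data are skewed.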

bill0101
Calcite | Level 5

The most concise way to trim a sorted dataset is probably the set option Nobs=.

However, I do not recommend doing this, for 2 reasons:

(1) In many cases, especially in finance research, ranking or grouping is still needed in further analyses.

(2) For nominal or ordinal ranking variables, ties must be treated carefully. Proc rank provides all the options you need.

For reference, here is my working test code. You can see that it trims 1000 records down to 980.

/* generate a dataset with 1000 records */
data a;
     do ID=1 to 10;
          do x=1 to 100;
               output;
          end;
     end;
run;

%let trim=0.01; /* trim ratio */

/* conditional output using set option Nobs= */
data b;
     set a Nobs=NN;
     if &trim*NN < _N_ <= (1-&trim)*NN then output;
run; * expect 980 records in b;

Fugue
Quartz | Level 8

Just to demonstrate that you can also use PROC SQL to rank and trim data, although it may not be very efficient for large datasets. Also, as others have pointed out, the following approach would not deal with ties very well.

/* create some fake data */
data test;
     do idvar=1 to 1000;
          x=ranuni(1234);
          output;
     end;
run;

/* trim value */
%let trim=0.01;
%put &trim;

/* Largest values ranked 1. Use b.x <= a.x for reverse ranking. */
proc sql;
     create table rankd_trimd as
     select a.*
          , ( select count(distinct b.x)
              from test b
              where b.x >= a.x
            ) as rankvar
          , count(*) as obs
     from test a
     having obs*&trim < rankvar <= obs*(1-&trim)
     order by rankvar
     ;
quit;

mahler_ji
Obsidian | Level 7

Hello all,

Thank you very much for your replies.  I think that I have a good place to start.

I did want to add that, as some have pointed out, I am working with finance related issues, and my datasets are tens of millions of rows.

I don't know if that matters for the code, but I am just looking to rank all of the rows based on a given variable, and then figure out how to get rid of the top and bottom 1% of the data based on those rankings.

Said another way, I would like to throw out the observations that lie in the top 1% or bottom 1% of the distribution for that variable.

I will give that version a try, but if any of you have more suggestions, that would be great.

Thanks,

John

mahler_ji
Obsidian | Level 7

Thank you very much for your help.  I am working with a CRSP dataset as well, so your advice is especially pertinent.

I will do this using PROC RANK.

If I have any more questions, I will be sure to ask you.

Thanks again,

John

Fugue
Quartz | Level 8

I concur with all of the reasons that others have given, re: using PROC RANK, etc. PROC RANK is preferable hands down. However, I just want to point out that your original question was how to use PROC SQL to trim variables based on their ranking.

mahler_ji
Obsidian | Level 7

Hey,

I am working in the CRSP and TR databases right now and I have some questions for someone that knows their way around these.  I think that we may be able to send direct messages to each other on here, but if not and you're willing to help for a bit maybe you could shoot me an email at mahler.ji@gmail.com.

I am looking for a way to filter out mutual funds that do not report holdings in consecutive quarters, and I'm not sure how to work with the dates.

Thanks for your help,

John


Discussion stats
  • 11 replies
  • 10802 views
  • 7 likes
  • 6 in conversation