DATA Step, Macro, Functions and more

Top N rows per by-group

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 78
Accepted Solution

Top N rows per by-group

I don't know how to pull out the top N rows per by-group.  I'm a long-time SQL querier, and this kind of thing, where N is specified, is not something SQL is good at.  For example, I'd like to pull out the rows of the top 2 returns per sector.  This needs to work on a large dataset (10M+ rows) where I need the top 100 values in a handful of categories.  All good ideas are welcome.

data Stocks;

    input Ticker :$3. Sector :$4. Return;

    datalines;

ABC Tech 0.10

DEF Tech 0.30

GHI Tech 0.01

JKL Tech 0.12

YZA Food 0.01

VWX Food 0.09

STU Food 0.08

PQR Food 0.05

MNO Food 0.02

BCD Booz 0.10

EFG Booz 0.11

GMC Cars 0.01

;

/*

desired output: a dataset with the rows which are the top 2 returns in each sector even if there are fewer than N rows in the sector

DEF Tech 0.30

JKL Tech 0.12

VWX Food 0.09

STU Food 0.08

BCD Booz 0.10

EFG Booz 0.11

GMC Cars 0.01

*/


Accepted Solutions
Solution
‎02-13-2014 09:34 AM
Trusted Advisor
Posts: 1,128

Re: Top N rows per by-group

Since from your post i see you are interested in proc steps, i used PROC RANK  and it is generating the desired output.

proc sort data=    Stocks;

    by sector;

run;

proc rank data=Stocks out=r_test descending;

by Sector;

var Return;

ranks r_Return;

run;

proc sort data=r_test;

    by sector r_Return ;

    where r_Return <=2;

run;

Please try the code. Hope you like it Smiley Happy

Thanks,

Jag

Thanks,
Jag

View solution in original post


All Replies
Respected Advisor
Posts: 3,777

Re: Top N rows per by-group

What is the rule for ties?

Is your 10+ million rows sorted or indexed.  Maybe on Sector?

Frequent Contributor
Posts: 78

Re: Top N rows per by-group

The rule for ties: do what is easiest/most obvious.  If that means including all rows that are tied so that the total count per group > N, that's OK.

The data can be sorted or indexed.

FYI, I'm pretty sure I can do this in a data-step, but that seems a little "grow you own".  I'm guessing there is a proc or feature I've never heard of that will do it for me.  Hence this post.

Respected Advisor
Posts: 3,777

Re: Top N rows per by-group

PROC UNIVARIATE has two options one or both should meet your needs.

NEXTROBS=

NEXTRVAL=

NEXT=Number of extream

You will ID statement too.

Message was edited by: data _null_

Frequent Contributor
Posts: 78

Re: Top N rows per by-group

I can't find those options mentioned for proc univariate.  I have SAS 9.3; could those be new options?

Respected Advisor
Posts: 3,777

Re: Top N rows per by-group

They are PROC statement options.

Frequent Contributor
Posts: 78

Re: Top N rows per by-group

http://support.sas.com/cdlsearch?charset=iso-8859-1&nh=25&ct=80000&qt=nextobs++spr%3A%229.3qq_%22&sa...

I searched for "nextobs" and found nothing.  But that's ok: Jag's solution seems right.

Respected Advisor
Posts: 3,777

Re: Top N rows per by-group

typo it is NEXTROBS.  Guess you didn't think to search for PROC UNIVARIATE.

Trusted Advisor
Posts: 1,128

Re: Top N rows per by-group

Please try the below code by proc sql

proc sql;

  create table want as select distinct a.ticker, a.sector, a.return from stocks as a left join stocks as b on (a.sector=b.sector and a.return le b.return)

  group by a.sector,a.return having count(*) <=2 order by a.sector, a.return desc;

quit;

Thanks,

Jag

Thanks,
Jag
Frequent Contributor
Posts: 78

Re: Top N rows per by-group

You know, I have used this technique before, and it does work on my sample data.  I'll try it on my big dataset, but the intermediate data may become too large for my machine.

Update: after an hour, it's still running!  No wonder I put that technique out of my brain. 

Message was edited by: John Marino

Solution
‎02-13-2014 09:34 AM
Trusted Advisor
Posts: 1,128

Re: Top N rows per by-group

Since from your post i see you are interested in proc steps, i used PROC RANK  and it is generating the desired output.

proc sort data=    Stocks;

    by sector;

run;

proc rank data=Stocks out=r_test descending;

by Sector;

var Return;

ranks r_Return;

run;

proc sort data=r_test;

    by sector r_Return ;

    where r_Return <=2;

run;

Please try the code. Hope you like it Smiley Happy

Thanks,

Jag

Thanks,
Jag
Frequent Contributor
Posts: 78

Re: Top N rows per by-group

This is the right way to do it.  Took 18 seconds on 500k rows.  I don't know why this did not occur to me; I use proc rank pretty often.  Viva la Community!

Respected Advisor
Posts: 3,777

Re: Top N rows per by-group

Can you move that WHERE statement to the PROC RANK step as a data set option on the OUT= dataset?

Trusted Advisor
Posts: 1,128

Re: Top N rows per by-group

thank you data_null_, we can also use where option in the same proc rank as below

proc sort data=    Stocks;

    by sector;

run;

proc rank data=Stocks out=r_test(where=(r_Return <=2)) descending;

by Sector;

var Return;

ranks r_Return;

run;

proc sort data=r_test;

    by sector r_Return ;

run;

Thanks,

Jag

Thanks,
Jag
☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 13 replies
  • 11594 views
  • 4 likes
  • 3 in conversation