Solved: Re: Top N rows per by-group

jdmarino · Posted 02-12-2014 05:10 PM

I don't know how to pull out the top N rows per by-group. I'm a long-time SQL querier, and this kind of thing, where N is specified, is not something SQL is good at. For example, I'd like to pull out the rows of the top 2 returns per sector. This needs to work on a large dataset (10M+ rows) where I need the top 100 values in a handful of categories. All good ideas are welcome.

data Stocks;

input Ticker :$3. Sector :$4. Return;

datalines;

ABC Tech 0.10

DEF Tech 0.30

GHI Tech 0.01

JKL Tech 0.12

YZA Food 0.01

VWX Food 0.09

STU Food 0.08

PQR Food 0.05

MNO Food 0.02

BCD Booz 0.10

EFG Booz 0.11

GMC Cars 0.01

;

/*

desired output: a dataset with the rows which are the top 2 returns in each sector even if there are fewer than N rows in the sector

DEF Tech 0.30

JKL Tech 0.12

VWX Food 0.09

STU Food 0.08

BCD Booz 0.10

EFG Booz 0.11

GMC Cars 0.01

*/

Jagadishkatam · Posted 02-13-2014 09:34 AM

Since from your post i see you are interested in proc steps, i used PROC RANK and it is generating the desired output.

proc sort data= Stocks;

by sector;

run;

proc rank data=Stocks out=r_test descending;

by Sector;

var Return;

ranks r_Return;

run;

proc sort data=r_test;

by sector r_Return ;

where r_Return <=2;

run;

Please try the code. Hope you like it

Thanks,

Jag

Thanks,
Jag

View solution in original post

data_null__ · Posted 02-12-2014 06:00 PM

What is the rule for ties?

Is your 10+ million rows sorted or indexed. Maybe on Sector?

jdmarino · Posted 02-13-2014 07:34 AM

The rule for ties: do what is easiest/most obvious. If that means including all rows that are tied so that the total count per group > N, that's OK.

The data can be sorted or indexed.

FYI, I'm pretty sure I can do this in a data-step, but that seems a little "grow you own". I'm guessing there is a proc or feature I've never heard of that will do it for me. Hence this post.

data_null__ · Posted 02-13-2014 08:19 AM

PROC UNIVARIATE has two options one or both should meet your needs.

NEXTROBS=

NEXTRVAL=

NEXT=Number of extream

You will ID statement too.

Message was edited by: data _null_

jdmarino · Posted 02-13-2014 08:59 AM

I can't find those options mentioned for proc univariate. I have SAS 9.3; could those be new options?

data_null__ · Posted 02-13-2014 09:09 AM

They are PROC statement options.

jdmarino · Posted 02-13-2014 09:46 AM

http://support.sas.com/cdlsearch?charset=iso-8859-1&nh=25&ct=80000&qt=nextobs++spr%3A%229.3qq_%22&sa...

I searched for "nextobs" and found nothing. But that's ok: Jag's solution seems right.

data_null__ · Posted 02-13-2014 10:33 AM

typo it is NEXTROBS. Guess you didn't think to search for PROC UNIVARIATE.

Jagadishkatam · Posted 02-13-2014 01:08 AM

Please try the below code by proc sql

proc sql;

create table want as select distinct a.ticker, a.sector, a.return from stocks as a left join stocks as b on (a.sector=b.sector and a.return le b.return)

group by a.sector,a.return having count(*) <=2 order by a.sector, a.return desc;

quit;

Thanks,

Jag

Thanks,
Jag

jdmarino · Posted 02-13-2014 07:55 AM

You know, I have used this technique before, and it does work on my sample data. I'll try it on my big dataset, but the intermediate data may become too large for my machine.

Update: after an hour, it's still running! No wonder I put that technique out of my brain.

Message was edited by: John Marino

Jagadishkatam · Posted 02-13-2014 09:34 AM

Since from your post i see you are interested in proc steps, i used PROC RANK and it is generating the desired output.

proc sort data= Stocks;

by sector;

run;

proc rank data=Stocks out=r_test descending;

by Sector;

var Return;

ranks r_Return;

run;

proc sort data=r_test;

by sector r_Return ;

where r_Return <=2;

run;

Please try the code. Hope you like it

Thanks,

Jag

Thanks,
Jag

jdmarino · Posted 02-13-2014 10:13 AM

This is the right way to do it. Took 18 seconds on 500k rows. I don't know why this did not occur to me; I use proc rank pretty often. Viva la Community!

data_null__ · Posted 02-13-2014 10:43 AM

Can you move that WHERE statement to the PROC RANK step as a data set option on the OUT= dataset?

Jagadishkatam · Posted 02-13-2014 07:55 PM

thank you data_null_, we can also use where option in the same proc rank as below

proc sort data= Stocks;

by sector;

run;

proc rank data=Stocks out=r_test(where=(r_Return <=2)) descending;

by Sector;

var Return;

ranks r_Return;

run;

proc sort data=r_test;

by sector r_Return ;

run;

Thanks,

Jag

Thanks,
Jag

SAS Innovate 2025: Call for Content

Classroom Training Available!