I don't know how to pull out the top N rows per by-group. I'm a long-time SQL querier, and this kind of thing, where N is specified, is not something SQL is good at. For example, I'd like to pull out the rows of the top 2 returns per sector. This needs to work on a large dataset (10M+ rows) where I need the top 100 values in a handful of categories. All good ideas are welcome.
data Stocks;
input Ticker :$3. Sector :$4. Return;
datalines;
ABC Tech 0.10
DEF Tech 0.30
GHI Tech 0.01
JKL Tech 0.12
YZA Food 0.01
VWX Food 0.09
STU Food 0.08
PQR Food 0.05
MNO Food 0.02
BCD Booz 0.10
EFG Booz 0.11
GMC Cars 0.01
;
/*
desired output: a dataset with the rows which are the top 2 returns in each sector even if there are fewer than N rows in the sector
DEF Tech 0.30
JKL Tech 0.12
VWX Food 0.09
STU Food 0.08
BCD Booz 0.10
EFG Booz 0.11
GMC Cars 0.01
*/
Since from your post i see you are interested in proc steps, i used PROC RANK and it is generating the desired output.
proc sort data= Stocks;
by sector;
run;
proc rank data=Stocks out=r_test descending;
by Sector;
var Return;
ranks r_Return;
run;
proc sort data=r_test;
by sector r_Return ;
where r_Return <=2;
run;
Please try the code. Hope you like it
Thanks,
Jag
What is the rule for ties?
Is your 10+ million rows sorted or indexed. Maybe on Sector?
The rule for ties: do what is easiest/most obvious. If that means including all rows that are tied so that the total count per group > N, that's OK.
The data can be sorted or indexed.
FYI, I'm pretty sure I can do this in a data-step, but that seems a little "grow you own". I'm guessing there is a proc or feature I've never heard of that will do it for me. Hence this post.
PROC UNIVARIATE has two options one or both should meet your needs.
NEXTROBS=
NEXTRVAL=
NEXT=Number of extream
You will ID statement too.
Message was edited by: data _null_
I can't find those options mentioned for proc univariate. I have SAS 9.3; could those be new options?
They are PROC statement options.
I searched for "nextobs" and found nothing. But that's ok: Jag's solution seems right.
typo it is NEXTROBS. Guess you didn't think to search for PROC UNIVARIATE.
Please try the below code by proc sql
proc sql;
create table want as select distinct a.ticker, a.sector, a.return from stocks as a left join stocks as b on (a.sector=b.sector and a.return le b.return)
group by a.sector,a.return having count(*) <=2 order by a.sector, a.return desc;
quit;
Thanks,
Jag
You know, I have used this technique before, and it does work on my sample data. I'll try it on my big dataset, but the intermediate data may become too large for my machine.
Update: after an hour, it's still running! No wonder I put that technique out of my brain.
Message was edited by: John Marino
Since from your post i see you are interested in proc steps, i used PROC RANK and it is generating the desired output.
proc sort data= Stocks;
by sector;
run;
proc rank data=Stocks out=r_test descending;
by Sector;
var Return;
ranks r_Return;
run;
proc sort data=r_test;
by sector r_Return ;
where r_Return <=2;
run;
Please try the code. Hope you like it
Thanks,
Jag
This is the right way to do it. Took 18 seconds on 500k rows. I don't know why this did not occur to me; I use proc rank pretty often. Viva la Community!
Can you move that WHERE statement to the PROC RANK step as a data set option on the OUT= dataset?
thank you data_null_, we can also use where option in the same proc rank as below
proc sort data= Stocks;
by sector;
run;
proc rank data=Stocks out=r_test(where=(r_Return <=2)) descending;
by Sector;
var Return;
ranks r_Return;
run;
proc sort data=r_test;
by sector r_Return ;
run;
Thanks,
Jag
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.