Relatively new SAS user (Enterprise Guide 7.1) and programmer in general so thanks in advance for bearing with me.
I am looking to find and flag the top %n of members in a group. My dataset roughly resembles this (but think hundreds of millions of observations):
member_ID Group Cost
1 A $10
2 A $8
3 A $1
4 A $0
5 B $15
6 B $10
7 B $7
8 C $13
9 C $5
10 C $0
I am trying to find the top 5% of members per group and seeing how much they collectively cost. The data set is ordered so that cost is descending per group, so it should be as easy as slicing off the top and flagging them. My example above is bad because you can't get 5% from only a couple observations, but hopefully you get the idea.
This was the code I used before I added group into the mix and it gave me a new dataset with the top 5% of total members successfully.
data top5;
do _N_=1 to round(nobs*.05);
set my_data_set point=_N_ nobs=nobs;
output top5;
end;
stop;
run;
If you use PROC RANK with the option GROUPS = 100, each observation in each group will get a rank between 0 and 99 (or maybe it 1 to 100, I don't remember exactly). Then if you want the top 23%, you find all observations where the rank is <=22. This way, you also have control over how to handle ties.
Your question is not clear.
If you want the top costs which taken together amount to less than 5% of the total cost then
data want;
do until(last.member_id);
set have; by member_id;
totalCost = sum(totalCost, cost);
end;
do until(last.member_id);
set have; by member_id;
cumCost = sum(cumCost, cost);
if cumCost <= 0.05 * totalCost then output;
end;
drop totalCost cumCost;
run;
If you want all costs that individually amount to less than 5% of the total cost then:
data want;
do until(last.member_id);
set have; by member_id;
totalCost = sum(totalCost, cost);
end;
do until(last.member_id);
set have; by member_id;
if cost <= 0.05 * totalCost then output;
end;
drop totalCost;
run;
Thanks for your reply. After some more thinking I realize I want the first 5% of observations that appear in the dataset, for each of the various sized groups. Which I think takes the cost variable out of the equation.
For example, for a group that has 100 members, I want to flag the first 5. For the group has 25,643 members I want to flag the first 1,282 which appear in the dataset under that group.
I hope that clears things up! Thanks.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.