top n% per group

rcogan · Posted 11-04-2019 12:35 PM

Relatively new SAS user (Enterprise Guide 7.1) and programmer in general so thanks in advance for bearing with me.

I am looking to find and flag the top %n of members in a group. My dataset roughly resembles this (but think hundreds of millions of observations):

member_ID Group Cost

1 A $10

2 A $8

3 A $1

4 A $0

5 B $15

6 B $10

7 B $7

8 C $13

9 C $5

10 C $0

I am trying to find the top 5% of members per group and seeing how much they collectively cost. The data set is ordered so that cost is descending per group, so it should be as easy as slicing off the top and flagging them. My example above is bad because you can't get 5% from only a couple observations, but hopefully you get the idea.

This was the code I used before I added group into the mix and it gave me a new dataset with the top 5% of total members successfully.

data top5;
do _N_=1 to round(nobs*.05);
set my_data_set point=_N_ nobs=nobs;
output top5;
end;
stop;
run;

PaigeMiller · Posted 11-04-2019 12:45 PM

If you use PROC RANK with the option GROUPS = 100, each observation in each group will get a rank between 0 and 99 (or maybe it 1 to 100, I don't remember exactly). Then if you want the top 23%, you find all observations where the rank is <=22. This way, you also have control over how to handle ties.

https://documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.4&docsetId=proc&docsetTarget=p0le3p5...

--
Paige Miller

PGStats · Posted 11-04-2019 02:57 PM

Your question is not clear.

If you want the top costs which taken together amount to less than 5% of the total cost then

data want;
do until(last.member_id);
	set have; by member_id;
 	totalCost = sum(totalCost, cost);
 	end;
do until(last.member_id);
	set have; by member_id;
    cumCost = sum(cumCost, cost);
    if cumCost <= 0.05 * totalCost then output;
    end;
drop totalCost cumCost;
run;

If you want all costs that individually amount to less than 5% of the total cost then:

data want;
do until(last.member_id);
	set have; by member_id;
 	totalCost = sum(totalCost, cost);
 	end;
do until(last.member_id);
	set have; by member_id;
    if cost <= 0.05 * totalCost then output;
    end;
drop totalCost;
run;

PG

rcogan · Posted 11-07-2019 03:55 PM

Thanks for your reply. After some more thinking I realize I want the first 5% of observations that appear in the dataset, for each of the various sized groups. Which I think takes the cost variable out of the equation.

For example, for a group that has 100 members, I want to flag the first 5. For the group has 25,643 members I want to flag the first 1,282 which appear in the dataset under that group.

I hope that clears things up! Thanks.

Patrick · Posted 11-07-2019 05:11 PM

@rcogan

Do you understand the code @PGStats posted?

You need only replace the summing of cost with a count of obs in the first loop and then in the second loop only output the first n% of the group count of obs.

rcogan · Posted 11-07-2019 05:46 PM

Thanks for walking me through that

top n% per group

Re: top n% per group

Re: top n% per group

Re: top n% per group

Re: top n% per group

Re: top n% per group

top n% per group

Re: top n% per group

Re: top n% per group

Re: top n% per group

Re: top n% per group

Re: top n% per group

SAS Innovate 2025: Register Now