BookmarkSubscribeRSS Feed
rcogan
Fluorite | Level 6

Relatively new SAS user (Enterprise Guide 7.1) and programmer in general so thanks in advance for bearing with me.

 

I am looking to find and flag the top %n of members in a group. My dataset roughly resembles this (but think hundreds of millions of observations):

member_ID         Group        Cost

  1                            A             $10

  2                            A             $8

  3                            A             $1

  4                            A             $0

  5                            B             $15

  6                            B             $10

  7                            B             $7

  8                            C             $13

  9                            C             $5

  10                          C             $0

 

 

I am trying to find the top 5% of members per group and seeing how much they collectively cost. The data set is ordered so that cost is descending per group, so it should be as easy as slicing off the top and flagging them. My example above is bad because you can't get 5% from only a couple observations, but hopefully you get the idea.

 

This was the code I used before I added group into the mix and it gave me a new dataset with the top 5% of total members successfully. 

 

data top5;
do _N_=1 to round(nobs*.05);
set my_data_set point=_N_ nobs=nobs;
output top5;
end;
stop;
run;

5 REPLIES 5
PaigeMiller
Diamond | Level 26

If you use PROC RANK with the option GROUPS  = 100, each observation in each group will get a rank between 0 and 99 (or maybe it 1 to 100, I don't remember exactly). Then if you want the top 23%, you find all observations where the rank is <=22. This way, you also have control over how to handle ties.

 

https://documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.4&docsetId=proc&docsetTarget=p0le3p5...

--
Paige Miller
PGStats
Opal | Level 21

Your question is not clear.

 

If you want the top costs which taken together amount to less than 5% of the total cost then

 

data want;
do until(last.member_id);
	set have; by member_id;
 	totalCost = sum(totalCost, cost);
 	end;
do until(last.member_id);
	set have; by member_id;
    cumCost = sum(cumCost, cost);
    if cumCost <= 0.05 * totalCost then output;
    end;
drop totalCost cumCost;
run;

If you want all costs that individually amount to less than 5% of the total cost then:

 

data want;
do until(last.member_id);
	set have; by member_id;
 	totalCost = sum(totalCost, cost);
 	end;
do until(last.member_id);
	set have; by member_id;
    if cost <= 0.05 * totalCost then output;
    end;
drop totalCost;
run;
PG
rcogan
Fluorite | Level 6

Thanks for your reply. After some more thinking I realize I want the first 5% of observations that appear in the dataset, for each of the various sized groups. Which I think takes the cost variable out of the equation. 

 

For example, for a group that has 100 members, I want to flag the first 5. For the group has 25,643 members I want to flag the first 1,282 which appear in the dataset under that group.

 

I hope that clears things up! Thanks.

Patrick
Opal | Level 21

@rcogan 

Do you understand the code @PGStats posted?

You need only replace the summing of cost with a count of obs in the first loop and then in the second loop only output the first n% of the group count of obs.

rcogan
Fluorite | Level 6
Thanks for walking me through that

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 5 replies
  • 800 views
  • 5 likes
  • 4 in conversation