Hi,
I have a dataset with county and complaint variables. I want to get the top 5 complaints by county.
E.g. County Complaint Count
ABC sick 100
DEF Sick 200
ABC Chest Pain 200
DEF Chest Pain 300
ABC Heart problem 500
I want to get 10 5 complaints for each county E.g for County ABC SICK, CHESTPAIN, HEARTPROBLEM, BREATHING PROBLEM and EYE PROBLEM are top 5 then I want the list of the county name with top 5 complaints and the count of it.
Can I get any help. I did proc freq and got the list of all complaint by County but not sure how to get the top 5 only for each county.
Thank you
M
Here is one way using a SAS supplied dataset so you should be able to test the code.
proc sort data=sashelp.cars out=carsort; by make descending msrp; run; data want; set carsort; retain makecount; by make; if first.make then makecount=1; else makecount+1; if makecount le 5; run;
You need not create a new set with your sort, I just don't want to mess with the supplied data set.
The sort in your case would be COUNTY instead of Make, and what every your count variable is in the second place. This groups the data with the largest count first for each county.
Then the data step using by group processing with First. to set a counter and select the records with desired order of appearance.
Hi ballardw,
Thank you for the quick response, I tried your code, it is giving me the 5 records for each county but they are not the top five. For e.g. for ABC county I have breathing (391), Unknown problem (342),sick person (317), Falls (220), chest pain (213) as top five but the output is giving me Unknown Problem (342), Unconscious (199), Trauma (15), Traffic (142) and Stroke (81). I want the top highest complaints and not just the top 5 in order.
Thank you
M
Don't write your own data step code to do this.
Use built-in code, that SAS has already created (for your benefit) and tested, and handles ties if they exist. For example
proc sort data=have;
by county;
run;
proc rank data=have out=want descending; by county; var count;
ranks count_rank;
run;
In the output data set, you want the observations with count_rank values 1 to 5.
@Malathi13 wrote:
Hi ballardw,
Thank you for the quick response, I tried your code, it is giving me the 5 records for each county but they are not the top five. For e.g. for ABC county I have breathing (391), Unknown problem (342),sick person (317), Falls (220), chest pain (213) as top five but the output is giving me Unknown Problem (342), Unconscious (199), Trauma (15), Traffic (142) and Stroke (81). I want the top highest complaints and not just the top 5 in order.
Thank you
M
Your original post said "I want to get 10 5 complaints for each county". That is what the code does if "top" means the largest count.
If you have a different requirement than be much more explicit, provide better examples of starting data and what the final result is supposed to look like.
Your list in this post is pretty much not helpful as we do not have any county identified, or what any of the counts you show.
Also, show the code you used. It looks almost like you sorted on "COMPLAINT" not the COUNT indicated you should.
Please ensure your sample data matches what you're looking for, in this case you didn't provide enough sample data.
I believe your problem is relatively straightforward though, see the code below.
Why, @Reeza, in a post earlier this week, you recommend PROC RANK and then in this post you write your own data step code to do the same thing?
Why, when you point out earlier this week that PROC RANK handles ties, do you write data step code that does not handle ties?
@PaigeMiller wrote:
Why, @Reeza, in a post earlier this week, you recommend PROC RANK and then in this post you write your own data step code to do the same thing?
Why, when you point out earlier this week that PROC RANK handles ties, do you write data step code that does not handle ties?
I recognized them as slightly different problems. At the end of the day though, the OP has multiple options here and needs to decide what's best. Because you don't know how many are in each groups, this is slightly harder - but not impossible to use PROC RANK.
When looking for Top N -> That's not a percentile measure, whereas PROC RANK is.
Top 5 to me, implies only 5 results whereas RANK would give you possibly more. You are correct in that it doesn't handle ties, which should be dealt with differently.
Ties will generate fractional ranks that can mess up your logic. For instance:
data test;
input x;
datalines;
1
2
2
2
2
2
2
2
2
2
2
3
4
5
;
proc rank data=test out=ranks;
var x; ranks rank;
run;
proc print data=ranks noobs; run;
x rank 1 1.0 2 6.5 2 6.5 2 6.5 2 6.5 2 6.5 2 6.5 2 6.5 2 6.5 2 6.5 2 6.5 3 12.0 4 13.0 5 14.0
@PGStats wrote:
Ties will generate fractional ranks that can mess up your logic. For instance:
But handling ties is a good thing when trying to figure out the top 5!!
And contrary to the example by @PGStats, there is an option in PROC RANK to say TIES=LOW, and then the logic doesn't get messed up, because in the example provided by PGStats, all the observations that have a value of 2 get a rank of 5.
When the requirement is to get 5 low (or top) values in the presence of ties, the tie handling options offered by proc rank (MEAN, LOW, HIGH) are not really useful. You can't say where rank <= 5 and get exactly 5 low (or top) values in every case.
mean low high Obs x Rank Rank Rank 1 1 1.0 1 1 2 2 6.5 2 11 3 2 6.5 2 11 4 2 6.5 2 11 5 2 6.5 2 11 6 2 6.5 2 11 7 2 6.5 2 11 8 2 6.5 2 11 9 2 6.5 2 11 10 2 6.5 2 11 11 2 6.5 2 11 12 3 12.0 12 12 13 4 13.0 13 13 14 5 14.0 14 14
We would need options TIES=RANDOM or TIES=DATA to get the expected sample size.
If there are ties when you find the fifth highest, you don't want EXACTLY 5 results, you want more than 5 results.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.