- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I have a dataset with county and complaint variables. I want to get the top 5 complaints by county.
E.g. County Complaint Count
ABC sick 100
DEF Sick 200
ABC Chest Pain 200
DEF Chest Pain 300
ABC Heart problem 500
I want to get 10 5 complaints for each county E.g for County ABC SICK, CHESTPAIN, HEARTPROBLEM, BREATHING PROBLEM and EYE PROBLEM are top 5 then I want the list of the county name with top 5 complaints and the count of it.
Can I get any help. I did proc freq and got the list of all complaint by County but not sure how to get the top 5 only for each county.
Thank you
M
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here is one way using a SAS supplied dataset so you should be able to test the code.
proc sort data=sashelp.cars out=carsort; by make descending msrp; run; data want; set carsort; retain makecount; by make; if first.make then makecount=1; else makecount+1; if makecount le 5; run;
You need not create a new set with your sort, I just don't want to mess with the supplied data set.
The sort in your case would be COUNTY instead of Make, and what every your count variable is in the second place. This groups the data with the largest count first for each county.
Then the data step using by group processing with First. to set a counter and select the records with desired order of appearance.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi ballardw,
Thank you for the quick response, I tried your code, it is giving me the 5 records for each county but they are not the top five. For e.g. for ABC county I have breathing (391), Unknown problem (342),sick person (317), Falls (220), chest pain (213) as top five but the output is giving me Unknown Problem (342), Unconscious (199), Trauma (15), Traffic (142) and Stroke (81). I want the top highest complaints and not just the top 5 in order.
Thank you
M
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Don't write your own data step code to do this.
Use built-in code, that SAS has already created (for your benefit) and tested, and handles ties if they exist. For example
proc sort data=have;
by county;
run;
proc rank data=have out=want descending; by county; var count;
ranks count_rank;
run;
In the output data set, you want the observations with count_rank values 1 to 5.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Malathi13 wrote:
Hi ballardw,
Thank you for the quick response, I tried your code, it is giving me the 5 records for each county but they are not the top five. For e.g. for ABC county I have breathing (391), Unknown problem (342),sick person (317), Falls (220), chest pain (213) as top five but the output is giving me Unknown Problem (342), Unconscious (199), Trauma (15), Traffic (142) and Stroke (81). I want the top highest complaints and not just the top 5 in order.
Thank you
M
Your original post said "I want to get 10 5 complaints for each county". That is what the code does if "top" means the largest count.
If you have a different requirement than be much more explicit, provide better examples of starting data and what the final result is supposed to look like.
Your list in this post is pretty much not helpful as we do not have any county identified, or what any of the counts you show.
Also, show the code you used. It looks almost like you sorted on "COMPLAINT" not the COUNT indicated you should.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Please ensure your sample data matches what you're looking for, in this case you didn't provide enough sample data.
I believe your problem is relatively straightforward though, see the code below.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Why, @Reeza, in a post earlier this week, you recommend PROC RANK and then in this post you write your own data step code to do the same thing?
Why, when you point out earlier this week that PROC RANK handles ties, do you write data step code that does not handle ties?
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@PaigeMiller wrote:
Why, @Reeza, in a post earlier this week, you recommend PROC RANK and then in this post you write your own data step code to do the same thing?
Why, when you point out earlier this week that PROC RANK handles ties, do you write data step code that does not handle ties?
I recognized them as slightly different problems. At the end of the day though, the OP has multiple options here and needs to decide what's best. Because you don't know how many are in each groups, this is slightly harder - but not impossible to use PROC RANK.
When looking for Top N -> That's not a percentile measure, whereas PROC RANK is.
Top 5 to me, implies only 5 results whereas RANK would give you possibly more. You are correct in that it doesn't handle ties, which should be dealt with differently.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Ties will generate fractional ranks that can mess up your logic. For instance:
data test;
input x;
datalines;
1
2
2
2
2
2
2
2
2
2
2
3
4
5
;
proc rank data=test out=ranks;
var x; ranks rank;
run;
proc print data=ranks noobs; run;
x rank 1 1.0 2 6.5 2 6.5 2 6.5 2 6.5 2 6.5 2 6.5 2 6.5 2 6.5 2 6.5 2 6.5 3 12.0 4 13.0 5 14.0
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@PGStats wrote:
Ties will generate fractional ranks that can mess up your logic. For instance:
But handling ties is a good thing when trying to figure out the top 5!!
And contrary to the example by @PGStats, there is an option in PROC RANK to say TIES=LOW, and then the logic doesn't get messed up, because in the example provided by PGStats, all the observations that have a value of 2 get a rank of 5.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
When the requirement is to get 5 low (or top) values in the presence of ties, the tie handling options offered by proc rank (MEAN, LOW, HIGH) are not really useful. You can't say where rank <= 5 and get exactly 5 low (or top) values in every case.
mean low high Obs x Rank Rank Rank 1 1 1.0 1 1 2 2 6.5 2 11 3 2 6.5 2 11 4 2 6.5 2 11 5 2 6.5 2 11 6 2 6.5 2 11 7 2 6.5 2 11 8 2 6.5 2 11 9 2 6.5 2 11 10 2 6.5 2 11 11 2 6.5 2 11 12 3 12.0 12 12 13 4 13.0 13 13 14 5 14.0 14 14
We would need options TIES=RANDOM or TIES=DATA to get the expected sample size.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If there are ties when you find the fifth highest, you don't want EXACTLY 5 results, you want more than 5 results.
Paige Miller