DATA Step, Macro, Functions and more

Top 5 by observations per each county

Reply
Contributor
Posts: 73

Top 5 by observations per each county

Hi,

I have a dataset with county and complaint variables. I want to get the top 5 complaints by county.

 

E.g.   County       Complaint                         Count

          ABC          sick                                     100

          DEF          Sick                                   200

          ABC          Chest Pain                          200

          DEF          Chest Pain                          300

          ABC         Heart problem                    500

 

I want to get 10 5 complaints for each county  E.g for County ABC SICK, CHESTPAIN, HEARTPROBLEM, BREATHING PROBLEM and EYE PROBLEM are top 5 then I want the list of the county name with top 5 complaints and the count of it.

 

Can I get any help. I did proc freq and got the list of all complaint by County but not sure how to get the top 5 only for each county.

 

Thank you

M

Super User
Posts: 13,358

Re: Top 5 by observations per each county

Posted in reply to Malathi13

Here is one way using a SAS supplied dataset so you should be able to test the code.

proc sort data=sashelp.cars out=carsort;
  by make descending msrp;
run;

data want;
  set carsort;
  retain makecount;
  by make;
  if first.make then makecount=1;
  else makecount+1;
  if makecount le 5;
run;

You need not create a new set with your sort, I just don't want to mess with the supplied data set.

 

The sort in your case would be COUNTY instead of Make, and what every your count variable is in the second place. This groups the data with the largest count first for each county.

Then the data step using by group processing with First. to set a counter and select the records with desired order of appearance.

Contributor
Posts: 73

Re: Top 5 by observations per each county

Hi ballardw,

Thank you for the quick response, I tried your code, it is giving me the 5 records for each county but they are not the top five. For e.g. for ABC county I have breathing (391), Unknown problem (342),sick person (317),  Falls (220), chest pain (213) as top five but the output is giving me Unknown Problem (342), Unconscious (199), Trauma (15), Traffic (142) and Stroke (81). I want the top highest complaints and not just the top 5 in order.

 

 

Thank you

M

 

 

 

Respected Advisor
Posts: 2,845

Re: Top 5 by observations per each county

[ Edited ]
Posted in reply to Malathi13

Don't write your own data step code to do this.

 

Use built-in code, that SAS has already created (for your benefit) and tested, and handles ties if they exist. For example

 

proc sort data=have;
by county;
run;
proc rank data=have out=want descending; by county; var count;
ranks count_rank;
run;

In the output data set, you want the observations with count_rank values 1 to 5.

--
Paige Miller
Super User
Posts: 13,358

Re: Top 5 by observations per each county

Posted in reply to Malathi13

Malathi13 wrote:

Hi ballardw,

Thank you for the quick response, I tried your code, it is giving me the 5 records for each county but they are not the top five. For e.g. for ABC county I have breathing (391), Unknown problem (342),sick person (317),  Falls (220), chest pain (213) as top five but the output is giving me Unknown Problem (342), Unconscious (199), Trauma (15), Traffic (142) and Stroke (81). I want the top highest complaints and not just the top 5 in order.

 

 

Thank you

M

 

 


Your original post said "I want to get 10 5 complaints for each county". That is what the code does if "top" means the largest count.

If you have a different requirement than be much more explicit, provide better examples of starting data and what the final result is supposed to look like.

Your list in this post is pretty much not helpful as we do not have any county identified, or what any of the counts you show.

Also, show the code you used. It looks almost like you sorted on "COMPLAINT" not the COUNT indicated you should.

Super User
Posts: 23,357

Re: Top 5 by observations per each county

Posted in reply to Malathi13

Please ensure your sample data matches what you're looking for, in this case you didn't provide enough sample data. 

 

I believe your problem is relatively straightforward though, see the code below.

 

Screen Shot 2017-12-22 at 12.52.14 PM.png

Respected Advisor
Posts: 2,845

Re: Top 5 by observations per each county

[ Edited ]

Why, @Reeza, in a post earlier this week, you recommend PROC RANK and then in this post you write your own data step code to do the same thing?

 

Why, when you point out earlier this week that PROC RANK handles ties, do you write data step code that does not handle ties?

--
Paige Miller
Super User
Posts: 23,357

Re: Top 5 by observations per each county

Posted in reply to PaigeMiller

PaigeMiller wrote:

Why, @Reeza, in a post earlier this week, you recommend PROC RANK and then in this post you write your own data step code to do the same thing?

 

Why, when you point out earlier this week that PROC RANK handles ties, do you write data step code that does not handle ties?


I recognized them as slightly different problems. At the end of the day though, the OP has multiple options here and needs to decide what's best. Because you don't know how many are in each groups, this is slightly harder - but not impossible to use PROC RANK. 

 

When looking for Top N -> That's not a percentile measure, whereas PROC RANK is. 

Top 5 to me, implies only 5 results whereas RANK would give you possibly more. You are correct in that it doesn't handle ties, which should be dealt with differently.

 

 

 

 

Esteemed Advisor
Posts: 5,487

Re: Top 5 by observations per each county

Posted in reply to PaigeMiller

Ties will generate fractional ranks that can mess up your logic. For instance:

 

data test;
input x;
datalines;
1
2
2
2
2
2
2
2
2
2
2
3
4
5
;

proc rank data=test out=ranks;
var x; ranks rank;
run;


proc print data=ranks noobs; run;
                                    x    rank

                                     1     1.0
                                     2     6.5
                                     2     6.5
                                     2     6.5
                                     2     6.5
                                     2     6.5
                                     2     6.5
                                     2     6.5
                                     2     6.5
                                     2     6.5
                                     2     6.5
                                     3    12.0
                                     4    13.0
                                     5    14.0
PG
Respected Advisor
Posts: 2,845

Re: Top 5 by observations per each county

[ Edited ]

PGStats wrote:

Ties will generate fractional ranks that can mess up your logic. For instance:


But handling ties is a good thing when trying to figure out the top 5!!

 

And contrary to the example by @PGStats, there is an option in PROC RANK to say TIES=LOW, and then the logic doesn't get messed up, because in the example provided by PGStats, all the observations that have a value of 2 get a rank of 5.

--
Paige Miller
Esteemed Advisor
Posts: 5,487

Re: Top 5 by observations per each county

Posted in reply to PaigeMiller

When the requirement is to get 5 low (or top) values in the presence of ties, the tie handling options offered by proc rank (MEAN, LOW, HIGH) are not really useful. You can't say where rank <= 5 and get exactly 5 low (or top) values in every case.

 

                                      mean     low    high
                          Obs    x    Rank    Rank    Rank

                            1    1     1.0      1       1
                            2    2     6.5      2      11
                            3    2     6.5      2      11
                            4    2     6.5      2      11
                            5    2     6.5      2      11
                            6    2     6.5      2      11
                            7    2     6.5      2      11
                            8    2     6.5      2      11
                            9    2     6.5      2      11
                           10    2     6.5      2      11
                           11    2     6.5      2      11
                           12    3    12.0     12      12
                           13    4    13.0     13      13
                           14    5    14.0     14      14

We would need options TIES=RANDOM or TIES=DATA to get the expected sample size.

 

PG
Respected Advisor
Posts: 2,845

Re: Top 5 by observations per each county

If there are ties when you find the fifth highest, you don't want EXACTLY 5 results, you want more than 5 results.

--
Paige Miller
Ask a Question
Discussion stats
  • 11 replies
  • 224 views
  • 2 likes
  • 5 in conversation