Re: Top 5 by observations per each county

Malathi13 · Posted 12-22-2017 11:57 AM

Hi,

I have a dataset with county and complaint variables. I want to get the top 5 complaints by county.

E.g. County Complaint Count

ABC sick 100

DEF Sick 200

ABC Chest Pain 200

DEF Chest Pain 300

ABC Heart problem 500

I want to get 10 5 complaints for each county E.g for County ABC SICK, CHESTPAIN, HEARTPROBLEM, BREATHING PROBLEM and EYE PROBLEM are top 5 then I want the list of the county name with top 5 complaints and the count of it.

Can I get any help. I did proc freq and got the list of all complaint by County but not sure how to get the top 5 only for each county.

Thank you

M

ballardw · Posted 12-22-2017 12:14 PM

Here is one way using a SAS supplied dataset so you should be able to test the code.

proc sort data=sashelp.cars out=carsort;
  by make descending msrp;
run;

data want;
  set carsort;
  retain makecount;
  by make;
  if first.make then makecount=1;
  else makecount+1;
  if makecount le 5;
run;

You need not create a new set with your sort, I just don't want to mess with the supplied data set.

The sort in your case would be COUNTY instead of Make, and what every your count variable is in the second place. This groups the data with the largest count first for each county.

Then the data step using by group processing with First. to set a counter and select the records with desired order of appearance.

Malathi13 · Posted 12-22-2017 12:33 PM

Hi ballardw,

Thank you for the quick response, I tried your code, it is giving me the 5 records for each county but they are not the top five. For e.g. for ABC county I have breathing (391), Unknown problem (342),sick person (317), Falls (220), chest pain (213) as top five but the output is giving me Unknown Problem (342), Unconscious (199), Trauma (15), Traffic (142) and Stroke (81). I want the top highest complaints and not just the top 5 in order.

Thank you

M

PaigeMiller · Posted 12-22-2017 01:05 PM

Don't write your own data step code to do this.

Use built-in code, that SAS has already created (for your benefit) and tested, and handles ties if they exist. For example

proc sort data=have;
    by county;
run;
proc rank data=have out=want descending;
    by county;
    var count;
    ranks count_rank;
run;

In the output data set, you want the observations with count_rank values 1 to 5.

--
Paige Miller

ballardw · Posted 12-22-2017 01:24 PM

@Malathi13 wrote:

Hi ballardw,

Thank you for the quick response, I tried your code, it is giving me the 5 records for each county but they are not the top five. For e.g. for ABC county I have breathing (391), Unknown problem (342),sick person (317), Falls (220), chest pain (213) as top five but the output is giving me Unknown Problem (342), Unconscious (199), Trauma (15), Traffic (142) and Stroke (81). I want the top highest complaints and not just the top 5 in order.

Thank you

M

Your original post said "I want to get 10 5 complaints for each county". That is what the code does if "top" means the largest count.

If you have a different requirement than be much more explicit, provide better examples of starting data and what the final result is supposed to look like.

Your list in this post is pretty much not helpful as we do not have any county identified, or what any of the counts you show.

Also, show the code you used. It looks almost like you sorted on "COMPLAINT" not the COUNT indicated you should.

Reeza · Posted 12-22-2017 02:53 PM

Please ensure your sample data matches what you're looking for, in this case you didn't provide enough sample data.

I believe your problem is relatively straightforward though, see the code below.

PaigeMiller · Posted 12-22-2017 08:48 PM

Why, @Reeza, in a post earlier this week, you recommend PROC RANK and then in this post you write your own data step code to do the same thing?

Why, when you point out earlier this week that PROC RANK handles ties, do you write data step code that does not handle ties?

--
Paige Miller

Reeza · Posted 12-23-2017 12:22 AM

@PaigeMiller wrote:

Why, @Reeza, in a post earlier this week, you recommend PROC RANK and then in this post you write your own data step code to do the same thing?

Why, when you point out earlier this week that PROC RANK handles ties, do you write data step code that does not handle ties?

I recognized them as slightly different problems. At the end of the day though, the OP has multiple options here and needs to decide what's best. Because you don't know how many are in each groups, this is slightly harder - but not impossible to use PROC RANK.

When looking for Top N -> That's not a percentile measure, whereas PROC RANK is.

Top 5 to me, implies only 5 results whereas RANK would give you possibly more. You are correct in that it doesn't handle ties, which should be dealt with differently.

PGStats · Posted 12-23-2017 01:29 AM

Ties will generate fractional ranks that can mess up your logic. For instance:

data test;
input x;
datalines;
1
2
2
2
2
2
2
2
2
2
2
3
4
5
;

proc rank data=test out=ranks;
var x; ranks rank;
run;


proc print data=ranks noobs; run;

PG

PaigeMiller · Posted 12-23-2017 07:25 AM

@PGStats wrote:

Ties will generate fractional ranks that can mess up your logic. For instance:

But handling ties is a good thing when trying to figure out the top 5!!

And contrary to the example by @PGStats, there is an option in PROC RANK to say TIES=LOW, and then the logic doesn't get messed up, because in the example provided by PGStats, all the observations that have a value of 2 get a rank of 5.

--
Paige Miller

PGStats · Posted 12-23-2017 03:37 PM

When the requirement is to get 5 low (or top) values in the presence of ties, the tie handling options offered by proc rank (MEAN, LOW, HIGH) are not really useful. You can't say where rank <= 5 and get exactly 5 low (or top) values in every case.

                                      mean     low    high
                          Obs    x    Rank    Rank    Rank

                            1    1     1.0      1       1
                            2    2     6.5      2      11
                            3    2     6.5      2      11
                            4    2     6.5      2      11
                            5    2     6.5      2      11
                            6    2     6.5      2      11
                            7    2     6.5      2      11
                            8    2     6.5      2      11
                            9    2     6.5      2      11
                           10    2     6.5      2      11
                           11    2     6.5      2      11
                           12    3    12.0     12      12
                           13    4    13.0     13      13
                           14    5    14.0     14      14

We would need options TIES=RANDOM or TIES=DATA to get the expected sample size.

PG

PaigeMiller · Posted 12-23-2017 05:17 PM

If there are ties when you find the fifth highest, you don't want EXACTLY 5 results, you want more than 5 results.

--
Paige Miller

Classroom Training Available!