@mkeintz,
I'm usually right there with you, trying to use the simplest and most direct tools. But this is a case where I feared PROC RANK would put us on the inexorable path toward 20 posts before hitting a solution. Here's what I expected:
Poster would actually try a PROC RANK solution, then complain that it didn't work.
Someone would post that "didn't work" is awfully vague and would request a copy of the log.
Poster would post the log, but as text so that it is difficult to read.
Someone would post instructions on the right way to post a log.
Poster would actually post the log in a readable form.
Nothing would appear to be wrong, and someone would ask the original poster why s/he insists that it didn't work.
Poster would eventually say that there is nobody assigned to percentile 1 or 2, and that the lowest salary starts with percentile 3.
I would get to post why that happens and say, "That's what you asked for."
Poster would reply that percentiles should be assigned differently...
You must get the idea by now. Let me skip some of the process and jump right to the issue.
PROC RANK processes the 0 salary values into percentile 1. Once this statement runs, nobody is left in the first percentile: if salary = 0 then percentile = 0;
Percentile 2 might be a little light as well.
I was imagining an approach where percentiles get assigned based only on the positive salaries (still assigning salary=0 to percentile 0). This could easily be achieved by cleaning the data now: if salary = 0 then salary = .; But for some reason, it seems the original poster is not allowed to do this. Cleaning the data first would solve for:
percentile assignment using a simple PROC RANK
detecting other bad data. For example, if one data entry person used salary=0 for missing values, perhaps another used salary=-999.
duplicate entries for the same person. If the current form of the data is acceptable, I'm not going to try to explain what happens in a many-to-many merge.
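For what it's worth, here is a rough, untested sketch of the "clean first, then a simple PROC RANK" idea. The data set name HAVE, the output names, and the +1 shift (so that ranked groups run 1 to 100 instead of PROC RANK's native 0 to 99) are my assumptions, not anything taken from the original poster's program.

```sas
/* Untested sketch. HAVE, CLEAN, RANKED and WANT are hypothetical names. */
data clean;
   set have;
   if salary = 0 then salary = .;   /* treat 0 as a missing salary */
run;

proc rank data=clean out=ranked groups=100;
   var salary;
   ranks percentile;   /* groups are numbered 0-99; missing salaries get a missing rank */
run;

data want;
   set ranked;
   if missing(salary) then percentile = 0;   /* the former salary=0 rows */
   else percentile = percentile + 1;         /* assumed shift to 1-100 */
run;
```

Note that any salary that was already missing in HAVE would also land in percentile 0 here, and a value like -999 would still be ranked, so the other cleaning checks still matter.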
I'm not claiming that my posted solution is best or even that it works. Like you, I don't have any data to use to test it. Unlike you, I haven't had SAS available for a few years. (No, I'm not in jail, just not motivated to fiddle with my ancient desktop machine.) Once the data is clean, another viable approach (even with 0 representing missing values) might be to separate the data into two sets. One holds salary=0 observations, and one holds salary > 0 observations. Run the PROC RANK on the salary > 0 observations, then put the groups back together again.
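The split-and-recombine idea might look something like the following. Again, this is an untested sketch: HAVE and the other data set names are made up, and the +1 shift to get percentiles 1 to 100 is an assumption about what the original poster wants.

```sas
/* Untested sketch. All data set names are hypothetical. */
data zero positive;
   set have;
   if salary = 0 then output zero;
   else if salary > 0 then output positive;
   /* anything else (missing, negative) is dropped here and should be reviewed */
run;

proc rank data=positive out=positive_ranked groups=100;
   var salary;
   ranks percentile;   /* groups are numbered 0-99 */
run;

data want;
   set zero (in=in_zero) positive_ranked;
   if in_zero then percentile = 0;      /* salary=0 rows go to percentile 0 */
   else percentile = percentile + 1;    /* assumed shift to 1-100 */
run;
```

Because the zeros never enter PROC RANK, the positive salaries fill percentiles starting at 1 instead of leaving the bottom groups empty.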
Anyway, we'll see where this journey goes. Best of luck to all of us along the way.