Hello
I want to create a new variable that classify each customerID into a group between 1 and 100 by percentiles of salary amount.
Please note that I want that the percentiles will be calculated only on cusomers with positive salary (customers with salary zero will not influence on the percentiles calculation). So for example:
customers with salary 0 be in group 0
customers with salary greater than 0 and lower equal to percentile 1 will be in group 1
customers with salary greater than percentile 1 and lower equal to percentile 2 will be in group 2
and so on
customers with salary greater than percentile 99 and lower equal to percentile 100( max) will be in group 100
What is the way to do it please
Nice question. Where is the sample data, to help us help you?
Without getting into a coding example, why not run PROC RANK (with groups=100) on the subset of salaries > 0, which can yield an output dataset with ranks of 0 through 99?
Merge that data with the original data, increasing rank by 1 for the positive salary subset, and assign a rank of zero to cases with salary=0. You would then have rank values of 00 through 100.
Here's a question you will need to address. The methods you choose (whether PROC RANK or any other method) will not let you avoid answering this question.
How do you want to handle ties? To illustrate, let's say you have 100,000 observations. But 5,000 of them have the exact same score. How do you distribute them among percentiles? Will all 5,000 go into the same percentile (leaving some other percentile with 0 observations)? Or will you assign some of the 5,000 to one percentile and some of the 5,000 with the exact same score to a different percentile?
Hello,
Yes- 5,000 will go into the same percentile (leaving some other percentile with 0 observations)
What is the way to do it?
Here's a reasonable way:
proc sort data=have;
by salary;
run;
proc summary data=have;
where salary > 0;
output out=n_salaries n=n_salaries;
run;
data want;
if _n_=1 then set n_salaries;
set have nobs=_nobs_;
by salary;
retain group;
if salary > 0 then n + 1;
if first.salary then do;
if salary <= 0 then group = 0;
else group = ceil(100 * n/n_salaries);
end;
run;
It's untested, since there's no data. But it should be in the ballpark. It uses GROUP to represent the percentile, and it does allow a largest percentile of 100.
@Ronein wrote:
Hello,
Yes- 5,000 will go into the same percentile (leaving some other percentile with 0 observations)
What is the way to do it?
And which percentile will ties be assigned: lower bound? upper bound? mid-point?
Let's say you want the mid-point, and data is sorted by ID:
proc rank data=have (keep=id salary where=(salary>0));
groups=100 ties=mean out=need;
var salary;
ranks salary_pctile;
run;
data want;
merge have need;
by id;
if salary>0 then salary_pctile=salary_pctile+1;
else if salary=0 then salary_pctile=0;
run;
Untested in the absence of sample data.
@mkeintz ,
I'm usuallly right there with you, trying to use the simplest and most direct tools. But this is a case where I feared PROC RANK would put us on the inexorable path toward 20 posts before hitting a solution. Here's what I expected:
You must get the idea by now. Let me skip some of the process and jump right to the issue.
PROC RANK processes the 0 salary values into percentile 1. Once this statement runs, nobody is left in the first percentile: if salary = 0 then percentile = 0;
Percentile 2 might be a little light as well.
I was imagining an approach where percentiles get assigned based only on the positive salaries (still assigning salary=0 to percentile 0). This could easily be achieved by cleaning the data now: if salary = 0 then salary = .; But for some reason, it seems the original poster is not allowed to do this. Cleaning the data first would solve for:
I'm not claiming that my posted solution is best or even that it works. LIke you, I don't have any data to use to test it. Unlike you, I haven't had SAS available for a few years. (No, I'm not in jail, just not motivated to fiddle with my ancient desktop machine.) Once the data is clean, another viable approach (even with 0 representing missing values) might be to separate the data into two sets. One holds salary=0 observations, and one holds salary > 0 observations. Run the PROC RANK on the salary > 0 observations, then put the groups back together again.
Anyway, we'll see where this journey goes. Best of luck to all of us along the way.
"PROC RANK processes the 0 salary values into percentile 1. "
and
"Run the PROC RANK on the salary > 0 observations, then put the groups back together again."
The latter is what my code effectively does using the "where=(salary>0)". filter to the data set submitted to PROC RANK, which of course necessitates the subsequent recovery of the unranked observations. But instead of a separate WHERE statement it's coded as a data set name parameter. I suspect you may have missed that. Welcome to a club I frequently visit.
Of course you're right here. Time to invest on my side: a new pair of bifocals and a new coffee maker. Happy 2025 to all.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.