BookmarkSubscribeRSS Feed
Ronein
Meteorite | Level 14

Hello

I want to create a new variable that classify each customerID into a group between 1 and 100 by percentiles of salary amount.

Please note that I want that the percentiles will be calculated only on cusomers with positive salary (customers with salary zero will not influence on the percentiles calculation). So for example:

customers with salary 0 be in group 0

customers with salary greater than 0 and lower equal to percentile 1 will be in group 1

customers with salary greater than percentile 1 and lower equal to percentile 2 will be in group 2

and so on

customers with salary greater than percentile 99 and lower equal to percentile 100( max) will be in group 100

 

What is the way to do it please

 

 

 

9 REPLIES 9
mkeintz
PROC Star

@Ronein 

 

Nice question.  Where is the sample data, to help us help you?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mkeintz
PROC Star

Without getting into a coding example, why not run PROC RANK (with groups=100) on the subset of salaries > 0, which can yield an output dataset with ranks of 0 through 99?

 

Merge that data with the original data, increasing rank by 1 for the positive salary subset, and assign a rank of zero to cases with salary=0.  You would then have rank values of 00 through 100.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Astounding
PROC Star

Here's a question you will need to address.  The methods you choose (whether PROC RANK or any other method) will not let you avoid answering this question. 

 

How do you want to handle ties?  To illustrate, let's say you have 100,000 observations.  But 5,000 of them have the exact same score.  How do you distribute them among percentiles?  Will all 5,000 go into the same percentile (leaving some other percentile with 0 observations)?  Or will you assign some of the 5,000 to one percentile and some of the 5,000 with the exact same score to a different percentile?

Ronein
Meteorite | Level 14

Hello,

Yes-  5,000 will go into the same percentile (leaving some other percentile with 0 observations)

What is the way to do it?

Astounding
PROC Star

Here's a reasonable way:

proc sort data=have;
by salary;
run;
proc summary data=have;
   where salary > 0;
   output out=n_salaries n=n_salaries;
run;
data want;
   if _n_=1 then set n_salaries;
   set have nobs=_nobs_;
   by salary;
   retain group;
   if salary > 0 then n + 1;
   if first.salary then do;
      if salary <= 0 then group = 0;
      else group = ceil(100 * n/n_salaries);
   end;
run;

It's untested, since there's no data.  But it should be in the ballpark.  It uses GROUP to represent the percentile, and it does allow a largest percentile of 100.

mkeintz
PROC Star

@Ronein wrote:

Hello,

Yes-  5,000 will go into the same percentile (leaving some other percentile with 0 observations)

What is the way to do it?


And which percentile will ties be assigned: lower bound?  upper bound? mid-point?

 

Let's say you want the mid-point, and data is sorted by ID:

 

proc rank data=have (keep=id salary  where=(salary>0));
  groups=100 ties=mean out=need;
  var salary;
  ranks salary_pctile;
run;

data want;
  merge have need;
  by id;
  if salary>0 then salary_pctile=salary_pctile+1;
  else if salary=0 then salary_pctile=0;
run; 

Untested in the absence of sample data.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Astounding
PROC Star

@mkeintz ,

 

I'm usuallly right there with you, trying to use the simplest and most direct tools.  But this is a case where I feared PROC RANK would put us on the inexorable path toward 20 posts before hitting a solution.  Here's what I expected:

 

  1. Poster would actually try a PROC RANK solution, then complain that it didn't work.
  2. Someone would post that "didn't work" is awfully vague and would request a copy of the log.
  3. Poster would post the log, but as text so that it is difficult to read.
  4. Someone would post instructions on the right way to post a log.
  5. Poster would actually post the log in a readable form.
  6. Nothing would appear to be wrong, and someone would ask the original poster why s/he insists that it didn't work.
  7. Poster would eventually say that there is nobody assigned to percentile 1 or 2, and that the lowest salary starts with percentile 3.
  8. I would get to post why that happens and say, "That's what you asked for."
  9. Poster would reply that percentiles should be assigned differently...

You must get the idea by now.  Let me skip some of the process and jump right to the issue.

PROC RANK processes the 0 salary values into percentile 1.  Once this statement runs, nobody is left in the first percentile:  if salary = 0 then percentile = 0;

Percentile 2 might be a little light as well.

I was imagining an approach where percentiles get assigned based only on the positive salaries (still assigning salary=0 to percentile 0).  This could easily be achieved by cleaning the data now:  if salary = 0 then salary = .;  But for some reason, it seems the original poster is not allowed to do this.  Cleaning the data first would solve for:

  • percentile assignment using a simple PROC RANK
  • detecting other bad data.  For example if one data entry person used salary=0 for missing values, perhaps another used salary=-999.
  • duplicate entries for the same person.  If the current form of the data is acceptable, I'm not going to try to explain what happens in a many-to-many merge

I'm not claiming that my posted solution is best or even that it works.  LIke you, I don't have any data to use to test it.  Unlike you, I haven't had SAS available for a few years.  (No, I'm not in jail, just not motivated to fiddle with my ancient desktop machine.)  Once the data is clean, another viable approach (even with 0 representing missing values) might be to separate the data into two sets.  One holds salary=0 observations, and one holds salary > 0 observations.  Run the PROC RANK on the salary > 0 observations, then put the groups back together again.

Anyway, we'll see where this journey goes.  Best of luck to all of us along the way.

mkeintz
PROC Star

@Astounding 

 

"PROC RANK processes the 0 salary values into percentile 1. "

   and

"Run the PROC RANK on the salary > 0 observations, then put the groups back together again."

 

The latter is what my code effectively does using the "where=(salary>0)". filter to the data set submitted to PROC RANK, which of course necessitates the subsequent recovery of the unranked observations.   But instead of a separate WHERE statement it's coded as a data set name parameter.  I suspect you may have missed that.  Welcome to a club I frequently visit.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Astounding
PROC Star

Of course you're right here.  Time to invest on my side:  a new pair of bifocals and a new coffee maker.  Happy 2025 to all.

sas-innovate-wordmark-2025-midnight.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 1168 views
  • 5 likes
  • 3 in conversation