I ran into a challenge, hope experts can advise.
In a data set, we have case IDs (1, 2, etc), and the next column is dates of accident:
1.00 06/04/2008
1.00 06/08/2008
2.00 06/14/2008
3.00 08/23/2008
3.00 08/23/2008
4.00 08/23/2008
Here, if the date is the same withing the same ID, it is the same accident. So for ID=3, it will be the same accident, just 2 records for different aspects of the accident.
We need to assign a number to each unique accident. So for ID=1, it will be 1 and 2. For ID=2, will be 1. For ID=3, 1 and 2. And for ID=4, 1.
The issue is that when I use a rank function to rank dates by ID, the result is weird. For some cases, the rank is 11.5, 23.5 etc.
Is there a way to make it work so that for each case ID it will rank order just the dates within that ID?
Did you read the documentation of PROC RANK?
specifies how to compute normal scores or ranks for tied data values.
assigns the largest of the corresponding ranks (or largest of the normal scores when NORMAL= is specified).
assigns the smallest of the corresponding ranks (or smallest of the normal scores when NORMAL= is specified).
assigns the mean of the corresponding rank (or mean of the normal scores when NORMAL= is specified).
computes scores and ranks by treating tied values as a single-order statistic. For the default method, ranks are consecutive integers that begin with the number one and end with the number of unique, nonmissing values of the variable that is being ranked. Tied values are assigned the same rank.
Just a simple numbering of the cases should work. Just increment the number every time you see a new date.
data want;
set have ;
by id date;
if first.id then casenum=0;
casenum + first.date ;
run;
It sounds like you used proc rank which will deal with ties in a number of ways depending on the options chosen. The default is MEAN which would tend to generate values with a .5 when there are two identical dates. One suspect that you did not use a BY ID for the range of values mention.
Perhaps: May need to sort by ID and date if not actually so in your data.
data have; input id date :mmddyy10.; format date mmddyy10.; datalines; 1.00 06/04/2008 1.00 06/08/2008 2.00 06/14/2008 3.00 08/23/2008 3.00 08/23/2008 4.00 08/23/2008 ; proc rank data=have out=ranked ties=low; by id; ranks daterank; var date ; run;
Note your description "So for ID=3, it will be the same accident, just 2 records for different aspects of the accident.
.... For ID=3, 1 and 2." seems inconsistent. Wouldn't "same accident" be the same rank?
Thank you! Yes, you are right - for ID=3, should be 1 for both dates.
I tried your solution. It gets rid of .5 values. However, it does not assign the values consecutively, which is evident when the number of accidents is higher:
ID date rank
10.00 09/19/2006 1.00
10.00 09/19/2006 1.00
10.00 10/17/2006 5.00
10.00 10/17/2006 5.00
10.00 10/30/2006 8.00
10.00 10/30/2006 8.00
10.00 11/06/2006 11.00
10.00 11/06/2006 11.00
10.00 12/13/2006 13.00
10.00 12/13/2006 13.00
Instead, we need:
ID date rank
10.00 09/19/2006 1.00
10.00 09/19/2006 1.00
10.00 10/17/2006 2.00
10.00 10/17/2006 2.00
10.00 10/30/2006 3.00
10.00 10/30/2006 3.00
10.00 10/30/2006 3.00
10.00 11/06/2006 4.00
10.00 11/06/2006 4.00
10.00 12/13/2006 5.00
10.00 12/13/2006 5.00
So for ID=10 there were 5 distinct accidents in total, and should be numbered 1-5, not 1-13.
Is there a way to do that?
Did you read the documentation of PROC RANK?
specifies how to compute normal scores or ranks for tied data values.
assigns the largest of the corresponding ranks (or largest of the normal scores when NORMAL= is specified).
assigns the smallest of the corresponding ranks (or smallest of the normal scores when NORMAL= is specified).
assigns the mean of the corresponding rank (or mean of the normal scores when NORMAL= is specified).
computes scores and ranks by treating tied values as a single-order statistic. For the default method, ranks are consecutive integers that begin with the number one and end with the number of unique, nonmissing values of the variable that is being ranked. Tied values are assigned the same rank.
It did work with the Dense option produced what we wanted.
thank you so much, I greatly appreciate your help!!!
@Nia2023 wrote:
Thank you! Yes, you are right - for ID=3, should be 1 for both dates.
I tried your solution. It gets rid of .5 values. However, it does not assign the values consecutively, which is evident when the number of accidents is higher:
ID date rank
10.00 09/19/2006 1.00
10.00 09/19/2006 1.00
10.00 10/17/2006 5.00
10.00 10/17/2006 5.00
10.00 10/30/2006 8.00
10.00 10/30/2006 8.00
10.00 11/06/2006 11.00
10.00 11/06/2006 11.00
10.00 12/13/2006 13.00
10.00 12/13/2006 13.00
Instead, we need:
ID date rank
10.00 09/19/2006 1.00
10.00 09/19/2006 1.00
10.00 10/17/2006 2.00
10.00 10/17/2006 2.00
10.00 10/30/2006 3.00
10.00 10/30/2006 3.0010.00 10/30/2006 3.00
10.00 11/06/2006 4.00
10.00 11/06/2006 4.00
10.00 12/13/2006 5.00
10.00 12/13/2006 5.00
So for ID=10 there were 5 distinct accidents in total, and should be numbered 1-5, not 1-13.
Is there a way to do that?
Do please compare your first example with this data as well. You can oversimplify as in this case. Also, any time you get a result that doesn't match your expectations then show the CODE that you use as well as data.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.