Solved: Assign values to a duplicate ID

stancemcgraw · Posted 10-22-2020 05:21 PM

hi all,

it has come to my attention that there is a dataset I have that has duplicate IDs, but each duplicate is actually a different patient. They are character IDs and I'd like to assign letters (a's, b's, & c's) to them to differentiate them so that I can eventually match merge different files into this main dataset. Is there a code that can assign some sort of value to duplicate IDs?

Data have:

Studyid Age Sex HLOS

ID-1036 56 F 12

ID-1037 60 M 4

ID-1036 80 M 10

ID-1037 90 F 3

ID-1036 30 F 2

ID-1037 40 M 5

Data want:

Studyid Age Sex HLOS

ID-1036a 56 F 12

ID-1037a 60 M 4

ID-1036b 80 M 10

ID-1037b 90 F 3

ID-1036c 30 F 2

ID-1037b 40 M 5

FreelanceReinh · Posted 10-23-2020 04:51 AM

Hi @stancemcgraw,

As you can see from @ballardw's suggestion, this is no problem technically. But before taking a single step towards changing the IDs, I would rather take ten steps to investigate why there are duplicates (check source data, read documentation, ask co-workers, etc.).

IDs, in particular patient IDs, are crucial and must not be changed in an ad-hoc manner. They are likely to occur in several datasets and are typically used as key variables to join tables. (That is, a change in one dataset would require consistent changes in other datasets.) There is also a risk of incorrectly splitting the observations from a single patient by assigning different IDs. Age can change over time, errors in the data are possible and the same combination of Age and Sex may or may not belong to different patients.

One possible reason for duplicate IDs is that only the combination of two (or more) key variables is unique. For example, in multi-center clinical trials it is common to use the combination of center ID and patient number as a unique key on patient level. Duplicates in character variables can also result from truncation in an earlier step: What if you notice that all duplicates start with "ID-10," whereas the "ID-9..." cases are unique in a Studyid variable with length 7?

ID-9998
ID-9999
...
ID-10361
ID-10370
ID-10364
ID-10372
ID-10369
ID-10375

View solution in original post

ballardw · Posted 10-22-2020 06:48 PM

What is the assigned length of the variable in question? If the length is 7 then you can't append any suffix to a value like ID-1036 without creating a new variable as it just would not keep the suffix.

The typical approach would be to sort by the variable and then increment a counter.

Data have;
input Studyid $     Age      Sex   $   HLOS;
datalines;
ID-1036       56        F           12
ID-1037       60       M             4
ID-1036       80       M            10
ID-1037       90      F             3
ID-1036        30     F             2
ID-1037       40      M             5
;

proc sort data=have;
   by studyid;
run;

data want;
  set have (rename=(studyid=oldid));
  length studyid $ 10; /*< this should probably be at least 1 if not 2 characters longer*/
  retain counter 0;
  by oldid;
  if first.oldid then counter=0;
  else counter+1;
  studyid = cats(oldid,byte(counter+rank('a')));
  drop counter oldid;
run;

The BYTE function returns a single character from the character set, the RANK function returns the position in a character set of a specific character. So adding 0 to the position of 'a' allows returning the character 'a', adding 1 yields 'b', and so forth. IF you have more than 26 duplicates you are going to get some odd suffixes. The Cats function does the appending.

If you want to keep the order then you need to add an order variable before the sort and then sort by that order variable after the want data step. I'll leave that as an exercise for the interested reader. [It's on this forum in dozens of places]

FreelanceReinh · Posted 10-23-2020 04:51 AM

Hi @stancemcgraw,

As you can see from @ballardw's suggestion, this is no problem technically. But before taking a single step towards changing the IDs, I would rather take ten steps to investigate why there are duplicates (check source data, read documentation, ask co-workers, etc.).

IDs, in particular patient IDs, are crucial and must not be changed in an ad-hoc manner. They are likely to occur in several datasets and are typically used as key variables to join tables. (That is, a change in one dataset would require consistent changes in other datasets.) There is also a risk of incorrectly splitting the observations from a single patient by assigning different IDs. Age can change over time, errors in the data are possible and the same combination of Age and Sex may or may not belong to different patients.

One possible reason for duplicate IDs is that only the combination of two (or more) key variables is unique. For example, in multi-center clinical trials it is common to use the combination of center ID and patient number as a unique key on patient level. Duplicates in character variables can also result from truncation in an earlier step: What if you notice that all duplicates start with "ID-10," whereas the "ID-9..." cases are unique in a Studyid variable with length 7?

ID-9998
ID-9999
...
ID-10361
ID-10370
ID-10364
ID-10372
ID-10369
ID-10375

Assign values to a duplicate ID

Re: Assign values to a duplicate ID

Re: Assign values to a duplicate ID

Re: Assign values to a duplicate ID

Assign values to a duplicate ID

Re: Assign values to a duplicate ID

Re: Assign values to a duplicate ID

Re: Assign values to a duplicate ID

Click image to register for webinar

Classroom Training Available!