- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
hi all,
it has come to my attention that there is a dataset I have that has duplicate IDs, but each duplicate is actually a different patient. They are character IDs and I'd like to assign letters (a's, b's, & c's) to them to differentiate them so that I can eventually match merge different files into this main dataset. Is there a code that can assign some sort of value to duplicate IDs?
Data have:
Studyid Age Sex HLOS
ID-1036 56 F 12
ID-1037 60 M 4
ID-1036 80 M 10
ID-1037 90 F 3
ID-1036 30 F 2
ID-1037 40 M 5
Data want:
Studyid Age Sex HLOS
ID-1036a 56 F 12
ID-1037a 60 M 4
ID-1036b 80 M 10
ID-1037b 90 F 3
ID-1036c 30 F 2
ID-1037b 40 M 5
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @stancemcgraw,
As you can see from @ballardw's suggestion, this is no problem technically. But before taking a single step towards changing the IDs, I would rather take ten steps to investigate why there are duplicates (check source data, read documentation, ask co-workers, etc.).
IDs, in particular patient IDs, are crucial and must not be changed in an ad-hoc manner. They are likely to occur in several datasets and are typically used as key variables to join tables. (That is, a change in one dataset would require consistent changes in other datasets.) There is also a risk of incorrectly splitting the observations from a single patient by assigning different IDs. Age can change over time, errors in the data are possible and the same combination of Age and Sex may or may not belong to different patients.
One possible reason for duplicate IDs is that only the combination of two (or more) key variables is unique. For example, in multi-center clinical trials it is common to use the combination of center ID and patient number as a unique key on patient level. Duplicates in character variables can also result from truncation in an earlier step: What if you notice that all duplicates start with "ID-10," whereas the "ID-9..." cases are unique in a Studyid variable with length 7?
ID-9998 ID-9999 ... ID-10361 ID-10370 ID-10364 ID-10372 ID-10369 ID-10375
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
What is the assigned length of the variable in question? If the length is 7 then you can't append any suffix to a value like ID-1036 without creating a new variable as it just would not keep the suffix.
The typical approach would be to sort by the variable and then increment a counter.
Data have; input Studyid $ Age Sex $ HLOS; datalines; ID-1036 56 F 12 ID-1037 60 M 4 ID-1036 80 M 10 ID-1037 90 F 3 ID-1036 30 F 2 ID-1037 40 M 5 ; proc sort data=have; by studyid; run; data want; set have (rename=(studyid=oldid)); length studyid $ 10; /*< this should probably be at least 1 if not 2 characters longer*/ retain counter 0; by oldid; if first.oldid then counter=0; else counter+1; studyid = cats(oldid,byte(counter+rank('a'))); drop counter oldid; run;
The BYTE function returns a single character from the character set, the RANK function returns the position in a character set of a specific character. So adding 0 to the position of 'a' allows returning the character 'a', adding 1 yields 'b', and so forth. IF you have more than 26 duplicates you are going to get some odd suffixes. The Cats function does the appending.
If you want to keep the order then you need to add an order variable before the sort and then sort by that order variable after the want data step. I'll leave that as an exercise for the interested reader. [It's on this forum in dozens of places]
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @stancemcgraw,
As you can see from @ballardw's suggestion, this is no problem technically. But before taking a single step towards changing the IDs, I would rather take ten steps to investigate why there are duplicates (check source data, read documentation, ask co-workers, etc.).
IDs, in particular patient IDs, are crucial and must not be changed in an ad-hoc manner. They are likely to occur in several datasets and are typically used as key variables to join tables. (That is, a change in one dataset would require consistent changes in other datasets.) There is also a risk of incorrectly splitting the observations from a single patient by assigning different IDs. Age can change over time, errors in the data are possible and the same combination of Age and Sex may or may not belong to different patients.
One possible reason for duplicate IDs is that only the combination of two (or more) key variables is unique. For example, in multi-center clinical trials it is common to use the combination of center ID and patient number as a unique key on patient level. Duplicates in character variables can also result from truncation in an earlier step: What if you notice that all duplicates start with "ID-10," whereas the "ID-9..." cases are unique in a Studyid variable with length 7?
ID-9998 ID-9999 ... ID-10361 ID-10370 ID-10364 ID-10372 ID-10369 ID-10375