BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
stancemcgraw
Obsidian | Level 7

hi all,

   it has come to my attention that there is a dataset I have that has duplicate IDs, but each duplicate is actually a different patient. They are character IDs and I'd like to assign letters (a's, b's, & c's) to them to differentiate them so that I can eventually match merge different files into this main dataset. Is there a code that can assign some sort of value to duplicate IDs?

 

Data have:

Studyid      Age      Sex      HLOS

ID-1036       56        F           12

ID-1037       60       M             4

ID-1036       80       M            10

ID-1037       90      F             3

ID-1036        30     F             2

ID-1037       40      M             5

 

Data want:

Studyid      Age      Sex      HLOS

ID-1036a       56        F           12

ID-1037a       60       M             4

ID-1036b       80       M            10

ID-1037b       90      F             3

ID-1036c       30     F             2

ID-1037b        40      M             5

 

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hi @stancemcgraw,

 

As you can see from @ballardw's suggestion, this is no problem technically. But before taking a single step towards changing the IDs, I would rather take ten steps to investigate why there are duplicates (check source data, read documentation, ask co-workers, etc.).

 

IDs, in particular patient IDs, are crucial and must not be changed in an ad-hoc manner. They are likely to occur in several datasets and are typically used as key variables to join tables. (That is, a change in one dataset would require consistent changes in other datasets.) There is also a risk of incorrectly splitting the observations from a single patient by assigning different IDs. Age can change over time, errors in the data are possible and the same combination of Age and Sex may or may not belong to different patients.

 

One possible reason for duplicate IDs is that only the combination of two (or more) key variables is unique. For example, in multi-center clinical trials it is common to use the combination of center ID and patient number as a unique key on patient level. Duplicates in character variables can also result from truncation in an earlier step: What if you notice that all duplicates start with "ID-10," whereas the "ID-9..." cases are unique in a Studyid variable with length 7?

ID-9998
ID-9999
...
ID-10361
ID-10370
ID-10364
ID-10372
ID-10369
ID-10375

View solution in original post

2 REPLIES 2
ballardw
Super User

What is the assigned length of the variable in question? If the length is 7 then you can't append any suffix to a value like ID-1036 without creating a new variable as it just would not keep the suffix.

 

The typical approach would be to sort by the variable and then increment a counter.

 

Data have;
input Studyid $     Age      Sex   $   HLOS;
datalines;
ID-1036       56        F           12
ID-1037       60       M             4
ID-1036       80       M            10
ID-1037       90      F             3
ID-1036        30     F             2
ID-1037       40      M             5
;

proc sort data=have;
   by studyid;
run;

data want;
  set have (rename=(studyid=oldid));
  length studyid $ 10; /*< this should probably be at least 1 if not 2 characters longer*/
  retain counter 0;
  by oldid;
  if first.oldid then counter=0;
  else counter+1;
  studyid = cats(oldid,byte(counter+rank('a')));
  drop counter oldid;
run;

The BYTE function returns a single character from the character set, the RANK function returns the position in a character set of a specific character. So adding 0 to the position of 'a' allows returning the character 'a', adding 1 yields 'b', and so forth. IF you have more than 26 duplicates you are going to get some odd suffixes. The Cats function does the appending.

 

If you want to keep the order then you need to add an order variable before the sort and then sort by that order variable after the want data step. I'll leave that as an exercise for the interested reader. [It's on this forum in dozens of places]

FreelanceReinh
Jade | Level 19

Hi @stancemcgraw,

 

As you can see from @ballardw's suggestion, this is no problem technically. But before taking a single step towards changing the IDs, I would rather take ten steps to investigate why there are duplicates (check source data, read documentation, ask co-workers, etc.).

 

IDs, in particular patient IDs, are crucial and must not be changed in an ad-hoc manner. They are likely to occur in several datasets and are typically used as key variables to join tables. (That is, a change in one dataset would require consistent changes in other datasets.) There is also a risk of incorrectly splitting the observations from a single patient by assigning different IDs. Age can change over time, errors in the data are possible and the same combination of Age and Sex may or may not belong to different patients.

 

One possible reason for duplicate IDs is that only the combination of two (or more) key variables is unique. For example, in multi-center clinical trials it is common to use the combination of center ID and patient number as a unique key on patient level. Duplicates in character variables can also result from truncation in an earlier step: What if you notice that all duplicates start with "ID-10," whereas the "ID-9..." cases are unique in a Studyid variable with length 7?

ID-9998
ID-9999
...
ID-10361
ID-10370
ID-10364
ID-10372
ID-10369
ID-10375

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 2170 views
  • 2 likes
  • 3 in conversation