Hi all,
I have a variable called subject_ID composed of Site ID (first one or two number are Site and the last 4 digits are subject ID) and saved as text and I want to add a "-" in between, like 12-0006, 9-0011 or 22-0001 for example, any idea is greatly appreciated!
Subject_ID
120006 |
120004 |
120003 |
120002 |
120005 |
90011 |
90010 |
90006 |
90001 |
90009 |
220001 |
220019 |
220017 |
If the variable is character what is the current defined length of the variable? What is the longest current value in the variable?
If the two numbers above match then you will need to do something to increase the length of the variable because inserting would make the variable too long to fit in the current definition.
Something like:
data want; length subject_id $ 7; set have; if length(subject_id)=6 then subject_id = catx('-',substr(subject_id,1,2),substr(subject_id,3) ;
else if length(subject_id)=5 then subject_id = catx('-',substr(subject_id,1,1),substr(subject_id,2) ; run;
I use 7 above as the longest value you showed was 6 characters. If the actual defined length is longer than the longest value the length statement would not be needed.
Edited to address the 5 character length option I missed.
If your variable is NUMERIC then you will need to create an entirely new variable as you cannot have - as you would intend. I suspect I might be able to create a format that would display the current numeric as desired but that sounds fragile.
Who combined these values? For a large number of purposes it may be easier to use Site and the 4 digits (or other identifier) as separate variables (of course if they are already in the data let us know).
Is Subject_ID a numeric or character variable?
@PaigeMiller character
So is there a rule that would allow us to separate the cases where you have a single digit before the dash, and the cases where you have two digits before the dash?
@PaigeMiller the last 4 digits are the patient ID, it always has 4 didgits, the first 1 or 2 digits are the site iD, when the sites run over 10, it becomes 2 digits. So basically it is to insert a "-" in the 5th place from the right, hope this is clear
@zimcom wrote:
@PaigeMiller the last 4 digits are the patient ID, it always has 4 didgits, the first 1 or 2 digits are the site iD, when the sites run over 10, it becomes 2 digits. So basically it is to insert a "-" in the 5th place from the right, hope this is clear
data want;
set have;
length new_subject_id $ 8;
len=length(subject_id);
new_subject_id=catx('-',substr(subject_id,1,len-4),substr(subject_id,len-3));
run;
Or code as below. Eventually add some check on the length of the source string to deal with incomplete source data.
data have(drop=_:);
infile datalines truncover;
input Subject_ID $;
length want $ 100;
_len=lengthn(Subject_ID);
want=substrn(Subject_ID,1,_len-4)||'-'||substrn(Subject_ID,_len-3);
cards;
87654321
7654321
654321
54321
4321
321
21
1
;
data have;
input Subject_ID $;
cards;
120006
120004
120003
120002
120005
90011
90010
90006
90001
90009
220001
220019
220017
;
data want;
set have;
temp=indexc(Subject_ID,'0');
want=catx('-',substr(Subject_ID,1,temp-1),substr(subject_id,temp));
drop temp;
run;
I made some silly assumptions though 🙂
@novinosrin That works if the 0 is the delimiter where the dash goes. Otherwise not.
@PaigeMiller Yes you are right.
@zimcom The following works with no assumptions
data have;
input Subject_ID $;
cards;
120006
120004
120003
120002
120005
90011
90010
90006
90001
90009
220001
220019
220017
;
data want;
set have;
temp=substr(Subject_ID,length(Subject_ID)-4+1,4);
temp1=substr(Subject_ID,1,length(Subject_ID)-4);
want=catx('-',temp1,temp);
drop temp:;
run;
If the variable is character what is the current defined length of the variable? What is the longest current value in the variable?
If the two numbers above match then you will need to do something to increase the length of the variable because inserting would make the variable too long to fit in the current definition.
Something like:
data want; length subject_id $ 7; set have; if length(subject_id)=6 then subject_id = catx('-',substr(subject_id,1,2),substr(subject_id,3) ;
else if length(subject_id)=5 then subject_id = catx('-',substr(subject_id,1,1),substr(subject_id,2) ; run;
I use 7 above as the longest value you showed was 6 characters. If the actual defined length is longer than the longest value the length statement would not be needed.
Edited to address the 5 character length option I missed.
If your variable is NUMERIC then you will need to create an entirely new variable as you cannot have - as you would intend. I suspect I might be able to create a format that would display the current numeric as desired but that sounds fragile.
Who combined these values? For a large number of purposes it may be easier to use Site and the 4 digits (or other identifier) as separate variables (of course if they are already in the data let us know).
The solution from @ballardw fails for the case where the Subject_ID is 90011
data have;
input Subject_ID $;
length want $ 100;
want=prxchange('s/(\d{4})$/-$1/',1,strip(Subject_ID));
cards;
120006
120004
120003
120002
120005
90011
90010
90006
90001
90009
220001
220019
220017
;
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.