This is a subset of a larger data. The data was generated by a biochemist and I have an issue with the subjects' IDs.
"METAPH" is the name of the study. Subjects are measured at 3 occasions (S1-1, S3-1 and S5-1). The subject's ID is as follow: occasion-study'name-subject. For example the first subject has 3 IDs corresponding to the 3 measurements:
S1-1-METAPH-002
S3-1-METAPH-002
S5-1-METAPH-002
Data looks like it was manually entered:
SUBJECT
S1-1-METAPH-002
S1-1-METAPH-019
S1-1-METAPH-089
S1-1-METAPH-027
S1-1-METAPH-006
S1-1-METAPH-007
S1-1-METAPH-008
S1-1-METAPH-003
S3-1-METAPH-027
S3-1-METAPH-008
S3-1-METAPH-089
S3-1-METAPH-003
S3-1-METAPH-002
S3-1-METAPH-006
S3-1-METAPH-007
S3-1-METAPH-019
S5-1-METAPH-002
S5-1-METAPH-003
S5-1-METAPH-027
S5-1-METAPH-089
S5-1-METAPH-006
S5-1-METAPH-019
S5-1-METAPH-008
S5-1-METAPH-007
I would like to replace these 3 IDs by only one, which is 1, and the data to look like:
1
1
1
2
2
2
...
Thank you
If I understand you correctly, you want to get the last 3 digits from the subject variable, which identifies the patient. One way is this:
data want;
set have;
length id $3;
id=scan(subject,-1,'-');
run;
I don't understand this. Why 1, 1, 1 and 2, 2, 2 ?
As you can see the IDs are not sequential perhaps some subjects were exluded from the study. I would like to identify subjects by their numbers in the study and not by this long and confusing ID. I have another column that specifies the time of measurement (S1,S3 and S5) which I can use in my modeling. For example these 3 measurements of patient 002 I want to have 3 records called 002. In other word, I want to leave only the three digits to the far right of this long ID. I hope I answered your question.
Is that the only pattern that we need to be aware of? In other words, is it S[digit]-[number]? If so, this is a bit clunky but works for this instance.
data have;
input subject $20.;
datalines;
S1-1-METAPH-002
S1-1-METAPH-019
S1-1-METAPH-089
S1-1-METAPH-027
S1-1-METAPH-006
S1-1-METAPH-007
S1-1-METAPH-008
S1-1-METAPH-003
S3-1-METAPH-027
S3-1-METAPH-008
S3-1-METAPH-089
S3-1-METAPH-003
S3-1-METAPH-002
S3-1-METAPH-006
S3-1-METAPH-007
S3-1-METAPH-019
S5-1-METAPH-002
S5-1-METAPH-003
S5-1-METAPH-027
S5-1-METAPH-089
S5-1-METAPH-006
S5-1-METAPH-019
S5-1-METAPH-008
S5-1-METAPH-007
;
run;
proc sort
data = have;
by subject;
run;
data have_2;
set have;
id = catx("-", scan(subject, 1, '-'), scan(subject, 2, '-'));
run;
proc sort
data = have_2;
by id;
run;
data want;
set have_2;
by id;
retain new_id;
if first.id then new_id + 1;
run;
Obs subject id new_id 1 S1-1-METAPH-002 S1-1 1 2 S1-1-METAPH-003 S1-1 1 3 S1-1-METAPH-006 S1-1 1 4 S1-1-METAPH-007 S1-1 1 5 S1-1-METAPH-008 S1-1 1 6 S1-1-METAPH-019 S1-1 1 7 S1-1-METAPH-027 S1-1 1 8 S1-1-METAPH-089 S1-1 1 9 S3-1-METAPH-002 S3-1 2 10 S3-1-METAPH-003 S3-1 2 11 S3-1-METAPH-006 S3-1 2 12 S3-1-METAPH-007 S3-1 2 13 S3-1-METAPH-008 S3-1 2 14 S3-1-METAPH-019 S3-1 2 15 S3-1-METAPH-027 S3-1 2 16 S3-1-METAPH-089 S3-1 2 17 S5-1-METAPH-002 S5-1 3 18 S5-1-METAPH-003 S5-1 3 19 S5-1-METAPH-006 S5-1 3 20 S5-1-METAPH-007 S5-1 3 21 S5-1-METAPH-008 S5-1 3 22 S5-1-METAPH-019 S5-1 3 23 S5-1-METAPH-027 S5-1 3 24 S5-1-METAPH-089 S5-1 3
I didn't well explain what I needed. As you put it: S[digit]-[number]:
I needed only to keep the "number". I have another column in the dataset to control when measurement was taken. This is longitudinal study so I should have 3 identical ID number per subjet.
Alright, I misunderstood your question. Do you mind replacing the `.` in the id field in this DATALINES statement with exactly what you want?
data want;
input subject :$20. id;
datalines;
S1-1-METAPH-002 .
S1-1-METAPH-019 .
S1-1-METAPH-089 .
S1-1-METAPH-027 .
S1-1-METAPH-006 .
S1-1-METAPH-007 .
S1-1-METAPH-008 .
S1-1-METAPH-003 .
S3-1-METAPH-027 .
S3-1-METAPH-008 .
S3-1-METAPH-089 .
S3-1-METAPH-003 .
S3-1-METAPH-002 .
S3-1-METAPH-006 .
S3-1-METAPH-007 .
S3-1-METAPH-019 .
S5-1-METAPH-002 .
S5-1-METAPH-003 .
S5-1-METAPH-027 .
S5-1-METAPH-089 .
S5-1-METAPH-006 .
S5-1-METAPH-019 .
S5-1-METAPH-008 .
S5-1-METAPH-007 .
;
run;
You can copy this code, and in your reply use the `insert SAS code` button and replace the `.` with your desired values.
This will help us figure out the solution.
Sorry for not explaining straight to the point what I needed. It was so simple to say I want to keep the last 3 digits.
If I understand you correctly, you want to get the last 3 digits from the subject variable, which identifies the patient. One way is this:
data want;
set have;
length id $3;
id=scan(subject,-1,'-');
run;
Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.