I currently have a dataset that looks like this:
pat_id | time | sequence |
1 | 1 | M |
1 | 2 | MS |
1 | 3 | M |
1 | 4 | M |
1 | 5 | M |
1 | 6 | M |
1 | 7 | M |
1 | 8 | M |
1 | 9 | M |
1 | 10 | M |
2 | 1 | M |
2 | 2 | MST |
2 | 3 | ST |
2 | 4 | MT |
2 | 5 | M |
2 | 6 | MS |
2 | 7 | S |
2 | 8 | S |
2 | 9 | S |
2 | 10 | S |
I am trying to make it so that whenever there is a repeat observation from line to line it only records the first observation of that value. I'm looking speciifcally at the sequence variable. For example, I would like to keep only the first three rows of data for patient 1, so I would like to basically get rid of all the 'M' observations for time 4-10.
Ultimately I would like to transpose the data and make a string variable, so I'd like to get my final dataset to look like this, but I'm hung up on the above part....
pat_id | sequence |
1 | M-MS-M |
2 | M-MST-ST-MT-M-MS-S |
Any thoughts? Thank you!
Use the notsorted option on your by statement and append the first time you see a new sequence. Output on the last patient ID.
data have;
input pat_id time sequence $;
cards;
1 1 M
1 2 MS
1 3 M
1 4 M
1 5 M
1 6 M
1 7 M
1 8 M
1 9 M
1 10 M
2 1 M
2 2 MST
2 3 ST
2 4 MT
2 5 M
2 6 MS
2 7 S
2 8 S
2 9 S
2 10 S
;
run;
data want;
set have;
by pat_id sequence notsorted;
length full_sequence $200.;
retain full_sequence;
if first.pat_id then full_sequence='';
if first.sequence then full_sequence=catx("-", full_sequence, sequence);
if last.pat_id then output;
keep pat_id full_sequence;
run;
So do you need the time field at all? If not then you can do this very simple by doing a nodupkey:
proc sort data=yourfile nodupkey out=yourfile2;
by pat_id sequence;
run;
If you need to keep that time then you could do something like this:
proc sort data=yourfile;
by sequence pat_id time;
data yourfile2;
set yourfile;
by sequence pat_id time;
if first.pat_id;
run;
thanks, sashelpwanted!
I don't really care about the exact time (ie. what exact time value), but I'm trying to keep the order. So if you look at patient 1, when I try to nodupkey option, I lose the piece when the patient starts taking 'M' again, meaning I lose what happens during time 3-10. When I try the other option you recommend, it only gives me the first observation per patient. Ideally, I would like the first observation whenever there's a change in the sequence, if that makes any sense. Thank you.
Use the notsorted option on your by statement and append the first time you see a new sequence. Output on the last patient ID.
data have;
input pat_id time sequence $;
cards;
1 1 M
1 2 MS
1 3 M
1 4 M
1 5 M
1 6 M
1 7 M
1 8 M
1 9 M
1 10 M
2 1 M
2 2 MST
2 3 ST
2 4 MT
2 5 M
2 6 MS
2 7 S
2 8 S
2 9 S
2 10 S
;
run;
data want;
set have;
by pat_id sequence notsorted;
length full_sequence $200.;
retain full_sequence;
if first.pat_id then full_sequence='';
if first.sequence then full_sequence=catx("-", full_sequence, sequence);
if last.pat_id then output;
keep pat_id full_sequence;
run;
thank you, Reeza! This worked perfectly. much appreciated!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.