I currently have a dataset that looks like this:
pat_id | time | sequence |
1 | 1 | M |
1 | 2 | MS |
1 | 3 | M |
1 | 4 | M |
1 | 5 | M |
1 | 6 | M |
1 | 7 | M |
1 | 8 | M |
1 | 9 | M |
1 | 10 | M |
2 | 1 | M |
2 | 2 | MST |
2 | 3 | ST |
2 | 4 | MT |
2 | 5 | M |
2 | 6 | MS |
2 | 7 | S |
2 | 8 | S |
2 | 9 | S |
2 | 10 | S |
I am trying to make it so that whenever there is a repeat observation from line to line it only records the first observation of that value. I'm looking speciifcally at the sequence variable. For example, I would like to keep only the first three rows of data for patient 1, so I would like to basically get rid of all the 'M' observations for time 4-10.
Ultimately I would like to transpose the data and make a string variable, so I'd like to get my final dataset to look like this, but I'm hung up on the above part....
pat_id | sequence |
1 | M-MS-M |
2 | M-MST-ST-MT-M-MS-S |
Any thoughts? Thank you!
Use the notsorted option on your by statement and append the first time you see a new sequence. Output on the last patient ID.
data have;
input pat_id time sequence $;
cards;
1 1 M
1 2 MS
1 3 M
1 4 M
1 5 M
1 6 M
1 7 M
1 8 M
1 9 M
1 10 M
2 1 M
2 2 MST
2 3 ST
2 4 MT
2 5 M
2 6 MS
2 7 S
2 8 S
2 9 S
2 10 S
;
run;
data want;
set have;
by pat_id sequence notsorted;
length full_sequence $200.;
retain full_sequence;
if first.pat_id then full_sequence='';
if first.sequence then full_sequence=catx("-", full_sequence, sequence);
if last.pat_id then output;
keep pat_id full_sequence;
run;
So do you need the time field at all? If not then you can do this very simple by doing a nodupkey:
proc sort data=yourfile nodupkey out=yourfile2;
by pat_id sequence;
run;
If you need to keep that time then you could do something like this:
proc sort data=yourfile;
by sequence pat_id time;
data yourfile2;
set yourfile;
by sequence pat_id time;
if first.pat_id;
run;
thanks, sashelpwanted!
I don't really care about the exact time (ie. what exact time value), but I'm trying to keep the order. So if you look at patient 1, when I try to nodupkey option, I lose the piece when the patient starts taking 'M' again, meaning I lose what happens during time 3-10. When I try the other option you recommend, it only gives me the first observation per patient. Ideally, I would like the first observation whenever there's a change in the sequence, if that makes any sense. Thank you.
Use the notsorted option on your by statement and append the first time you see a new sequence. Output on the last patient ID.
data have;
input pat_id time sequence $;
cards;
1 1 M
1 2 MS
1 3 M
1 4 M
1 5 M
1 6 M
1 7 M
1 8 M
1 9 M
1 10 M
2 1 M
2 2 MST
2 3 ST
2 4 MT
2 5 M
2 6 MS
2 7 S
2 8 S
2 9 S
2 10 S
;
run;
data want;
set have;
by pat_id sequence notsorted;
length full_sequence $200.;
retain full_sequence;
if first.pat_id then full_sequence='';
if first.sequence then full_sequence=catx("-", full_sequence, sequence);
if last.pat_id then output;
keep pat_id full_sequence;
run;
thank you, Reeza! This worked perfectly. much appreciated!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.