Hi,
I have a dataset with patients who were measured creatinine levels during a period of time. Somehow there were some patients were mistakenly measured more than once per day (e.g. see bolded ones below). I just would like to to keep the last measurement and remove the first measurement (duplicates).
Below is my code, but it did not remove the duplicates. Could anyone please help me?
proc sort data= test out=test1 nodupkey ;
by ID descending CREATININE_DATE;
run;
00004820 | 2014-05-25 |
00004820 | 2014-05-26 |
00004820 | 2014-05-27 |
00004820 | 2014-05-27 |
00004820 | 2014-05-28 |
00004820 | 2014-06-05 |
00004820 | 2014-06-06 |
00004820 | 2014-06-07 |
00004820 | 2014-06-08 |
00004820 | 2014-06-08 |
00004820 | 2014-06-09 |
00004887 | 2002-02-22 |
00004887 | 2002-02-23 |
00004887 | 2002-02-23 |
00004887 | 2002-02-24 |
00004887 | 2002-03-06 |
00004887 | 2002-03-14 |
00004887 | 2002-03-15 |
00004887 | 2002-03-15 |
00004887 | 2002-03-16 |
00004887 | 2002-03-16 |
00004887 | 2002-03-17 |
00004887 | 2002-03-25 |
00004887 | 2002-04-06 |
00004887 | 2002-04-06 |
00004887 | 2002-04-07 |
00004887 | 2002-04-08 |
00004887 | 2002-04-09 |
00004887 | 2002-04-10 |
00004887 | 2002-04-11 |
00004887 | 2002-04-12 |
00004887 | 2002-04-13 |
00004887 | 2002-04-14 |
00004887 | 2002-04-15 |
00004887 | 2002-04-16 |
00004887 | 2002-04-16 |
00004887 | 2002-04-17 |
00004887 | 2002-04-18 |
00004887 | 2002-04-19 |
00004887 | 2002-04-20 |
00004887 | 2002-04-21 |
00004887 | 2002-04-22 |
00004887 | 2002-04-23 |
00004887 | 2002-04-24 |
00004887 | 2002-04-25 |
00004887 | 2002-04-26 |
00004887 | 2002-04-27 |
00004887 | 2002-04-28 |
00004887 | 2002-04-29 |
00004887 | 2002-04-30 |
00004887 | 2002-05-01 |
00004887 | 2002-05-02 |
00004887 | 2002-05-03 |
00004887 | 2002-05-03 |
00004887 | 2002-05-04 |
00004887 | 2002-05-04 |
00004887 | 2002-05-05 |
00004887 | 2002-05-05 |
00004887 | 2002-05-05 |
00004887 | 2002-05-06 |
00004887 | 2002-05-06 |
00004887 | 2002-05-06 |
00004887 | 2002-05-07 |
00004887 | 2002-05-08 |
00004887 | 2002-05-08 |
00004887 | 2002-05-09 |
00004887 | 2002-05-10 |
00004887 | 2002-05-10 |
00004887 | 2002-05-11 |
00004887 | 2002-05-12 |
00004887 | 2002-05-13 |
00004887 | 2002-05-14 |
00004887 | 2002-05-15 |
00004887 | 2002-05-16 |
00004887 | 2002-05-17 |
00004887 | 2002-05-18 |
00004887 | 2002-05-19 |
00004887 | 2002-05-20 |
00004887 | 2002-05-20 |
00004887 | 2002-05-21 |
00004887 | 2002-05-22 |
00004887 | 2002-05-23 |
00004887 | 2002-05-24 |
00004887 | 2002-05-25 |
00004887 | 2002-05-26 |
00004887 | 2002-05-27 |
00004887 | 2002-05-28 |
00004887 | 2002-05-29 |
00004887 | 2002-05-30 |
00004887 | 2002-05-31 |
00004887 | 2002-06-01 |
00004887 | 2002-06-02 |
00004887 | 2002-06-03 |
00004887 | 2002-06-04 |
00004887 | 2002-06-05 |
00004887 | 2002-06-06 |
00004887 | 2002-06-07 |
00004887 | 2002-06-08 |
00004887 | 2002-06-09 |
00004887 | 2002-06-10 |
00004887 | 2002-06-11 |
00004887 | 2002-06-25 |
00004887 | 2002-06-26 |
00004887 | 2002-06-26 |
00004887 | 2002-06-26 |
00004887 | 2002-06-27 |
00004887 | 2002-06-28 |
00004887 | 2002-06-28 |
00004887 | 2002-06-29 |
00004887 | 2002-06-29 |
00004887 | 2002-06-30 |
00004887 | 2002-06-30 |
00004887 | 2002-07-01 |
00004887 | 2002-07-02 |
00004887 | 2002-07-02 |
00004887 | 2002-07-02 |
00004887 | 2002-07-02 |
00004887 | 2002-07-03 |
00004887 | 2002-07-03 |
00004887 | 2002-07-03 |
00004887 | 2002-07-04 |
00004887 | 2002-07-05 |
00004887 | 2002-07-06 |
00004887 | 2002-07-06 |
00004887 | 2002-07-07 |
00004887 | 2002-07-07 |
00004887 | 2002-07-08 |
00004887 | 2002-07-08 |
00004887 | 2002-07-08 |
00004887 | 2002-07-09 |
00004887 | 2002-07-10 |
00004887 | 2002-07-10 |
00004887 | 2002-07-10 |
00004887 | 2002-07-11 |
00004887 | 2002-07-12 |
00004887 | 2002-07-12 |
00004887 | 2002-07-12 |
00004887 | 2002-07-13 |
00004887 | 2002-07-13 |
00004887 | 2002-07-14 |
00004887 | 2002-07-14 |
00004887 | 2002-07-15 |
00004887 | 2002-07-16 |
00004887 | 2002-07-16 |
00004887 | 2002-07-17 |
00004887 | 2002-07-17 |
00004887 | 2002-07-17 |
00004887 | 2002-07-18 |
00004887 | 2002-07-18 |
00004887 | 2002-07-19 |
00004887 | 2002-07-20 |
00004887 | 2002-07-21 |
00004887 | 2002-07-21 |
00004887 | 2002-07-21 |
00004887 | 2002-07-22 |
00004887 | 2002-07-22 |
00004887 | 2002-07-23 |
00004887 | 2002-07-23 |
00004887 | 2002-07-23 |
00004887 | 2002-07-24 |
00006949 | 2004-06-23 |
00006949 | 2004-06-24 |
00006949 | 2004-06-25 |
00006949 | 2004-06-29 |
00006949 | 2004-07-01 |
00006949 | 2004-07-02 |
00006949 | 2004-07-12 |
00006949 | 2004-07-13 |
00006949 | 2004-07-14 |
00006949 | 2004-07-15 |
00006949 | 2004-07-26 |
00006949 | 2004-07-28 |
00006949 | 2004-08-19 |
00006949 | 2004-09-13 |
00006949 | 2004-10-04 |
00006949 | 2004-10-25 |
00006949 | 2004-11-15 |
00006949 | 2004-12-16 |
HI @Denali By any chance you are after this?
proc sort data= test out=test1 nodupkey ;
by ID descending CREATININE_DATE;
run;
data want;
set test1;
by id;
if first.id;
run;
Not sure what you are after
I'm afraid that proc sort with nodupkey and descending by variable won't produce what the OP requested, namely the LAST record with a tied key value. In data HAVE have below, presumably the user would want what shows in the log for dataset WANT (with seqnum=key). But the proc sort nodupkey … by descending key produces constant seqnum=1:
616 data have;
617 do key=1 to 10;
618 do seqnum=1 to key;
619 output;
620 end;
621 end;
622 run;623
624 data want;
625 set have;
626 by key;
627 if last.key;
628 put key=z2. seqnum=;
629 run;key=01 seqnum=1
key=02 seqnum=2
key=03 seqnum=3
key=04 seqnum=4
key=05 seqnum=5
key=06 seqnum=6
key=07 seqnum=7
key=08 seqnum=8
key=09 seqnum=9
key=10 seqnum=10
NOTE: There were 55 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 10 observations and 2 variables.
630
631 proc sort data=have out=want2 nodupkey; by descending key; run;NOTE: There were 55 observations read from the data set WORK.HAVE.
NOTE: 45 observations with duplicate key values were deleted.
NOTE: The data set WORK.WANT2 has 10 observations and 2 variables.
632 data _null_;
633 set want2;
634 put key=z2. seqnum=;
635 run;key=10 seqnum=1
key=09 seqnum=1
key=08 seqnum=1
key=07 seqnum=1
key=06 seqnum=1
key=05 seqnum=1
key=04 seqnum=1
key=03 seqnum=1
key=02 seqnum=1
key=01 seqnum=1
Edited addition: My take on this is that NODUPKEY takes the first encountered record with a tied key, not the first output sorted record that is tied. The de-duplication takes place at input time, not output time.
Hmmm if PROC SORT NODUPKEY doesn't delete them, then my guess is they're not really duplicates. I would double-check the values, to make sure they are exact duplicates (no difference in leading blanks, or no decimal values on the date variable, or ... )
Check the format on the column CREATININE_DATE. Perhaps the underlying data has some timestamps associated with it.
Proc contents data=test ;
run;
ANOTHER WAY OF REMOVING DUPLICATES IS BY USING THE NODUP
code;
proc sort data=test out=test1 nodup;
by _all_;
run;
Note that this program will removes the exact duplicates by using all the variables that in the by statement in this case its all the variables
Hi,
The clinicians extracted the data from medical records and they wanted us to keep the last creatinine value if there are multiple measurements on the same date. Maybe I should create a "count" variable per patient per date?
ID | CREATININE DATE | CREATININE VALUE |
00011695 | 2002-01-28 | 0.8 |
00011695 | 2002-01-29 | 1.1 |
00011695 | 2002-01-29 | 1.5 |
00011695 | 2002-01-29 | 1.7 |
00011695 | 2002-01-30 | 2.3 |
00011695 | 2002-01-30 | 2.5 |
00011695 | 2002-01-31 | 2.5 |
00011695 | 2002-02-01 | 3.5 |
00011695 | 2002-02-01 | 2.9 |
00011695 | 2002-02-02 | 2.1 |
00011695 | 2002-02-03 | 1.4 |
00011695 | 2002-02-04 | 1.3 |
00011695 | 2002-02-05 | 1.4 |
00011695 | 2002-02-06 | 1.2 |
00011695 | 2002-02-08 | 1.2 |
00011695 | 2002-02-09 | 1.2 |
00011695 | 2002-02-09 | 1.2 |
00011695 | 2002-02-10 | 1 |
00011695 | 2002-02-11 | 1.2 |
00011695 | 2002-02-12 | 1.2 |
00011695 | 2002-02-13 | 1.1 |
00011695 | 2002-02-14 | 1.2 |
00011695 | 2002-02-15 | 1.2 |
00011695 | 2002-02-18 | 1.2 |
Thank you!
So, in your sample table, how does one know which is the last creatinine value for a given date? Are you depending upon the physical order of data to represent chronological order? Very dangerous. The data you report certainly doesn't provide any extra content to specify which of two records on a given date is the latter.
I just realized that the Creatinine date variable in Excel file carries an underlying time format:
CREATININE DATE |
5/25/14 8:54 |
5/26/14 8:05 |
5/27/14 8:08 |
5/27/14 11:18 |
5/28/14 8:50 |
6/5/14 15:23 |
6/6/14 15:07 |
6/7/14 8:49 |
6/8/14 8:39 |
6/8/14 15:23 |
6/9/14 8:33 |
2/22/02 9:35 |
2/23/02 0:01 |
2/23/02 7:46 |
2/24/02 7:30 |
3/6/02 12:00 |
3/14/02 9:40 |
3/15/02 8:15 |
3/15/02 19:00 |
3/16/02 8:35 |
3/16/02 22:30 |
Thank you.
Can you please post data in usable form? "Usable" means a data step with datalines, so the we can easily copy and execute the step to have what you have.See How to convert datasets to data steps if you don't know how to create the data step.
In the meantime you can try something like:
proc summary data=have nway;
class <insert id variable>;
var <insert datetime variable>;
id <insert other variables to keep>;
format <insert datetime variable> dtdate9.;
output out=work.want(drop=_freq_ _type_) min=;
run;
Unfortunately the format-trick does not work with proc sort and nodup*.
In that case you CAN find the last one. So don't throw away the timepart, use it.
So if your existing "date" variable is named DATETIME you can create two new variables, DATE and TIME.
data step1;
set have;
date=datepart(datetime);
time=timepart(datetime);
format date yymmdd10. time tod5.;
run;
Now sort by subject date and time and then take the last one for that date.
proc sort data=step1 out=want;
by subject date time;
run;
data want;
set step1;
by subject date time;
if last.date;
run;
You guys are genius! Thank you so much!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.