- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I have a dataset with patients who were measured creatinine levels during a period of time. Somehow there were some patients were mistakenly measured more than once per day (e.g. see bolded ones below). I just would like to to keep the last measurement and remove the first measurement (duplicates).
Below is my code, but it did not remove the duplicates. Could anyone please help me?
proc sort data= test out=test1 nodupkey ;
by ID descending CREATININE_DATE;
run;
00004820 | 2014-05-25 |
00004820 | 2014-05-26 |
00004820 | 2014-05-27 |
00004820 | 2014-05-27 |
00004820 | 2014-05-28 |
00004820 | 2014-06-05 |
00004820 | 2014-06-06 |
00004820 | 2014-06-07 |
00004820 | 2014-06-08 |
00004820 | 2014-06-08 |
00004820 | 2014-06-09 |
00004887 | 2002-02-22 |
00004887 | 2002-02-23 |
00004887 | 2002-02-23 |
00004887 | 2002-02-24 |
00004887 | 2002-03-06 |
00004887 | 2002-03-14 |
00004887 | 2002-03-15 |
00004887 | 2002-03-15 |
00004887 | 2002-03-16 |
00004887 | 2002-03-16 |
00004887 | 2002-03-17 |
00004887 | 2002-03-25 |
00004887 | 2002-04-06 |
00004887 | 2002-04-06 |
00004887 | 2002-04-07 |
00004887 | 2002-04-08 |
00004887 | 2002-04-09 |
00004887 | 2002-04-10 |
00004887 | 2002-04-11 |
00004887 | 2002-04-12 |
00004887 | 2002-04-13 |
00004887 | 2002-04-14 |
00004887 | 2002-04-15 |
00004887 | 2002-04-16 |
00004887 | 2002-04-16 |
00004887 | 2002-04-17 |
00004887 | 2002-04-18 |
00004887 | 2002-04-19 |
00004887 | 2002-04-20 |
00004887 | 2002-04-21 |
00004887 | 2002-04-22 |
00004887 | 2002-04-23 |
00004887 | 2002-04-24 |
00004887 | 2002-04-25 |
00004887 | 2002-04-26 |
00004887 | 2002-04-27 |
00004887 | 2002-04-28 |
00004887 | 2002-04-29 |
00004887 | 2002-04-30 |
00004887 | 2002-05-01 |
00004887 | 2002-05-02 |
00004887 | 2002-05-03 |
00004887 | 2002-05-03 |
00004887 | 2002-05-04 |
00004887 | 2002-05-04 |
00004887 | 2002-05-05 |
00004887 | 2002-05-05 |
00004887 | 2002-05-05 |
00004887 | 2002-05-06 |
00004887 | 2002-05-06 |
00004887 | 2002-05-06 |
00004887 | 2002-05-07 |
00004887 | 2002-05-08 |
00004887 | 2002-05-08 |
00004887 | 2002-05-09 |
00004887 | 2002-05-10 |
00004887 | 2002-05-10 |
00004887 | 2002-05-11 |
00004887 | 2002-05-12 |
00004887 | 2002-05-13 |
00004887 | 2002-05-14 |
00004887 | 2002-05-15 |
00004887 | 2002-05-16 |
00004887 | 2002-05-17 |
00004887 | 2002-05-18 |
00004887 | 2002-05-19 |
00004887 | 2002-05-20 |
00004887 | 2002-05-20 |
00004887 | 2002-05-21 |
00004887 | 2002-05-22 |
00004887 | 2002-05-23 |
00004887 | 2002-05-24 |
00004887 | 2002-05-25 |
00004887 | 2002-05-26 |
00004887 | 2002-05-27 |
00004887 | 2002-05-28 |
00004887 | 2002-05-29 |
00004887 | 2002-05-30 |
00004887 | 2002-05-31 |
00004887 | 2002-06-01 |
00004887 | 2002-06-02 |
00004887 | 2002-06-03 |
00004887 | 2002-06-04 |
00004887 | 2002-06-05 |
00004887 | 2002-06-06 |
00004887 | 2002-06-07 |
00004887 | 2002-06-08 |
00004887 | 2002-06-09 |
00004887 | 2002-06-10 |
00004887 | 2002-06-11 |
00004887 | 2002-06-25 |
00004887 | 2002-06-26 |
00004887 | 2002-06-26 |
00004887 | 2002-06-26 |
00004887 | 2002-06-27 |
00004887 | 2002-06-28 |
00004887 | 2002-06-28 |
00004887 | 2002-06-29 |
00004887 | 2002-06-29 |
00004887 | 2002-06-30 |
00004887 | 2002-06-30 |
00004887 | 2002-07-01 |
00004887 | 2002-07-02 |
00004887 | 2002-07-02 |
00004887 | 2002-07-02 |
00004887 | 2002-07-02 |
00004887 | 2002-07-03 |
00004887 | 2002-07-03 |
00004887 | 2002-07-03 |
00004887 | 2002-07-04 |
00004887 | 2002-07-05 |
00004887 | 2002-07-06 |
00004887 | 2002-07-06 |
00004887 | 2002-07-07 |
00004887 | 2002-07-07 |
00004887 | 2002-07-08 |
00004887 | 2002-07-08 |
00004887 | 2002-07-08 |
00004887 | 2002-07-09 |
00004887 | 2002-07-10 |
00004887 | 2002-07-10 |
00004887 | 2002-07-10 |
00004887 | 2002-07-11 |
00004887 | 2002-07-12 |
00004887 | 2002-07-12 |
00004887 | 2002-07-12 |
00004887 | 2002-07-13 |
00004887 | 2002-07-13 |
00004887 | 2002-07-14 |
00004887 | 2002-07-14 |
00004887 | 2002-07-15 |
00004887 | 2002-07-16 |
00004887 | 2002-07-16 |
00004887 | 2002-07-17 |
00004887 | 2002-07-17 |
00004887 | 2002-07-17 |
00004887 | 2002-07-18 |
00004887 | 2002-07-18 |
00004887 | 2002-07-19 |
00004887 | 2002-07-20 |
00004887 | 2002-07-21 |
00004887 | 2002-07-21 |
00004887 | 2002-07-21 |
00004887 | 2002-07-22 |
00004887 | 2002-07-22 |
00004887 | 2002-07-23 |
00004887 | 2002-07-23 |
00004887 | 2002-07-23 |
00004887 | 2002-07-24 |
00006949 | 2004-06-23 |
00006949 | 2004-06-24 |
00006949 | 2004-06-25 |
00006949 | 2004-06-29 |
00006949 | 2004-07-01 |
00006949 | 2004-07-02 |
00006949 | 2004-07-12 |
00006949 | 2004-07-13 |
00006949 | 2004-07-14 |
00006949 | 2004-07-15 |
00006949 | 2004-07-26 |
00006949 | 2004-07-28 |
00006949 | 2004-08-19 |
00006949 | 2004-09-13 |
00006949 | 2004-10-04 |
00006949 | 2004-10-25 |
00006949 | 2004-11-15 |
00006949 | 2004-12-16 |
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
try : CREATININE_DATE=int( CREATININE_DATE ) ;
and run PROC SORT again.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
HI @Denali By any chance you are after this?
proc sort data= test out=test1 nodupkey ;
by ID descending CREATININE_DATE;
run;
data want;
set test1;
by id;
if first.id;
run;
Not sure what you are after
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I'm afraid that proc sort with nodupkey and descending by variable won't produce what the OP requested, namely the LAST record with a tied key value. In data HAVE have below, presumably the user would want what shows in the log for dataset WANT (with seqnum=key). But the proc sort nodupkey … by descending key produces constant seqnum=1:
616 data have;
617 do key=1 to 10;
618 do seqnum=1 to key;
619 output;
620 end;
621 end;
622 run;623
624 data want;
625 set have;
626 by key;
627 if last.key;
628 put key=z2. seqnum=;
629 run;key=01 seqnum=1
key=02 seqnum=2
key=03 seqnum=3
key=04 seqnum=4
key=05 seqnum=5
key=06 seqnum=6
key=07 seqnum=7
key=08 seqnum=8
key=09 seqnum=9
key=10 seqnum=10
NOTE: There were 55 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 10 observations and 2 variables.
630
631 proc sort data=have out=want2 nodupkey; by descending key; run;NOTE: There were 55 observations read from the data set WORK.HAVE.
NOTE: 45 observations with duplicate key values were deleted.
NOTE: The data set WORK.WANT2 has 10 observations and 2 variables.
632 data _null_;
633 set want2;
634 put key=z2. seqnum=;
635 run;key=10 seqnum=1
key=09 seqnum=1
key=08 seqnum=1
key=07 seqnum=1
key=06 seqnum=1
key=05 seqnum=1
key=04 seqnum=1
key=03 seqnum=1
key=02 seqnum=1
key=01 seqnum=1
Edited addition: My take on this is that NODUPKEY takes the first encountered record with a tied key, not the first output sorted record that is tied. The de-duplication takes place at input time, not output time.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hmmm if PROC SORT NODUPKEY doesn't delete them, then my guess is they're not really duplicates. I would double-check the values, to make sure they are exact duplicates (no difference in leading blanks, or no decimal values on the date variable, or ... )
Next up: SAS Trivia Quiz hosted by SAS on Wednesday May 21.
Register now at https://www.basug.org/events.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Check the format on the column CREATININE_DATE. Perhaps the underlying data has some timestamps associated with it.
Proc contents data=test ;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
ANOTHER WAY OF REMOVING DUPLICATES IS BY USING THE NODUP
code;
proc sort data=test out=test1 nodup;
by _all_;
run;
Note that this program will removes the exact duplicates by using all the variables that in the by statement in this case its all the variables
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
The clinicians extracted the data from medical records and they wanted us to keep the last creatinine value if there are multiple measurements on the same date. Maybe I should create a "count" variable per patient per date?
ID | CREATININE DATE | CREATININE VALUE |
00011695 | 2002-01-28 | 0.8 |
00011695 | 2002-01-29 | 1.1 |
00011695 | 2002-01-29 | 1.5 |
00011695 | 2002-01-29 | 1.7 |
00011695 | 2002-01-30 | 2.3 |
00011695 | 2002-01-30 | 2.5 |
00011695 | 2002-01-31 | 2.5 |
00011695 | 2002-02-01 | 3.5 |
00011695 | 2002-02-01 | 2.9 |
00011695 | 2002-02-02 | 2.1 |
00011695 | 2002-02-03 | 1.4 |
00011695 | 2002-02-04 | 1.3 |
00011695 | 2002-02-05 | 1.4 |
00011695 | 2002-02-06 | 1.2 |
00011695 | 2002-02-08 | 1.2 |
00011695 | 2002-02-09 | 1.2 |
00011695 | 2002-02-09 | 1.2 |
00011695 | 2002-02-10 | 1 |
00011695 | 2002-02-11 | 1.2 |
00011695 | 2002-02-12 | 1.2 |
00011695 | 2002-02-13 | 1.1 |
00011695 | 2002-02-14 | 1.2 |
00011695 | 2002-02-15 | 1.2 |
00011695 | 2002-02-18 | 1.2 |
Thank you!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
So, in your sample table, how does one know which is the last creatinine value for a given date? Are you depending upon the physical order of data to represent chronological order? Very dangerous. The data you report certainly doesn't provide any extra content to specify which of two records on a given date is the latter.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I just realized that the Creatinine date variable in Excel file carries an underlying time format:
CREATININE DATE |
5/25/14 8:54 |
5/26/14 8:05 |
5/27/14 8:08 |
5/27/14 11:18 |
5/28/14 8:50 |
6/5/14 15:23 |
6/6/14 15:07 |
6/7/14 8:49 |
6/8/14 8:39 |
6/8/14 15:23 |
6/9/14 8:33 |
2/22/02 9:35 |
2/23/02 0:01 |
2/23/02 7:46 |
2/24/02 7:30 |
3/6/02 12:00 |
3/14/02 9:40 |
3/15/02 8:15 |
3/15/02 19:00 |
3/16/02 8:35 |
3/16/02 22:30 |
Thank you.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Can you please post data in usable form? "Usable" means a data step with datalines, so the we can easily copy and execute the step to have what you have.See How to convert datasets to data steps if you don't know how to create the data step.
In the meantime you can try something like:
proc summary data=have nway;
class <insert id variable>;
var <insert datetime variable>;
id <insert other variables to keep>;
format <insert datetime variable> dtdate9.;
output out=work.want(drop=_freq_ _type_) min=;
run;
Unfortunately the format-trick does not work with proc sort and nodup*.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
In that case you CAN find the last one. So don't throw away the timepart, use it.
So if your existing "date" variable is named DATETIME you can create two new variables, DATE and TIME.
data step1;
set have;
date=datepart(datetime);
time=timepart(datetime);
format date yymmdd10. time tod5.;
run;
Now sort by subject date and time and then take the last one for that date.
proc sort data=step1 out=want;
by subject date time;
run;
data want;
set step1;
by subject date time;
if last.date;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
try : CREATININE_DATE=int( CREATININE_DATE ) ;
and run PROC SORT again.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You guys are genius! Thank you so much!