Solved: Re: Remove deuplicates and keep last observation

Denali · Posted 03-30-2020 04:25 PM

Hi,

I have a dataset with patients who were measured creatinine levels during a period of time. Somehow there were some patients were mistakenly measured more than once per day (e.g. see bolded ones below). I just would like to to keep the last measurement and remove the first measurement (duplicates).

Below is my code, but it did not remove the duplicates. Could anyone please help me?

proc sort data= test out=test1 nodupkey ;
by ID descending CREATININE_DATE;
run;

00004820	2014-05-25
00004820	2014-05-26
00004820	2014-05-27
00004820	2014-05-27
00004820	2014-05-28
00004820	2014-06-05
00004820	2014-06-06
00004820	2014-06-07
00004820	2014-06-08
00004820	2014-06-08
00004820	2014-06-09
00004887	2002-02-22
00004887	2002-02-23
00004887	2002-02-23
00004887	2002-02-24
00004887	2002-03-06
00004887	2002-03-14
00004887	2002-03-15
00004887	2002-03-15
00004887	2002-03-16
00004887	2002-03-16
00004887	2002-03-17
00004887	2002-03-25
00004887	2002-04-06
00004887	2002-04-06
00004887	2002-04-07
00004887	2002-04-08
00004887	2002-04-09
00004887	2002-04-10
00004887	2002-04-11
00004887	2002-04-12
00004887	2002-04-13
00004887	2002-04-14
00004887	2002-04-15
00004887	2002-04-16
00004887	2002-04-16
00004887	2002-04-17
00004887	2002-04-18
00004887	2002-04-19
00004887	2002-04-20
00004887	2002-04-21
00004887	2002-04-22
00004887	2002-04-23
00004887	2002-04-24
00004887	2002-04-25
00004887	2002-04-26
00004887	2002-04-27
00004887	2002-04-28
00004887	2002-04-29
00004887	2002-04-30
00004887	2002-05-01
00004887	2002-05-02
00004887	2002-05-03
00004887	2002-05-03
00004887	2002-05-04
00004887	2002-05-04
00004887	2002-05-05
00004887	2002-05-05
00004887	2002-05-05
00004887	2002-05-06
00004887	2002-05-06
00004887	2002-05-06
00004887	2002-05-07
00004887	2002-05-08
00004887	2002-05-08
00004887	2002-05-09
00004887	2002-05-10
00004887	2002-05-10
00004887	2002-05-11
00004887	2002-05-12
00004887	2002-05-13
00004887	2002-05-14
00004887	2002-05-15
00004887	2002-05-16
00004887	2002-05-17
00004887	2002-05-18
00004887	2002-05-19
00004887	2002-05-20
00004887	2002-05-20
00004887	2002-05-21
00004887	2002-05-22
00004887	2002-05-23
00004887	2002-05-24
00004887	2002-05-25
00004887	2002-05-26
00004887	2002-05-27
00004887	2002-05-28
00004887	2002-05-29
00004887	2002-05-30
00004887	2002-05-31
00004887	2002-06-01
00004887	2002-06-02
00004887	2002-06-03
00004887	2002-06-04
00004887	2002-06-05
00004887	2002-06-06
00004887	2002-06-07
00004887	2002-06-08
00004887	2002-06-09
00004887	2002-06-10
00004887	2002-06-11
00004887	2002-06-25
00004887	2002-06-26
00004887	2002-06-26
00004887	2002-06-26
00004887	2002-06-27
00004887	2002-06-28
00004887	2002-06-28
00004887	2002-06-29
00004887	2002-06-29
00004887	2002-06-30
00004887	2002-06-30
00004887	2002-07-01
00004887	2002-07-02
00004887	2002-07-02
00004887	2002-07-02
00004887	2002-07-02
00004887	2002-07-03
00004887	2002-07-03
00004887	2002-07-03
00004887	2002-07-04
00004887	2002-07-05
00004887	2002-07-06
00004887	2002-07-06
00004887	2002-07-07
00004887	2002-07-07
00004887	2002-07-08
00004887	2002-07-08
00004887	2002-07-08
00004887	2002-07-09
00004887	2002-07-10
00004887	2002-07-10
00004887	2002-07-10
00004887	2002-07-11
00004887	2002-07-12
00004887	2002-07-12
00004887	2002-07-12
00004887	2002-07-13
00004887	2002-07-13
00004887	2002-07-14
00004887	2002-07-14
00004887	2002-07-15
00004887	2002-07-16
00004887	2002-07-16
00004887	2002-07-17
00004887	2002-07-17
00004887	2002-07-17
00004887	2002-07-18
00004887	2002-07-18
00004887	2002-07-19
00004887	2002-07-20
00004887	2002-07-21
00004887	2002-07-21
00004887	2002-07-21
00004887	2002-07-22
00004887	2002-07-22
00004887	2002-07-23
00004887	2002-07-23
00004887	2002-07-23
00004887	2002-07-24
00006949	2004-06-23
00006949	2004-06-24
00006949	2004-06-25
00006949	2004-06-29
00006949	2004-07-01
00006949	2004-07-02
00006949	2004-07-12
00006949	2004-07-13
00006949	2004-07-14
00006949	2004-07-15
00006949	2004-07-26
00006949	2004-07-28
00006949	2004-08-19
00006949	2004-09-13
00006949	2004-10-04
00006949	2004-10-25
00006949	2004-11-15
00006949	2004-12-16

Ksharp · Posted 03-31-2020 06:55 AM

Maybe your CREATININE_DATE is not exact integer, they have decimal.
try : CREATININE_DATE=int( CREATININE_DATE ) ;
and run PROC SORT again.

View solution in original post

novinosrin · Posted 03-30-2020 04:33 PM

HI @Denali By any chance you are after this?

proc sort data= test out=test1 nodupkey ;
by ID descending CREATININE_DATE;
run;

data want;
 set test1;
 by id;
 if first.id;
run;

Not sure what you are after

mkeintz · Posted 03-30-2020 05:13 PM

I'm afraid that proc sort with nodupkey and descending by variable won't produce what the OP requested, namely the LAST record with a tied key value. In data HAVE have below, presumably the user would want what shows in the log for dataset WANT (with seqnum=key). But the proc sort nodupkey … by descending key produces constant seqnum=1:

616 data have;
617 do key=1 to 10;
618 do seqnum=1 to key;
619 output;
620 end;
621 end;
622 run;

623
624 data want;
625 set have;
626 by key;
627 if last.key;
628 put key=z2. seqnum=;
629 run;

key=01 seqnum=1
key=02 seqnum=2
key=03 seqnum=3
key=04 seqnum=4
key=05 seqnum=5
key=06 seqnum=6
key=07 seqnum=7
key=08 seqnum=8
key=09 seqnum=9
key=10 seqnum=10
NOTE: There were 55 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 10 observations and 2 variables.
630
631 proc sort data=have out=want2 nodupkey; by descending key; run;

NOTE: There were 55 observations read from the data set WORK.HAVE.
NOTE: 45 observations with duplicate key values were deleted.
NOTE: The data set WORK.WANT2 has 10 observations and 2 variables.
632 data _null_;
633 set want2;
634 put key=z2. seqnum=;
635 run;

key=10 seqnum=1
key=09 seqnum=1
key=08 seqnum=1
key=07 seqnum=1
key=06 seqnum=1
key=05 seqnum=1
key=04 seqnum=1
key=03 seqnum=1
key=02 seqnum=1
key=01 seqnum=1

Edited addition: My take on this is that NODUPKEY takes the first encountered record with a tied key, not the first output sorted record that is tied. The de-duplication takes place at input time, not output time.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Quentin · Posted 03-30-2020 04:36 PM

Hmmm if PROC SORT NODUPKEY doesn't delete them, then my guess is they're not really duplicates. I would double-check the values, to make sure they are exact duplicates (no difference in leading blanks, or no decimal values on the date variable, or ... )

The Boston Area SAS Users Group is hosting free webinars!

Register now at https://www.basug.org/events.

r_behata · Posted 03-30-2020 04:38 PM

Check the format on the column CREATININE_DATE. Perhaps the underlying data has some timestamps associated with it.

Proc contents data=test ;
run;

himself · Posted 03-30-2020 05:42 PM

ANOTHER WAY OF REMOVING DUPLICATES IS BY USING THE NODUP

code;

proc sort data=test out=test1 nodup;

by _all_;

run;

Note that this program will removes the exact duplicates by using all the variables that in the by statement in this case its all the variables

@Denali

Tom · Posted 03-30-2020 08:54 PM

If they are duplicates why does it matter if you keep the first/last or one in the middle? And if it does matter what is the third variable that tells you which is the last for the same day? Is there a TIME of day variable also?

Denali · Posted 03-30-2020 09:05 PM

Hi,

The clinicians extracted the data from medical records and they wanted us to keep the last creatinine value if there are multiple measurements on the same date. Maybe I should create a "count" variable per patient per date?

ID	CREATININE DATE	CREATININE VALUE
00011695	2002-01-28	0.8
00011695	2002-01-29	1.1
00011695	2002-01-29	1.5
00011695	2002-01-29	1.7
00011695	2002-01-30	2.3
00011695	2002-01-30	2.5
00011695	2002-01-31	2.5
00011695	2002-02-01	3.5
00011695	2002-02-01	2.9
00011695	2002-02-02	2.1
00011695	2002-02-03	1.4
00011695	2002-02-04	1.3
00011695	2002-02-05	1.4
00011695	2002-02-06	1.2
00011695	2002-02-08	1.2
00011695	2002-02-09	1.2
00011695	2002-02-09	1.2
00011695	2002-02-10	1
00011695	2002-02-11	1.2
00011695	2002-02-12	1.2
00011695	2002-02-13	1.1
00011695	2002-02-14	1.2
00011695	2002-02-15	1.2
00011695	2002-02-18	1.2

Thank you!

mkeintz · Posted 03-30-2020 09:38 PM

So, in your sample table, how does one know which is the last creatinine value for a given date? Are you depending upon the physical order of data to represent chronological order? Very dangerous. The data you report certainly doesn't provide any extra content to specify which of two records on a given date is the latter.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Denali · Posted 03-30-2020 11:30 PM

I just realized that the Creatinine date variable in Excel file carries an underlying time format:

CREATININE DATE

5/25/14 8:54

5/26/14 8:05

5/27/14 8:08

5/27/14 11:18

5/28/14 8:50

6/5/14 15:23

6/6/14 15:07

6/7/14 8:49

6/8/14 8:39

6/8/14 15:23

6/9/14 8:33

2/22/02 9:35

2/23/02 0:01

2/23/02 7:46

2/24/02 7:30

3/6/02 12:00

3/14/02 9:40

3/15/02 8:15

3/15/02 19:00

3/16/02 8:35

3/16/02 22:30

Thank you.

andreas_lds · Posted 03-31-2020 01:47 AM

Can you please post data in usable form? "Usable" means a data step with datalines, so the we can easily copy and execute the step to have what you have.See How to convert datasets to data steps if you don't know how to create the data step.

In the meantime you can try something like:

proc summary data=have nway;
	class <insert id variable>;
	var <insert datetime variable>;
	id <insert other variables to keep>;
	format <insert datetime variable> dtdate9.; 
	output out=work.want(drop=_freq_ _type_) min=;
run;

Unfortunately the format-trick does not work with proc sort and nodup*.

Tom · Posted 03-31-2020 09:13 AM

In that case you CAN find the last one. So don't throw away the timepart, use it.

So if your existing "date" variable is named DATETIME you can create two new variables, DATE and TIME.

data step1;
  set have;
  date=datepart(datetime);
  time=timepart(datetime);
  format date yymmdd10. time tod5.;
run;

Now sort by subject date and time and then take the last one for that date.

proc sort data=step1 out=want;
  by subject date time;
run;
data want;
  set step1;
  by subject date time;
  if last.date;
run;

Ksharp · Posted 03-31-2020 06:55 AM

Maybe your CREATININE_DATE is not exact integer, they have decimal.
try : CREATININE_DATE=int( CREATININE_DATE ) ;
and run PROC SORT again.

Denali · Posted 03-31-2020 09:05 AM

You guys are genius! Thank you so much!

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away