BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Denali
Quartz | Level 8

Hi,

 

I have a dataset with patients who were measured creatinine levels during a period of time. Somehow there were some patients were mistakenly measured more than once per day (e.g. see bolded ones below). I just would like to to keep the last measurement and remove the first measurement (duplicates).

 

Below is my code, but it did not remove the duplicates. Could anyone please help me?


proc sort data= test out=test1 nodupkey ;
by ID descending CREATININE_DATE;
run;


000048202014-05-25
000048202014-05-26
000048202014-05-27
000048202014-05-27
000048202014-05-28
000048202014-06-05
000048202014-06-06
000048202014-06-07
000048202014-06-08
000048202014-06-08
000048202014-06-09
000048872002-02-22
000048872002-02-23
000048872002-02-23
000048872002-02-24
000048872002-03-06
000048872002-03-14
000048872002-03-15
000048872002-03-15
000048872002-03-16
000048872002-03-16
000048872002-03-17
000048872002-03-25
000048872002-04-06
000048872002-04-06
000048872002-04-07
000048872002-04-08
000048872002-04-09
000048872002-04-10
000048872002-04-11
000048872002-04-12
000048872002-04-13
000048872002-04-14
000048872002-04-15
000048872002-04-16
000048872002-04-16
000048872002-04-17
000048872002-04-18
000048872002-04-19
000048872002-04-20
000048872002-04-21
000048872002-04-22
000048872002-04-23
000048872002-04-24
000048872002-04-25
000048872002-04-26
000048872002-04-27
000048872002-04-28
000048872002-04-29
000048872002-04-30
000048872002-05-01
000048872002-05-02
000048872002-05-03
000048872002-05-03
000048872002-05-04
000048872002-05-04
000048872002-05-05
000048872002-05-05
000048872002-05-05
000048872002-05-06
000048872002-05-06
000048872002-05-06
000048872002-05-07
000048872002-05-08
000048872002-05-08
000048872002-05-09
000048872002-05-10
000048872002-05-10
000048872002-05-11
000048872002-05-12
000048872002-05-13
000048872002-05-14
000048872002-05-15
000048872002-05-16
000048872002-05-17
000048872002-05-18
000048872002-05-19
000048872002-05-20
000048872002-05-20
000048872002-05-21
000048872002-05-22
000048872002-05-23
000048872002-05-24
000048872002-05-25
000048872002-05-26
000048872002-05-27
000048872002-05-28
000048872002-05-29
000048872002-05-30
000048872002-05-31
000048872002-06-01
000048872002-06-02
000048872002-06-03
000048872002-06-04
000048872002-06-05
000048872002-06-06
000048872002-06-07
000048872002-06-08
000048872002-06-09
000048872002-06-10
000048872002-06-11
000048872002-06-25
000048872002-06-26
000048872002-06-26
000048872002-06-26
000048872002-06-27
000048872002-06-28
000048872002-06-28
000048872002-06-29
000048872002-06-29
000048872002-06-30
000048872002-06-30
000048872002-07-01
000048872002-07-02
000048872002-07-02
000048872002-07-02
000048872002-07-02
000048872002-07-03
000048872002-07-03
000048872002-07-03
000048872002-07-04
000048872002-07-05
000048872002-07-06
000048872002-07-06
000048872002-07-07
000048872002-07-07
000048872002-07-08
000048872002-07-08
000048872002-07-08
000048872002-07-09
000048872002-07-10
000048872002-07-10
000048872002-07-10
000048872002-07-11
000048872002-07-12
000048872002-07-12
000048872002-07-12
000048872002-07-13
000048872002-07-13
000048872002-07-14
000048872002-07-14
000048872002-07-15
000048872002-07-16
000048872002-07-16
000048872002-07-17
000048872002-07-17
000048872002-07-17
000048872002-07-18
000048872002-07-18
000048872002-07-19
000048872002-07-20
000048872002-07-21
000048872002-07-21
000048872002-07-21
000048872002-07-22
000048872002-07-22
000048872002-07-23
000048872002-07-23
000048872002-07-23
000048872002-07-24
000069492004-06-23
000069492004-06-24
000069492004-06-25
000069492004-06-29
000069492004-07-01
000069492004-07-02
000069492004-07-12
000069492004-07-13
000069492004-07-14
000069492004-07-15
000069492004-07-26
000069492004-07-28
000069492004-08-19
000069492004-09-13
000069492004-10-04
000069492004-10-25
000069492004-11-15
000069492004-12-16
1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User
Maybe your CREATININE_DATE is not exact integer, they have decimal.
try : CREATININE_DATE=int( CREATININE_DATE ) ;
and run PROC SORT again.

View solution in original post

13 REPLIES 13
novinosrin
Tourmaline | Level 20

HI @Denali  By any chance you are after this?

 

proc sort data= test out=test1 nodupkey ;
by ID descending CREATININE_DATE;
run;

data want;
 set test1;
 by id;
 if first.id;
run;

Not sure what you are after

mkeintz
PROC Star

I'm afraid that proc sort with nodupkey  and descending by variable won't produce what the OP requested, namely the LAST record with a tied key value.  In data HAVE have below, presumably the user would want what shows in the log for dataset WANT (with seqnum=key).  But the proc sort nodupkey … by descending key produces constant seqnum=1:

 

616 data have;
617 do key=1 to 10;
618 do seqnum=1 to key;
619 output;
620 end;
621 end;
622 run;

623
624 data want;
625 set have;
626 by key;
627 if last.key;
628 put key=z2. seqnum=;
629 run;

key=01 seqnum=1
key=02 seqnum=2
key=03 seqnum=3
key=04 seqnum=4
key=05 seqnum=5
key=06 seqnum=6
key=07 seqnum=7
key=08 seqnum=8
key=09 seqnum=9
key=10 seqnum=10
NOTE: There were 55 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 10 observations and 2 variables.
630
631 proc sort data=have out=want2 nodupkey; by descending key; run;

NOTE: There were 55 observations read from the data set WORK.HAVE.
NOTE: 45 observations with duplicate key values were deleted.
NOTE: The data set WORK.WANT2 has 10 observations and 2 variables.
632 data _null_;
633 set want2;
634 put key=z2. seqnum=;
635 run;

key=10 seqnum=1
key=09 seqnum=1
key=08 seqnum=1
key=07 seqnum=1
key=06 seqnum=1
key=05 seqnum=1
key=04 seqnum=1
key=03 seqnum=1
key=02 seqnum=1
key=01 seqnum=1

 Edited addition:  My take on this is that NODUPKEY takes the first encountered record with a tied key, not the first output sorted record that is tied.  The de-duplication takes place at input time, not output time.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Quentin
Super User

Hmmm if PROC SORT NODUPKEY doesn't delete them, then my guess is they're not really duplicates.  I would double-check the values, to make sure they are exact duplicates (no difference in leading blanks, or no decimal values on the date variable, or ... )

The Boston Area SAS Users Group is hosting free webinars!
Next webinar will be in January 2025. Until then, check out our archives: https://www.basug.org/videos. And be sure to subscribe to our our email list.
r_behata
Barite | Level 11

Check the format on the column CREATININE_DATE. Perhaps the underlying data has some timestamps associated with it.

 

Proc contents data=test ;
run;
himself
Pyrite | Level 9

ANOTHER WAY OF REMOVING DUPLICATES IS BY USING THE NODUP

   code;

proc sort data=test out=test1 nodup;

   by _all_;

run;

 

Note that this program will removes the exact duplicates by using all the variables that in the by statement  in this case its all  the variables

 

 

  

 

 

Tom
Super User Tom
Super User
If they are duplicates why does it matter if you keep the first/last or one in the middle? And if it does matter what is the third variable that tells you which is the last for the same day? Is there a TIME of day variable also?
Denali
Quartz | Level 8

Hi,

 

The clinicians extracted the data from medical records and they wanted us to keep the last creatinine value if there are multiple measurements on the same date. Maybe I should create a "count" variable per patient per date? 

 

IDCREATININE DATECREATININE VALUE
000116952002-01-280.8
000116952002-01-291.1
000116952002-01-291.5
000116952002-01-291.7
000116952002-01-302.3
000116952002-01-302.5
000116952002-01-312.5
000116952002-02-013.5
000116952002-02-012.9
000116952002-02-022.1
000116952002-02-031.4
000116952002-02-041.3
000116952002-02-051.4
000116952002-02-061.2
000116952002-02-081.2
000116952002-02-091.2
000116952002-02-091.2
000116952002-02-101
000116952002-02-111.2
000116952002-02-121.2
000116952002-02-131.1
000116952002-02-141.2
000116952002-02-151.2
000116952002-02-181.2

 

Thank you!

mkeintz
PROC Star

So, in your sample table, how does one know which is the last creatinine value for a given date?  Are you depending upon the physical order of data to represent chronological order?   Very dangerous.  The data you report certainly doesn't provide any extra content to specify which of two records on a given date is the latter.  

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Denali
Quartz | Level 8

I just realized that the Creatinine date variable in Excel file carries an underlying time format:

 

CREATININE DATE
5/25/14 8:54
5/26/14 8:05
5/27/14 8:08
5/27/14 11:18
5/28/14 8:50
6/5/14 15:23
6/6/14 15:07
6/7/14 8:49
6/8/14 8:39
6/8/14 15:23
6/9/14 8:33
2/22/02 9:35
2/23/02 0:01
2/23/02 7:46
2/24/02 7:30
3/6/02 12:00
3/14/02 9:40
3/15/02 8:15
3/15/02 19:00
3/16/02 8:35
3/16/02 22:30

 

Thank you.

andreas_lds
Jade | Level 19

Can you please post data in usable form? "Usable" means a data step with datalines, so the we can easily copy and execute the step to have what you have.See How to convert datasets to data steps if you don't know how to create the data step.

 

In the meantime you can try something like:

proc summary data=have nway;
	class <insert id variable>;
	var <insert datetime variable>;
	id <insert other variables to keep>;
	format <insert datetime variable> dtdate9.; 
	output out=work.want(drop=_freq_ _type_) min=;
run;

Unfortunately the format-trick does not work with proc sort and nodup*.

Tom
Super User Tom
Super User

In that case you CAN find the last one.  So don't throw away the timepart, use it.

So if your existing "date" variable is named DATETIME you can create two new variables, DATE and TIME.

 

data step1;
  set have;
  date=datepart(datetime);
  time=timepart(datetime);
  format date yymmdd10. time tod5.;
run;

Now sort by subject date and time and then take the last one for that date.

 

proc sort data=step1 out=want;
  by subject date time;
run;
data want;
  set step1;
  by subject date time;
  if last.date;
run;

 

Ksharp
Super User
Maybe your CREATININE_DATE is not exact integer, they have decimal.
try : CREATININE_DATE=int( CREATININE_DATE ) ;
and run PROC SORT again.
Denali
Quartz | Level 8

You guys are genius! Thank you so much!

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 13 replies
  • 5996 views
  • 3 likes
  • 9 in conversation