DATA Step, Macro, Functions and more

How to eliminate "duplicate" records based on a date criteria (not created)

Reply
New Contributor
Posts: 4

How to eliminate "duplicate" records based on a date criteria (not created)

[ Edited ]

Hello everyone,

 

I am trying to get rid of some duplicate records based on:

1. If A, C, D, E, F, G is same AND

2. The difference between B is lower than 42 days 

 

 

 

ABCDEFG
SL001067142013-08-26TOERSSS
SL001067142013-11-12TOERSSS
SL001067232013-04-17NARE    
SL001067232013-04-17NARE    
SL001067392013-04-14THIGHRSSS
SL001067812010-05-04SKINSSS 
SL001067812010-05-04TOESSS 
SL001067812012-11-25CHINRRSS
SL001067812013-01-07BLOODRRS 
SL001067812013-01-07BLOODRRS 
SL001067812013-02-01BLOODRRSS
SL001067812013-05-22ELBOWRRSS
SL001067812013-07-26ELBOWRRSS

 

 

 

For example, for records between row 7 and 14, I would want to eliminate row 11

Is there a way to do this?

 

Thank you very much in advance!

Super User
Posts: 23,342

Re: How to eliminate "duplicate" records based on a date criteria (not created)

1. Use DIF() to calculate the difference between dates

2. Use BY group processing to identify your groups

 

untested, but some thing like this should work. 

 

Please post your data as text in the future, not an image. I'm not about to type it out and write code to answer the question.

 

Screen Shot 2017-10-26 at 9.31.25 PM.png

 

 

Super User
Posts: 23,342

Re: How to eliminate "duplicate" records based on a date criteria (not created)

The IF condition needs to be tweaked to account for missing as well, i.e. not missing(days_diff)

New Contributor
Posts: 4

Re: How to eliminate "duplicate" records based on a date criteria (not created)

Thank you - and sorry! I have edited my post now.

I will try your suggestion!

New Contributor
Posts: 4

Re: How to eliminate "duplicate" records based on a date criteria (not created)

 

Hi Reeza,

 

 

Could you please explain the "if first.g then .. "? I think I am missing something.

I was able to get rid of some observations, but it seems to be deleting more than I want.

This is what I have tried:

 

data csas.mrsanew3;
set csas.mrsa2;

by HRN cdate source oxacillin erythromycin clindamycin tetracycline;

days_diff = dif(cdate); /* Calculating the difference between days */

if days_diff <=42 then delete; /* if difference between days less or equal to 42 then delete */

run;

 

Previously I have used a b c d e f g as examples but:

a=HRN b=cdate c=source d=oxacillin e=erythromycin e=clindamycin g=tetracycline

 

I have 53 variables in the dataset but I am only interested in eliminating observations based on:

- if the values in HRN, source, oxacillin, erythromycin, clindamycin and tetracycline are same for the observations AND

- if the difference between the collection dates (cdate) are less than 42 days in these observations

 

Thank you so much for your help!

Super User
Posts: 23,342

Re: How to eliminate "duplicate" records based on a date criteria (not created)

First.G indicates if this is a new group of (A C D E F G) or if it's the same group. If FIRST.G is 1 then it's a new group. 

 

Here's the reference for BY Group processing in SAS:

http://documentation.sas.com/?docsetId=lrcon&docsetTarget=p0tq11jtmrhsd4n1co4p5tu1fbsi.htm&docsetVer...

 

This is one of the most powerful features in SAS, so you should take some time to understand how it works.

 

My explanation:

Look at your data. What happens to the DIF variable when it's not the same group? You're currently keeping it and using it in your logic, whereas my code had set it to missing. I didn't account for missing, but did mention that you'd need to. You currently are not doing that, I don't know why you ignored that suggestion but you need to include it. If you incorporate those suggestions in your code it works on your example data.

 

If it doesn't, post your examples where it doesn't work and include an update of your sample data and your code with details on how it's not working. Not working doesn't tell us anything, it could be a missing semicolon or logic flaw. 

 

 

 

 

 

 

 

 

 

New Contributor
Posts: 4

Re: How to eliminate "duplicate" records based on a date criteria (not created)

[ Edited ]

Thank you so much for your quick response! It's been a really long time I haven't used SAS which is why I feel so lost.

I understand from your response I have to:

- Set groups using first and last variables

- Set dif variable to . when it's not the same group

- Account for missing values

 

I will try this again!! Thank you

Super User
Posts: 23,342

Re: How to eliminate "duplicate" records based on a date criteria (not created)

In the future, please post your data as a data step as well. 

I'm posting this here incase anyone else needs it or if you have future issues to help debug.

 

data have;
infile cards dlm=',' dsd;
informat a $10. b yymmdd10. c d e f g $10.;
format b date9.;
input a $ b c d e f g;
cards;
SL00106714, 2013-08-26,	TOE,	R,	S,	S,	S
SL00106714, 2013-11-12,	TOE,	R,	S,	S,	S
SL00106723, 2013-04-17,	NARE,	 ,	 ,	 ,	 
SL00106723, 2013-04-17,	NARE,	 ,	 ,	 ,	 
SL00106739, 2013-04-14,	THIGH,	R,	S,	S,	S
SL00106781, 2010-05-04,	SKIN,	S,	S,	S,	 
SL00106781, 2010-05-04,	TOE,	S,	S,	S,	 
SL00106781, 2012-11-25,	CHIN,	R,	R,	S,	S
SL00106781, 2013-01-07,	BLOOD,	R,	R,	S,	 
SL00106781, 2013-01-07,	BLOOD,	R,	R,	S,	 
SL00106781, 2013-02-01,	BLOOD,	R,	R,	S,	S
SL00106781, 2013-05-22,	ELBOW,	R,	R,	S,	S
SL00106781, 2013-07-26,	ELBOW,	R,	R,	S,	S
;
run;
Respected Advisor
Posts: 4,692

Re: How to eliminate "duplicate" records based on a date criteria (not created)

[ Edited ]

@djeon

If you follow @Reeza advice and post in the future sample data as a working data step then you've got also a much better chance that someone is motivated to provide some code to you.

 

I've made in below code the assumption that if you have a case with multiple records which are less than 42 days apart from the previous one, you still would want to keep the records with more than 42 days difference - i.e. if record 1,2 and 3 have less than 42 days difference to the previous one BUT records 1 and 3 have a difference greater 42 days then keep record 1 AND 3. 

data have;
  infile cards dlm=',' dsd;
  informat a $10. b yymmdd10. c d e f g $10.;
  format b date9.;
  input a $ b c d e f g;
  cards;
SL00106714, 2013-08-26, TOE,  R,  S,  S,  S
SL00106714, 2013-11-12, TOE,  R,  S,  S,  S
SL00106723, 2013-04-17, NARE,  ,   ,   ,   
SL00106723, 2013-04-17, NARE,  ,   ,   ,   
SL00106739, 2013-04-14, THIGH,  R,  S,  S,  S
SL00106781, 2010-05-04, SKIN, S,  S,  S,   
SL00106781, 2010-05-04, TOE,  S,  S,  S,   
SL00106781, 2012-11-25, CHIN, R,  R,  S,  S
SL00106781, 2013-01-07, BLOOD,  R,  R,  S,   
SL00106781, 2013-01-07, BLOOD,  R,  R,  S,   
SL00106781, 2013-02-01, BLOOD,  R,  R,  S,  S
SL00106781, 2013-05-22, ELBOW,  R,  R,  S,  S
SL00106781, 2013-07-26, ELBOW,  R,  R,  S,  S
;
run;

proc sort data=have out=inter;
  by a c d e f g b;
run;

data want;
  set inter;
  by a c d e f g b;
  
  format _lag_b _last_kept_b date9.;
  retain _last_kept_b;
  _lag_b=lag(b);

  if first.g then
    do;
      _last_kept_b=b;
    end;
  else
    do;
      if missing(_last_kept_b) then _last_kept_b=b;
      if 0<=(b - _lag_b)<=42 and 0<=(b - _last_kept_b)<=42 then delete;
      else _last_kept_b=b;
    end;

run;

N.B: The LAG() and DIF() function must execute in every single iteration of a data step so never use them within a conditional code block. That's the reason I couldn't use DIF() in my code logic.

Ask a Question
Discussion stats
  • 8 replies
  • 143 views
  • 5 likes
  • 3 in conversation