BookmarkSubscribeRSS Feed
djeon
Fluorite | Level 6

Hello everyone,

 

I am trying to get rid of some duplicate records based on:

1. If A, C, D, E, F, G is same AND

2. The difference between B is lower than 42 days 

 

 

 

ABCDEFG
SL001067142013-08-26TOERSSS
SL001067142013-11-12TOERSSS
SL001067232013-04-17NARE    
SL001067232013-04-17NARE    
SL001067392013-04-14THIGHRSSS
SL001067812010-05-04SKINSSS 
SL001067812010-05-04TOESSS 
SL001067812012-11-25CHINRRSS
SL001067812013-01-07BLOODRRS 
SL001067812013-01-07BLOODRRS 
SL001067812013-02-01BLOODRRSS
SL001067812013-05-22ELBOWRRSS
SL001067812013-07-26ELBOWRRSS

 

 

 

For example, for records between row 7 and 14, I would want to eliminate row 11

Is there a way to do this?

 

Thank you very much in advance!

8 REPLIES 8
Reeza
Super User

1. Use DIF() to calculate the difference between dates

2. Use BY group processing to identify your groups

 

untested, but some thing like this should work. 

 

Please post your data as text in the future, not an image. I'm not about to type it out and write code to answer the question.

 

Screen Shot 2017-10-26 at 9.31.25 PM.png

 

 

Reeza
Super User

The IF condition needs to be tweaked to account for missing as well, i.e. not missing(days_diff)

djeon
Fluorite | Level 6

Thank you - and sorry! I have edited my post now.

I will try your suggestion!

djeon
Fluorite | Level 6

 

Hi Reeza,

 

 

Could you please explain the "if first.g then .. "? I think I am missing something.

I was able to get rid of some observations, but it seems to be deleting more than I want.

This is what I have tried:

 

data csas.mrsanew3;
set csas.mrsa2;

by HRN cdate source oxacillin erythromycin clindamycin tetracycline;

days_diff = dif(cdate); /* Calculating the difference between days */

if days_diff <=42 then delete; /* if difference between days less or equal to 42 then delete */

run;

 

Previously I have used a b c d e f g as examples but:

a=HRN b=cdate c=source d=oxacillin e=erythromycin e=clindamycin g=tetracycline

 

I have 53 variables in the dataset but I am only interested in eliminating observations based on:

- if the values in HRN, source, oxacillin, erythromycin, clindamycin and tetracycline are same for the observations AND

- if the difference between the collection dates (cdate) are less than 42 days in these observations

 

Thank you so much for your help!

Reeza
Super User

First.G indicates if this is a new group of (A C D E F G) or if it's the same group. If FIRST.G is 1 then it's a new group. 

 

Here's the reference for BY Group processing in SAS:

http://documentation.sas.com/?docsetId=lrcon&docsetTarget=p0tq11jtmrhsd4n1co4p5tu1fbsi.htm&docsetVer...

 

This is one of the most powerful features in SAS, so you should take some time to understand how it works.

 

My explanation:

Look at your data. What happens to the DIF variable when it's not the same group? You're currently keeping it and using it in your logic, whereas my code had set it to missing. I didn't account for missing, but did mention that you'd need to. You currently are not doing that, I don't know why you ignored that suggestion but you need to include it. If you incorporate those suggestions in your code it works on your example data.

 

If it doesn't, post your examples where it doesn't work and include an update of your sample data and your code with details on how it's not working. Not working doesn't tell us anything, it could be a missing semicolon or logic flaw. 

 

 

 

 

 

 

 

 

 

djeon
Fluorite | Level 6

Thank you so much for your quick response! It's been a really long time I haven't used SAS which is why I feel so lost.

I understand from your response I have to:

- Set groups using first and last variables

- Set dif variable to . when it's not the same group

- Account for missing values

 

I will try this again!! Thank you

Reeza
Super User

In the future, please post your data as a data step as well. 

I'm posting this here incase anyone else needs it or if you have future issues to help debug.

 

data have;
infile cards dlm=',' dsd;
informat a $10. b yymmdd10. c d e f g $10.;
format b date9.;
input a $ b c d e f g;
cards;
SL00106714, 2013-08-26,	TOE,	R,	S,	S,	S
SL00106714, 2013-11-12,	TOE,	R,	S,	S,	S
SL00106723, 2013-04-17,	NARE,	 ,	 ,	 ,	 
SL00106723, 2013-04-17,	NARE,	 ,	 ,	 ,	 
SL00106739, 2013-04-14,	THIGH,	R,	S,	S,	S
SL00106781, 2010-05-04,	SKIN,	S,	S,	S,	 
SL00106781, 2010-05-04,	TOE,	S,	S,	S,	 
SL00106781, 2012-11-25,	CHIN,	R,	R,	S,	S
SL00106781, 2013-01-07,	BLOOD,	R,	R,	S,	 
SL00106781, 2013-01-07,	BLOOD,	R,	R,	S,	 
SL00106781, 2013-02-01,	BLOOD,	R,	R,	S,	S
SL00106781, 2013-05-22,	ELBOW,	R,	R,	S,	S
SL00106781, 2013-07-26,	ELBOW,	R,	R,	S,	S
;
run;
Patrick
Opal | Level 21

@djeon

If you follow @Reeza advice and post in the future sample data as a working data step then you've got also a much better chance that someone is motivated to provide some code to you.

 

I've made in below code the assumption that if you have a case with multiple records which are less than 42 days apart from the previous one, you still would want to keep the records with more than 42 days difference - i.e. if record 1,2 and 3 have less than 42 days difference to the previous one BUT records 1 and 3 have a difference greater 42 days then keep record 1 AND 3. 

data have;
  infile cards dlm=',' dsd;
  informat a $10. b yymmdd10. c d e f g $10.;
  format b date9.;
  input a $ b c d e f g;
  cards;
SL00106714, 2013-08-26, TOE,  R,  S,  S,  S
SL00106714, 2013-11-12, TOE,  R,  S,  S,  S
SL00106723, 2013-04-17, NARE,  ,   ,   ,   
SL00106723, 2013-04-17, NARE,  ,   ,   ,   
SL00106739, 2013-04-14, THIGH,  R,  S,  S,  S
SL00106781, 2010-05-04, SKIN, S,  S,  S,   
SL00106781, 2010-05-04, TOE,  S,  S,  S,   
SL00106781, 2012-11-25, CHIN, R,  R,  S,  S
SL00106781, 2013-01-07, BLOOD,  R,  R,  S,   
SL00106781, 2013-01-07, BLOOD,  R,  R,  S,   
SL00106781, 2013-02-01, BLOOD,  R,  R,  S,  S
SL00106781, 2013-05-22, ELBOW,  R,  R,  S,  S
SL00106781, 2013-07-26, ELBOW,  R,  R,  S,  S
;
run;

proc sort data=have out=inter;
  by a c d e f g b;
run;

data want;
  set inter;
  by a c d e f g b;
  
  format _lag_b _last_kept_b date9.;
  retain _last_kept_b;
  _lag_b=lag(b);

  if first.g then
    do;
      _last_kept_b=b;
    end;
  else
    do;
      if missing(_last_kept_b) then _last_kept_b=b;
      if 0<=(b - _lag_b)<=42 and 0<=(b - _last_kept_b)<=42 then delete;
      else _last_kept_b=b;
    end;

run;

N.B: The LAG() and DIF() function must execute in every single iteration of a data step so never use them within a conditional code block. That's the reason I couldn't use DIF() in my code logic.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1011 views
  • 5 likes
  • 3 in conversation