Hello everyone,
I am trying to get rid of some duplicate records based on:
1. If A, C, D, E, F, G is same AND
2. The difference between B is lower than 42 days
| A | B | C | D | E | F | G | 
| SL00106714 | 2013-08-26 | TOE | R | S | S | S | 
| SL00106714 | 2013-11-12 | TOE | R | S | S | S | 
| SL00106723 | 2013-04-17 | NARE | ||||
| SL00106723 | 2013-04-17 | NARE | ||||
| SL00106739 | 2013-04-14 | THIGH | R | S | S | S | 
| SL00106781 | 2010-05-04 | SKIN | S | S | S | |
| SL00106781 | 2010-05-04 | TOE | S | S | S | |
| SL00106781 | 2012-11-25 | CHIN | R | R | S | S | 
| SL00106781 | 2013-01-07 | BLOOD | R | R | S | |
| SL00106781 | 2013-01-07 | BLOOD | R | R | S | |
| SL00106781 | 2013-02-01 | BLOOD | R | R | S | S | 
| SL00106781 | 2013-05-22 | ELBOW | R | R | S | S | 
| SL00106781 | 2013-07-26 | ELBOW | R | R | S | S | 
For example, for records between row 7 and 14, I would want to eliminate row 11
Is there a way to do this?
Thank you very much in advance!
1. Use DIF() to calculate the difference between dates
2. Use BY group processing to identify your groups
untested, but some thing like this should work.
Please post your data as text in the future, not an image. I'm not about to type it out and write code to answer the question.
The IF condition needs to be tweaked to account for missing as well, i.e. not missing(days_diff)
Thank you - and sorry! I have edited my post now.
I will try your suggestion!
Hi Reeza,
Could you please explain the "if first.g then .. "? I think I am missing something.
I was able to get rid of some observations, but it seems to be deleting more than I want.
This is what I have tried:
data csas.mrsanew3;
set csas.mrsa2;
by HRN cdate source oxacillin erythromycin clindamycin tetracycline;
days_diff = dif(cdate); /* Calculating the difference between days */
if days_diff <=42 then delete; /* if difference between days less or equal to 42 then delete */
run;
Previously I have used a b c d e f g as examples but:
a=HRN b=cdate c=source d=oxacillin e=erythromycin e=clindamycin g=tetracycline
I have 53 variables in the dataset but I am only interested in eliminating observations based on:
- if the values in HRN, source, oxacillin, erythromycin, clindamycin and tetracycline are same for the observations AND
- if the difference between the collection dates (cdate) are less than 42 days in these observations
Thank you so much for your help!
First.G indicates if this is a new group of (A C D E F G) or if it's the same group. If FIRST.G is 1 then it's a new group.
Here's the reference for BY Group processing in SAS:
This is one of the most powerful features in SAS, so you should take some time to understand how it works.
My explanation:
Look at your data. What happens to the DIF variable when it's not the same group? You're currently keeping it and using it in your logic, whereas my code had set it to missing. I didn't account for missing, but did mention that you'd need to. You currently are not doing that, I don't know why you ignored that suggestion but you need to include it. If you incorporate those suggestions in your code it works on your example data.
If it doesn't, post your examples where it doesn't work and include an update of your sample data and your code with details on how it's not working. Not working doesn't tell us anything, it could be a missing semicolon or logic flaw.
Thank you so much for your quick response! It's been a really long time I haven't used SAS which is why I feel so lost.
I understand from your response I have to:
- Set groups using first and last variables
- Set dif variable to . when it's not the same group
- Account for missing values
I will try this again!! Thank you
In the future, please post your data as a data step as well.
I'm posting this here incase anyone else needs it or if you have future issues to help debug.
data have;
infile cards dlm=',' dsd;
informat a $10. b yymmdd10. c d e f g $10.;
format b date9.;
input a $ b c d e f g;
cards;
SL00106714, 2013-08-26,	TOE,	R,	S,	S,	S
SL00106714, 2013-11-12,	TOE,	R,	S,	S,	S
SL00106723, 2013-04-17,	NARE,	 ,	 ,	 ,	 
SL00106723, 2013-04-17,	NARE,	 ,	 ,	 ,	 
SL00106739, 2013-04-14,	THIGH,	R,	S,	S,	S
SL00106781, 2010-05-04,	SKIN,	S,	S,	S,	 
SL00106781, 2010-05-04,	TOE,	S,	S,	S,	 
SL00106781, 2012-11-25,	CHIN,	R,	R,	S,	S
SL00106781, 2013-01-07,	BLOOD,	R,	R,	S,	 
SL00106781, 2013-01-07,	BLOOD,	R,	R,	S,	 
SL00106781, 2013-02-01,	BLOOD,	R,	R,	S,	S
SL00106781, 2013-05-22,	ELBOW,	R,	R,	S,	S
SL00106781, 2013-07-26,	ELBOW,	R,	R,	S,	S
;
run;If you follow @Reeza advice and post in the future sample data as a working data step then you've got also a much better chance that someone is motivated to provide some code to you.
I've made in below code the assumption that if you have a case with multiple records which are less than 42 days apart from the previous one, you still would want to keep the records with more than 42 days difference - i.e. if record 1,2 and 3 have less than 42 days difference to the previous one BUT records 1 and 3 have a difference greater 42 days then keep record 1 AND 3.
data have;
  infile cards dlm=',' dsd;
  informat a $10. b yymmdd10. c d e f g $10.;
  format b date9.;
  input a $ b c d e f g;
  cards;
SL00106714, 2013-08-26, TOE,  R,  S,  S,  S
SL00106714, 2013-11-12, TOE,  R,  S,  S,  S
SL00106723, 2013-04-17, NARE,  ,   ,   ,   
SL00106723, 2013-04-17, NARE,  ,   ,   ,   
SL00106739, 2013-04-14, THIGH,  R,  S,  S,  S
SL00106781, 2010-05-04, SKIN, S,  S,  S,   
SL00106781, 2010-05-04, TOE,  S,  S,  S,   
SL00106781, 2012-11-25, CHIN, R,  R,  S,  S
SL00106781, 2013-01-07, BLOOD,  R,  R,  S,   
SL00106781, 2013-01-07, BLOOD,  R,  R,  S,   
SL00106781, 2013-02-01, BLOOD,  R,  R,  S,  S
SL00106781, 2013-05-22, ELBOW,  R,  R,  S,  S
SL00106781, 2013-07-26, ELBOW,  R,  R,  S,  S
;
run;
proc sort data=have out=inter;
  by a c d e f g b;
run;
data want;
  set inter;
  by a c d e f g b;
  
  format _lag_b _last_kept_b date9.;
  retain _last_kept_b;
  _lag_b=lag(b);
  if first.g then
    do;
      _last_kept_b=b;
    end;
  else
    do;
      if missing(_last_kept_b) then _last_kept_b=b;
      if 0<=(b - _lag_b)<=42 and 0<=(b - _last_kept_b)<=42 then delete;
      else _last_kept_b=b;
    end;
run;
N.B: The LAG() and DIF() function must execute in every single iteration of a data step so never use them within a conditional code block. That's the reason I couldn't use DIF() in my code logic.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
