10-26-2017 10:51 PM - edited 10-26-2017 11:38 PM
I am trying to get rid of some duplicate records based on:
1. If A, C, D, E, F, G is same AND
2. The difference between B is lower than 42 days
For example, for records between row 7 and 14, I would want to eliminate row 11
Is there a way to do this?
Thank you very much in advance!
10-26-2017 11:32 PM
1. Use DIF() to calculate the difference between dates
2. Use BY group processing to identify your groups
untested, but some thing like this should work.
Please post your data as text in the future, not an image. I'm not about to type it out and write code to answer the question.
10-28-2017 02:59 PM
Could you please explain the "if first.g then .. "? I think I am missing something.
I was able to get rid of some observations, but it seems to be deleting more than I want.
This is what I have tried:
by HRN cdate source oxacillin erythromycin clindamycin tetracycline;
days_diff = dif(cdate); /* Calculating the difference between days */
if days_diff <=42 then delete; /* if difference between days less or equal to 42 then delete */
Previously I have used a b c d e f g as examples but:
a=HRN b=cdate c=source d=oxacillin e=erythromycin e=clindamycin g=tetracycline
I have 53 variables in the dataset but I am only interested in eliminating observations based on:
- if the values in HRN, source, oxacillin, erythromycin, clindamycin and tetracycline are same for the observations AND
- if the difference between the collection dates (cdate) are less than 42 days in these observations
Thank you so much for your help!
10-28-2017 03:26 PM
First.G indicates if this is a new group of (A C D E F G) or if it's the same group. If FIRST.G is 1 then it's a new group.
Here's the reference for BY Group processing in SAS:
This is one of the most powerful features in SAS, so you should take some time to understand how it works.
Look at your data. What happens to the DIF variable when it's not the same group? You're currently keeping it and using it in your logic, whereas my code had set it to missing. I didn't account for missing, but did mention that you'd need to. You currently are not doing that, I don't know why you ignored that suggestion but you need to include it. If you incorporate those suggestions in your code it works on your example data.
If it doesn't, post your examples where it doesn't work and include an update of your sample data and your code with details on how it's not working. Not working doesn't tell us anything, it could be a missing semicolon or logic flaw.
10-28-2017 03:33 PM - edited 10-28-2017 03:34 PM
Thank you so much for your quick response! It's been a really long time I haven't used SAS which is why I feel so lost.
I understand from your response I have to:
- Set groups using first and last variables
- Set dif variable to . when it's not the same group
- Account for missing values
I will try this again!! Thank you
10-28-2017 03:39 PM
In the future, please post your data as a data step as well.
I'm posting this here incase anyone else needs it or if you have future issues to help debug.
data have; infile cards dlm=',' dsd; informat a $10. b yymmdd10. c d e f g $10.; format b date9.; input a $ b c d e f g; cards; SL00106714, 2013-08-26, TOE, R, S, S, S SL00106714, 2013-11-12, TOE, R, S, S, S SL00106723, 2013-04-17, NARE, , , , SL00106723, 2013-04-17, NARE, , , , SL00106739, 2013-04-14, THIGH, R, S, S, S SL00106781, 2010-05-04, SKIN, S, S, S, SL00106781, 2010-05-04, TOE, S, S, S, SL00106781, 2012-11-25, CHIN, R, R, S, S SL00106781, 2013-01-07, BLOOD, R, R, S, SL00106781, 2013-01-07, BLOOD, R, R, S, SL00106781, 2013-02-01, BLOOD, R, R, S, S SL00106781, 2013-05-22, ELBOW, R, R, S, S SL00106781, 2013-07-26, ELBOW, R, R, S, S ; run;
10-28-2017 07:45 PM - edited 10-28-2017 07:53 PM
If you follow @Reeza advice and post in the future sample data as a working data step then you've got also a much better chance that someone is motivated to provide some code to you.
I've made in below code the assumption that if you have a case with multiple records which are less than 42 days apart from the previous one, you still would want to keep the records with more than 42 days difference - i.e. if record 1,2 and 3 have less than 42 days difference to the previous one BUT records 1 and 3 have a difference greater 42 days then keep record 1 AND 3.
data have; infile cards dlm=',' dsd; informat a $10. b yymmdd10. c d e f g $10.; format b date9.; input a $ b c d e f g; cards; SL00106714, 2013-08-26, TOE, R, S, S, S SL00106714, 2013-11-12, TOE, R, S, S, S SL00106723, 2013-04-17, NARE, , , , SL00106723, 2013-04-17, NARE, , , , SL00106739, 2013-04-14, THIGH, R, S, S, S SL00106781, 2010-05-04, SKIN, S, S, S, SL00106781, 2010-05-04, TOE, S, S, S, SL00106781, 2012-11-25, CHIN, R, R, S, S SL00106781, 2013-01-07, BLOOD, R, R, S, SL00106781, 2013-01-07, BLOOD, R, R, S, SL00106781, 2013-02-01, BLOOD, R, R, S, S SL00106781, 2013-05-22, ELBOW, R, R, S, S SL00106781, 2013-07-26, ELBOW, R, R, S, S ; run; proc sort data=have out=inter; by a c d e f g b; run; data want; set inter; by a c d e f g b; format _lag_b _last_kept_b date9.; retain _last_kept_b; _lag_b=lag(b); if first.g then do; _last_kept_b=b; end; else do; if missing(_last_kept_b) then _last_kept_b=b; if 0<=(b - _lag_b)<=42 and 0<=(b - _last_kept_b)<=42 then delete; else _last_kept_b=b; end; run;
N.B: The LAG() and DIF() function must execute in every single iteration of a data step so never use them within a conditional code block. That's the reason I couldn't use DIF() in my code logic.