I have a data set (sample below). I need to remove duplicates with a condition on specimen_date. If specimen_date difference between observations for the same person was less than 12 months then remove, else keep.
data have;
input Id sex $ first_name $ last_name $ DOB :mmddyy10. specimen_Date :mmddyy10.;
format DOB mmddyy10. specimen_date mmddyy10.;
datalines;
123 M Mickey Mouse 01/08/1961 01/01/2018
123 M Mickey Mouse 01/08/1961 09/02/2020
322 F Minnie Mouse 02/12/1956 08/06/2019
344 M Donald Duck 02/12/1956 03/06/2020
344 M Donald Duck 02/12/1956 03/08/2020
323 M Donald Duck 02/12/1956 01/06/2020
323 M Daffy Duck 07/01/1993 09/06/2020
325 M Daffy Duck 07/01/1993 05/06/2020
333 F Betty Boop 09/01/1993 03/30/2020
167 F Betty Boop 09/01/1993 03/06/2019
245 F Betty Boop 09/01/1993 04/30/2020
167 F Betty Boop 09/01/1993 11/03/2021
344 M . . 02/12/1956 03/09/2020
344 M . . 02/12/1956 04/09/2020
;;;;
run;
I used this code but I did not get the intended data:
proc sort data=have out=have_sorted;
by sex dob first_name last_name specimen_date;
run;
data want (drop=_:);
set have_sorted;
by sex dob first_name last_name;
retain _maxdate;
if first.dob then call missing(_maxdate);
if specimen_date-360<=_maxdate then delete;
_maxdate=specimen_date;
run;
This is what I'm supposed to get but I'm not getting the bolded ones:
123 M Mickey Mouse 01/08/1961 01/01/2018
123 M Mickey Mouse 01/08/1961 09/02/2020
322 F Minnie Mouse 02/12/1956 08/06/2019
344 M Donald Duck 02/12/1956 01/06/2020
325 M Daffy Duck 07/01/1993 05/06/2020
333 F Betty Boop 09/01/1993 09/31/2020
167 F Betty Boop 09/01/1993 03/06/2019
167 F Betty Boop 09/01/1993 11/03/2021
344 M . . 02/12/1956 03/09/2020
Any help much appreciated.
Thank you
I would suggest adding put statements into the code, and reviewing the SAS log to follow the data flow/logic
344 M Donald Duck 02/12/1956 01/06/2020
The if specimen_date-360<=_maxdate then delete; condition is true for this record so it is deleted
first_name=Donald last_name=Duck Id=323 DOB=02/12/1956 specimen_Date=01/06/2020
specimen_Date=21920 specimen_Date=01/06/2020
_maxdate =21983 _maxdate =09/03/2020
167 F Betty Boop 09/01/1993 11/03/2021
This is in the want dataset when I ran your code
Can you describe the logic for selecting/deleting the observations , as I'm not sure I really understand what you are attempting to achieve
Hi AMSAS,
Thank you first for your reply.
This is what I get when I run the code (below):
What I wanted exactly to keep the first observation for a person ( same dob, first_name, and last_name) then:
- remove all other obs with specimen_date less than the first (previous) specimen_date by 12 months (1 year).
- keep the obs with specimen date different than the previous specimen date by more than 12 months (1 year).
So 323 Donald Duck with 01/06/2020 should stay as the one first specimen date and the rest should be removed because both have specimen dates less than 12 months difference from the first obs (03/06/2020 and 03/08/2020). * I noticed this happening when there is one or more obs with the same dob (same for Donald Duck) and missing names such as :
230 M . . 02/12/1956 03/09/2020
344 M . . 02/12/1956 04/09/2020
For Betty, you're right it should not show since it has a difference of less than one year from the previous obs. My mistake, sorry about that.
Thank you,
322 | F | Minnie | Mouse | 02/12/1956 | 08/06/2019 |
167 | F | Betty | Boop | 09/01/1993 | 03/06/2019 |
333 | F | Betty | Boop | 09/01/1993 | 03/30/2020 |
230 | M | 02/12/1956 | 03/09/2020 | ||
123 | M | Mickey | Mouse | 01/08/1961 | 01/01/2018 |
123 | M | Mickey | Mouse | 01/08/1961 | 09/02/2020 |
325 | M | Daffy | Duck | 07/01/1993 | 05/06/2020 |
First a generic comment about this code:
if specimen_date-360<=_maxdate then delete;
360 days is not 12 months. So you need to decide do you want 12 months or 360 days for a comparison. The INTNX function is used to increment dates by intervals like month and year with the 'S' parameter to do actual calendar dates.
IF you want 360 days you might look at the YRDIF function as it has two different 360 day "year" options
if yrdif(specimen_date, maxdate, 'ACT/360') >1 then delete;
or
if yrdif(specimen_date, maxdate, '30/360') >1 then delete;
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.