Hi,
In the following dataset I want to remove duplicates based on several conditions:
data have;
input Id Disease $ DOB :mmddyy10. Date :mmddyy10.;
format DOB mmddyy10. date mmddyy10.;
datalines;
123 Auris 01/08/1961 01/01/2018
123 CRE 01/08/1961 09/02/2020
344 CRE 02/12/1956 08/06/2019
344 CRE 02/12/1956 03/06/2020
344 CRE 02/12/1956 03/03/2022
323 CRE 07/01/1993 01/06/2019
323 CRE 07/01/1993 09/06/2020
323 CRE 07/01/1993 09/31/2020
167 Acino 12/09/2001 03/06/2019
167 Acino 12/09/2001 04/31/2020
167 Acino 12/09/2001 09/03/2021
912 CRE 03/01/2012 03/03/2018
912 CRE 03/01/2012 05/06/2019
912 CRE 03/01/2012 09/06/2020
256 Auris 05/27/1983 08/05/2020
256 Auris 05/27/1983 12/07/2020
256 Auris 05/27/1983 10/07/2021
256 Auris 05/27/1983 02/07/2022
317 Acino 07/17/1985 12/07/2018
317 Acino 07/17/1985 01/03/2018
409 CRE 08/07/1987 03/03/2018
409 CRE 08/07/1987 05/06/2019
409 CRE 08/07/1987 09/06/2019
409 CRE 08/07/1987 10/06/2021
;;;;
run;
The result should be as this:
123 Auris 01/08/1961 01/01/2018
123 CRE 01/08/1961 09/02/2020
344 CRE 02/12/1956 08/06/2019
344 CRE 02/12/1956 03/03/2022
323 CRE 07/01/1993 01/06/2019
323 CRE 07/01/1993 09/31/2020
167 Acino 12/09/2001 03/06/2019
167 Acino 12/09/2001 09/03/2021
912 CRE 03/01/2012 03/03/2018
912 CRE 03/01/2012 05/06/2019
912 CRE 03/01/2012 09/06/2020
256 Auris 05/27/1983 08/05/2020
317 Acino 07/17/1985 12/07/2018
317 Acino 07/17/1985 01/03/2018
409 CRE 08/07/1987 03/03/2018
409 CRE 08/07/1987 05/06/2019
409 CRE 08/07/1987 10/06/2021
Please advise,
Thank you
Thanks for writing a DATA step to generate dataset HAVE. But did you run it and read your log? The data has two invalid dates: 9/31/2020 and 4/31/2020).
And to clarify "keep all obs that have a date difference of more than 12 months". Do you mean gaps of more than three months between successive records for a CRE/DOB group? Or do you mean more than 12 months from the DATE of the first record?
What I meant is obs with "date" var. which is the testing date. So if the first test was done on 01/02/2019 it will be the first event. Any other test/s for CRE before 01/02/2020 should be removed and the first test will be kept only. If this same person had more tests after 01/02/2020, such as 02/07/2021 obs will be kept. If he had another one on 04/08/2021, we should keep this last test and keep 02/07/2021.
Regarding your comment above that I put in bold italics - for ID 323 disease CRE your data has
323 CRE 07/01/1993 01/06/2019
323 CRE 07/01/1993 09/06/2020
323 CRE 07/01/1993 09/30/2020, corrected from 09/31/2020
which based on your description suggests to me that you would keep the first date (01/06/2019) and the second and third dates, both of which fall after the one year mark (01/06/2020).
But your example expected result only has 01/06/2019 and 09/30/2020 (corrected from 09/31/2020). Missing is 09/06/2020. So is the expected result wrong, or is my understanding of your selection rule in error?
Assuming in the case of disease='CRE' you want the first observation and all observations more than a year later, then:
data want (where=(disease='CRE') drop=_:);
set have;
by id disease dob notsorted;
retain _cutoff_date;
if first.dob then do;
if disease='CRE' then _cutoff_date=intnx('year',date,1,'sameday');
else _cutoff_date='31dec9999'd;
end;
if (disease='Auris' and first.dob=1)
or (disease='Acino' and (first.dob=1 or last.dob=1))
or (disease='CRE' and (first.dob=1 or date>_cutoff_date));
run;
This assumes that data for a given ID/DISEASE/DOB are grouped into consecutive observations.
Edit note: It also assumes that within a given ID/DISEASE/DOB group, the data are sorted by DATE.
Hi mkeintz,
Thank you for your reply. I ran the code but it did not work as intended. The result deleted all obs for diseases (Acino and Auris) other than CRE. Also for CRE, there are some obs that should've been deleted but they're still in the result (bold italic).
Thank you
This is what I got as a result:
123 | CRE | 01/08/1961 | 09/02/2020 |
344 | CRE | 02/12/1956 | 08/06/2019 |
344 | CRE | 02/12/1956 | 03/03/2022 |
323 | CRE | 07/01/1993 | 01/06/2019 |
323 | CRE | 07/01/1993 | 09/06/2020 |
323 | CRE | 07/01/1993 | 09/30/2020 |
912 | CRE | 03/01/2012 | 03/03/2018 |
912 | CRE | 03/01/2012 | 05/06/2019 |
912 | CRE | 03/01/2012 | 09/06/2020 |
409 | CRE | 08/07/1987 | 03/03/2018 |
409 | CRE | 08/07/1987 | 05/06/2019 |
409 | CRE | 08/07/1987 | 09/06/2019 |
409 | CRE | 08/07/1987 | 10/06/2021 |
Please take another look at my response.
I said
Assuming in the case of disease='CRE' you want the first observation and all observations more than a year later, then:
So my code was specitically referring to CRE ("in the case of disease='CRE"). Notice in the first statement (the DATA statement), there is a filter to keep CRE data only. Just remove that filter.
Second, I stated, based on your previous comments, that for CRE "you want the first observation and all observations more than a year later". If that was wrong, then please restate your criterion for CRE cases, which I clearly do not understand. Pointing out the observations that you don't want doesn't help me, since they fit perfectly my stated understanding.
Hi mkeintz,
Thank you so much for the help. I got what you said about the CRE filtering. For the criterion, sorry for not making it clear. The criteria are as follows:
Hope this is clear now.
Thank you.
Are you looking to count distinct incidences of disease? Is the one year window form the start of the incident? Or from the previous observation? For example if you had one record per month for 2 years is that one incident or two?
Hi Tom,
Hi Tom,
The one-year window is from the previous obs (test). For example,
Thank you.
Hi,
How about working on deleting the obs using 2 steps. Meaning removing or retaining the second obs depending on the first step coding and then removing or retaining the third obs depending on a second step coding.
Thank you.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.