BookmarkSubscribeRSS Feed
mayasak
Quartz | Level 8

Hi,

In the following dataset I want to remove duplicates based on several conditions:

  • For "disease" = Auris, keep only first observation if "id" and "DOB" are identical.
  • For "disease" = Acino keep first and last obs if "id" and "DOB" are identical.
  • For "disease" = CRE, if "id" and "DOB" are identical, keep all obs that have a date difference of more than 12 months, else keep the first obs and delete obs with < 12 months difference.

data have;
input Id Disease $ DOB :mmddyy10. Date :mmddyy10.;
format DOB mmddyy10. date mmddyy10.;
datalines;
123 Auris 01/08/1961 01/01/2018
123 CRE 01/08/1961 09/02/2020
344 CRE 02/12/1956 08/06/2019
344 CRE 02/12/1956 03/06/2020
344 CRE 02/12/1956 03/03/2022
323 CRE 07/01/1993 01/06/2019
323 CRE 07/01/1993 09/06/2020
323 CRE 07/01/1993 09/31/2020
167 Acino 12/09/2001 03/06/2019
167 Acino 12/09/2001 04/31/2020
167 Acino 12/09/2001 09/03/2021

912 CRE 03/01/2012 03/03/2018
912 CRE 03/01/2012 05/06/2019
912 CRE 03/01/2012 09/06/2020
256 Auris 05/27/1983 08/05/2020

256 Auris 05/27/1983 12/07/2020

256 Auris 05/27/1983 10/07/2021
256 Auris 05/27/1983 02/07/2022
317 Acino 07/17/1985 12/07/2018
317 Acino 07/17/1985 01/03/2018

409 CRE 08/07/1987 03/03/2018
409 CRE 08/07/1987 05/06/2019
409 CRE 08/07/1987 09/06/2019

409 CRE 08/07/1987 10/06/2021
;;;;
run;

 

The result should be as this:

 

123 Auris 01/08/1961 01/01/2018
123 CRE 01/08/1961 09/02/2020

344 CRE 02/12/1956 08/06/2019
344 CRE 02/12/1956 03/03/2022
323 CRE 07/01/1993 01/06/2019
323 CRE 07/01/1993 09/31/2020
167 Acino 12/09/2001 03/06/2019
167 Acino 12/09/2001 09/03/2021

912 CRE 03/01/2012 03/03/2018
912 CRE 03/01/2012 05/06/2019
912 CRE 03/01/2012 09/06/2020

256 Auris 05/27/1983 08/05/2020

317 Acino 07/17/1985 12/07/2018
317 Acino 07/17/1985 01/03/2018

409 CRE 08/07/1987 03/03/2018
409 CRE 08/07/1987 05/06/2019
409 CRE 08/07/1987 10/06/2021

 

Please advise,

Thank you

 

10 REPLIES 10
mkeintz
PROC Star

Thanks for writing a DATA step to generate dataset HAVE.  But did you run it and read your log?  The data has two invalid dates: 9/31/2020 and 4/31/2020).

 

And to clarify "keep all obs that have a date difference of more than 12 months".  Do you mean gaps of more than three months between successive records for a CRE/DOB group?  Or do you mean more than 12 months from the DATE of the first record?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mayasak
Quartz | Level 8
Thank you for your reply. Yes, I noticed the dates. It should be 04/30/2020 and 09/30/2020, sorry about that.
What I meant is obs with "date" var. which is the testing date. So if the first test was done on 01/02/2019 it will be the first event. Any other test/s for CRE before 01/02/2020 should be removed and the first test will be kept only. If this same person had more tests after 01/02/2020, such as 02/07/2021 obs will be kept. If he had another one on 04/08/2021, we should keep this last test and keep 02/07/2021. Hope this clarify 🙂
mkeintz
PROC Star

What I meant is obs with "date" var. which is the testing date. So if the first test was done on 01/02/2019 it will be the first event. Any other test/s for CRE before 01/02/2020 should be removed and the first test will be kept only. If this same person had more tests after 01/02/2020, such as 02/07/2021 obs will be kept. If he had another one on 04/08/2021, we should keep this last test and keep 02/07/2021. 

Regarding your comment above that I put in bold italics - for ID 323 disease CRE your data has

323 CRE 07/01/1993 01/06/2019
323 CRE 07/01/1993 09/06/2020
323 CRE 07/01/1993 09/30/2020,  corrected from 09/31/2020

 

which based on your description suggests to me that you would keep the first date (01/06/2019) and the second and third dates, both of which fall after the one year mark (01/06/2020).

 

But your example expected result only has 01/06/2019 and 09/30/2020 (corrected from 09/31/2020).  Missing is 09/06/2020.  So is the expected result wrong, or is my understanding of your selection rule in error?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mkeintz
PROC Star

Assuming in the case of disease='CRE' you want the first observation and all observations more than a year later, then:

 

data want  (where=(disease='CRE')  drop=_:);
  set have;
  by id disease dob notsorted;
  retain _cutoff_date;
  if first.dob then do;
     if disease='CRE' then _cutoff_date=intnx('year',date,1,'sameday');
     else _cutoff_date='31dec9999'd;
  end;

  if (disease='Auris' and first.dob=1)
  or (disease='Acino' and (first.dob=1 or last.dob=1))
  or (disease='CRE' and (first.dob=1 or date>_cutoff_date));
run;

This assumes that data for a given ID/DISEASE/DOB are grouped into consecutive observations.

 

Edit note: It also assumes that within a given ID/DISEASE/DOB group, the data are sorted by DATE.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mayasak
Quartz | Level 8

Hi mkeintz,

Thank you for your reply. I ran the code but it did not work as intended. The result deleted all obs for diseases (Acino and Auris) other than CRE. Also for CRE, there are some obs that should've been deleted but they're still in the result (bold italic).

Thank you

This is what I got as a result:

Obs Id Disease DOB Date12345678910111213
123CRE01/08/196109/02/2020
344CRE02/12/195608/06/2019
344CRE02/12/195603/03/2022
323CRE07/01/199301/06/2019
323CRE07/01/199309/06/2020
323CRE07/01/199309/30/2020
912CRE03/01/201203/03/2018
912CRE03/01/201205/06/2019
912CRE03/01/201209/06/2020
409CRE08/07/198703/03/2018
409CRE08/07/198705/06/2019
409CRE08/07/198709/06/2019
409CRE08/07/198710/06/2021

 

mkeintz
PROC Star

Please take another look at my response.

 

I said

Assuming in the case of disease='CRE' you want the first observation and all observations more than a year later, then:

So my code was specitically referring to CRE  ("in the case of disease='CRE").  Notice in the first statement (the DATA statement), there is a filter to keep CRE data only.   Just remove that filter.

 

Second, I stated, based on your previous comments, that for CRE "you want the first observation and all observations more than a year later".  If that was wrong, then please restate your criterion for CRE cases, which I clearly do not understand.   Pointing out the observations that you don't want doesn't help me, since they fit perfectly my stated understanding.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mayasak
Quartz | Level 8

Hi mkeintz,

Thank you so much for the help. I got what you said about the CRE filtering. For the criterion, sorry for not making it clear. The criteria are as follows:

  • if we have 2 obs: we keep the first obs and remove the second if it's < 12 months from the first. else (>12 months difference) we keep it.
  • if we have 3 obs: we keep the first and remove the second and last if both are < 12 months from the first (ex 1/1/2018, 1/3/2018, 3/9/2018). Else, we keep the first and last if the last is >12 months from the first and the second is < 12 months from either the first or last such as #344. Else we keep the three of them if the second is > 12 months from the first and >12 months from the last such as # 912. 

Hope this is clear now.

Thank you.

Tom
Super User Tom
Super User

Are you looking to count distinct incidences of disease?  Is the one year window form the start of the incident? Or from the previous observation?  For example if you had one record per month for 2 years is that one incident or two?

mayasak
Quartz | Level 8

Hi Tom,

 

Hi Tom,

 

The one-year window is from the previous obs (test). For example,

  • if the person had an obs on 1/5/2019 (first, kept always).
  • All other obs that happened before 1/5/2020 should be removed.
  • If the second obs was on 6/8/2020, it should be kept if no other obs happened after it or if the third one is also 1 year apart from the second, such as on 7/9/2022.  In this case, we keep the three.
  • If the third one was on 7/9/2020, only the first (1/5/2019 and the last 7/9/2020 should be kept and the second obs 6/8/2020 should be removed.

Thank you.

mayasak
Quartz | Level 8

Hi,

How about working on deleting the obs using 2 steps. Meaning removing or retaining the second obs depending on the first step coding and then removing or retaining the third obs depending on a second step coding.

Thank you.

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 1723 views
  • 0 likes
  • 3 in conversation