BookmarkSubscribeRSS Feed
mayasak
Quartz | Level 8

Hi,

I need to deduplicate a dataset by two or more variables: "SSN", "name", "DOB".... with a condition on another variable, "specimen_collection_date" with a difference between the two dates of more than 60 days.

I'm using proc summary, but I still have to add the condition on the "specimen_collection_date" variable.

proc summary data=want ;

class SSN DOB Name;

id SSN DOB Name;
output out=want_without_DupRecs;
run ;

Please, advise.

Thank you

7 REPLIES 7
mkeintz
PROC Star

You could sort by SSN/DOB/NAME/collection_date.

 

Then you could run a data step looking for matches on SSN/DOB/NAME and delete all instances with gaps of 60 days or less from the most recent "approved" observation.  Something like:

 

proc sort data=have out=have_sorted;
  by ssn dob name collection_date;
run;

data want (drop=_:);
  ** set have   -- replace this statement with next line **;
  set have_sorted;
  by ssn dob name;

  retain _maxdate;
  if first.name then call missing(_maxdate);

  if collection_date-60<=_maxdate then delete;
  _maxdate=collection_date;
run;

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mayasak
Quartz | Level 8

Hi mkeintz,

 

Thank you for the quick reply. I tried the code but it removed the observations which had missing values for the collection_date such as Minnie Mouse obs. What is needed is as follows: 

Remove the 4th obs with collection_date 5/11/2021 > 60 days than the first  collection_date of 12/21/2019.

 

SSNDOBDiseaseNameCollection_Date
1234567891/1/1900FluJohn Doe4/4/2019
1234567891/1/1900FluJohn Doe4/5/2019
2233445562/2/1901FluMickey Mouse12/21/2019
2233445562/2/1901FluMickie Mouse5/11/2021
7654321992/3/1903FluMinnie Mouse 

 

Thank you

mkeintz
PROC Star

So now you need to refine the selection rules to accommodate the possibility of missing collection date.

 

How do you want to treat instances of missing value in collection_date?  You've already implied that you want to keep them.  Should they enter into your subsequent evaluation of date gaps?

 

And to get the most helpful suggestions, it would be strategic to provide sample data in the form of a working DATA step, as well as the expected output.  By doing this, you might be more likely to notice a missing value in collection_date, and provide guidance as to how you want to process it.  That was not previously apparent.

 

The other benefit to a sample data step is that respondents can actually test any suggested code.  That's what I try to do.

 

Also, your sample data seems to be already ordered by collection_date (except when collection_date is missing).  Is this the actual condition of your data?  

 

 

Additional questions:

 

Also, when reading your original post I thought you wanted to remove instances with gaps of LESS THAN 60 days.  But your example suggests the opposite.  Do you want to remove instances of MORE THAN 60 days?  So what is the rule when you have two separate date ranges of less than 60 days.  I.e., what if your have a given individuals with collection dates   on DAYS 1, 5  and DAYS 101,105. These dates have two gaps of 4 days and 1 gap of 96 days.

 

Are you just saying to keep the earliest date plus all dates within 60 days of it (plus missing dates)?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mayasak
Quartz | Level 8

Hi mkeintz,

You have a very eye-opening discussion. I really appreciate it. Here are my answers to the questions you asked:

How do you want to treat instances of missing value in collection_date?  You've already implied that you want to keep them.  Should they enter into your subsequent evaluation of date gaps?

If a collection date is missing I'm gonna use Episode_date or result_date instead. If all data are missing, the case should be deleted.

Also, your sample data seems to be already ordered by collection_date (except when collection_date is missing).  Is this the actual condition of your data?  

No, that is not the actual condition of the data.

Also, when reading your original post I thought you wanted to remove instances with gaps of LESS THAN 60 days.  But your example suggests the opposite.  Do you want to remove instances of MORE THAN 60 days?  

 

Are you just saying to keep the earliest date plus all dates within 60 days of it (plus missing dates)?

Sorry, for the confusion. My fault. 

The rules for removing data are:

  • The same patient with two different diseases -no matter what collection_date is, both events will be retained.
  • Same patient with the same disease and different collection_date, the first event will be retained according to the date difference between the two "collection_date" and the "disease": For example, C-Auris is one event for the whole life, so for any difference, only the first case would be retained. For Acinetobacter, on the other hand, if the difference is more than 60 days, both events will be retained (if less the second event will be deleted). For CRE, the difference is more than 90 days instead of 60.
  • If a patient has 3 collection dates, the first event will be retained. The second event will be retained according to the disease rule above. The third event will be retained if its collection_date difference from the second is more than 45 or 60 days, if less, it will be removed. 

So what is the rule when you have two separate date ranges of less than 60 days.  I.e., what if your have a given individuals with collection dates   on DAYS 1, 5  and DAYS 101,105. These dates have two gaps of 4 days and 1 gap of 96 da.ys.

In this case, only events on days 1 and 105 will be retained.

I've attached an excel sheet with a sample of the data. Not sure if this is what you asked for!

Thank you

 

 

mkeintz
PROC Star

I notice in the sample table you provided that you have "Mickey Mouse" followed by "Mickie Mouse".  Is that a typographical error on your part?  Or is this to be considered a "near match" subject to the date range constraints?  If so, then you'll need to provide matching rules for SSN/DOB/Name that would have to precede consideration of the 60 day rule.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mayasak
Quartz | Level 8

Hi mkeintz,

 

Yes, it is a typographical error on my part. But, thinking of what you've mentioned about having matching rules, it definitely should be considered in my second step of merging these data with other sources of data.

For the data sample that I've attached, I noticed that the dates are all formatted in a weird way. I tried entering data using datalines but I had similar issues as well! Is there a specific way to enter dates in datalines?

 

Thank you,

Razina

hackathon24-white-horiz.png

The 2025 SAS Hackathon Kicks Off on June 11!

Watch the live Hackathon Kickoff to get all the essential information about the SAS Hackathon—including how to join, how to participate, and expert tips for success.

YouTube LinkedIn

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 1707 views
  • 0 likes
  • 3 in conversation