Duplicate by matching 2 or more variables with a condition on another ...

mayasak · Posted 05-01-2022 10:30 PM

Hi,

I need to deduplicate a dataset by two or more variables: "SSN", "name", "DOB".... with a condition on another variable, "specimen_collection_date" with a difference between the two dates of more than 60 days.

I'm using proc summary, but I still have to add the condition on the "specimen_collection_date" variable.

proc summary data=want ;

class SSN DOB Name;

id SSN DOB Name;
output out=want_without_DupRecs;
run ;

Please, advise.

Thank you

mkeintz · Posted 05-01-2022 11:36 PM

You could sort by SSN/DOB/NAME/collection_date.

Then you could run a data step looking for matches on SSN/DOB/NAME and delete all instances with gaps of 60 days or less from the most recent "approved" observation. Something like:

proc sort data=have out=have_sorted;
  by ssn dob name collection_date;
run;

data want (drop=_:);
  ** set have   -- replace this statement with next line **;
  set have_sorted;
  by ssn dob name;

  retain _maxdate;
  if first.name then call missing(_maxdate);

  if collection_date-60<=_maxdate then delete;
  _maxdate=collection_date;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mayasak · Posted 05-02-2022 09:56 AM

Hi mkeintz,

Thank you for the quick reply. I tried the code but it removed the observations which had missing values for the collection_date such as Minnie Mouse obs. What is needed is as follows:

Remove the 4th obs with collection_date 5/11/2021 > 60 days than the first collection_date of 12/21/2019.

SSN	DOB	Disease	Name	Collection_Date
123456789	1/1/1900	Flu	John Doe	4/4/2019
123456789	1/1/1900	Flu	John Doe	4/5/2019
223344556	2/2/1901	Flu	Mickey Mouse	12/21/2019
~~223344556~~	~~2/2/1901~~	~~Flu~~	~~Mickie Mouse~~	~~5/11/2021~~
765432199	2/3/1903	Flu	Minnie Mouse

Thank you

mkeintz · Posted 05-02-2022 10:50 AM

So now you need to refine the selection rules to accommodate the possibility of missing collection date.

How do you want to treat instances of missing value in collection_date? You've already implied that you want to keep them. Should they enter into your subsequent evaluation of date gaps?

And to get the most helpful suggestions, it would be strategic to provide sample data in the form of a working DATA step, as well as the expected output. By doing this, you might be more likely to notice a missing value in collection_date, and provide guidance as to how you want to process it. That was not previously apparent.

The other benefit to a sample data step is that respondents can actually test any suggested code. That's what I try to do.

Also, your sample data seems to be already ordered by collection_date (except when collection_date is missing). Is this the actual condition of your data?

Additional questions:

Also, when reading your original post I thought you wanted to remove instances with gaps of LESS THAN 60 days. But your example suggests the opposite. Do you want to remove instances of MORE THAN 60 days? So what is the rule when you have two separate date ranges of less than 60 days. I.e., what if your have a given individuals with collection dates on DAYS 1, 5 and DAYS 101,105. These dates have two gaps of 4 days and 1 gap of 96 days.

Are you just saying to keep the earliest date plus all dates within 60 days of it (plus missing dates)?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mayasak · Posted 05-04-2022 10:21 PM

Hi mkeintz,

You have a very eye-opening discussion. I really appreciate it. Here are my answers to the questions you asked:

How do you want to treat instances of missing value in collection_date? You've already implied that you want to keep them. Should they enter into your subsequent evaluation of date gaps?

If a collection date is missing I'm gonna use Episode_date or result_date instead. If all data are missing, the case should be deleted.

Also, your sample data seems to be already ordered by collection_date (except when collection_date is missing). Is this the actual condition of your data?

No, that is not the actual condition of the data.

Also, when reading your original post I thought you wanted to remove instances with gaps of LESS THAN 60 days. But your example suggests the opposite. Do you want to remove instances of MORE THAN 60 days?

Are you just saying to keep the earliest date plus all dates within 60 days of it (plus missing dates)?

Sorry, for the confusion. My fault.

The rules for removing data are:

The same patient with two different diseases -no matter what collection_date is, both events will be retained.
Same patient with the same disease and different collection_date, the first event will be retained according to the date difference between the two "collection_date" and the "disease": For example, C-Auris is one event for the whole life, so for any difference, only the first case would be retained. For Acinetobacter, on the other hand, if the difference is more than 60 days, both events will be retained (if less the second event will be deleted). For CRE, the difference is more than 90 days instead of 60.
If a patient has 3 collection dates, the first event will be retained. The second event will be retained according to the disease rule above. The third event will be retained if its collection_date difference from the second is more than 45 or 60 days, if less, it will be removed.

So what is the rule when you have two separate date ranges of less than 60 days. I.e., what if your have a given individuals with collection dates on DAYS 1, 5 and DAYS 101,105. These dates have two gaps of 4 days and 1 gap of 96 da.ys.

In this case, only events on days 1 and 105 will be retained.

I've attached an excel sheet with a sample of the data. Not sure if this is what you asked for!

Thank you

Kurt_Bremser · Posted 05-05-2022 01:11 AM

When you provide a csv file, you must also provide the code to read it.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mkeintz · Posted 05-02-2022 03:10 PM

I notice in the sample table you provided that you have "Mickey Mouse" followed by "Mickie Mouse". Is that a typographical error on your part? Or is this to be considered a "near match" subject to the date range constraints? If so, then you'll need to provide matching rules for SSN/DOB/Name that would have to precede consideration of the 60 day rule.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mayasak · Posted 05-06-2022 06:00 PM

Hi mkeintz,

Yes, it is a typographical error on my part. But, thinking of what you've mentioned about having matching rules, it definitely should be considered in my second step of merging these data with other sources of data.

For the data sample that I've attached, I noticed that the dates are all formatted in a weird way. I tried entering data using datalines but I had similar issues as well! Is there a specific way to enter dates in datalines?

Thank you,

Razina

Duplicate by matching 2 or more variables with a condition on another variable.

Re: Duplicate by matching 2 or more variables with a condition on another variable.

Re: Duplicate by matching 2 or more variables with a condition on another variable.

Re: Duplicate by matching 2 or more variables with a condition on another variable.

Re: Duplicate by matching 2 or more variables with a condition on another variable.

Re: Duplicate by matching 2 or more variables with a condition on another variable.

Re: Duplicate by matching 2 or more variables with a condition on another variable.

Re: Duplicate by matching 2 or more variables with a condition on another variable.

Registration is open

Registration is open

SAS Training: Just a Click Away