Hi,
I need to deduplicate a dataset by two or more variables: "SSN", "name", "DOB".... with a condition on another variable, "specimen_collection_date" with a difference between the two dates of more than 60 days.
I'm using proc summary, but I still have to add the condition on the "specimen_collection_date" variable.
proc summary data=want ;
class SSN DOB Name;
id SSN DOB Name;
output out=want_without_DupRecs;
run ;
Please, advise.
Thank you
You could sort by SSN/DOB/NAME/collection_date.
Then you could run a data step looking for matches on SSN/DOB/NAME and delete all instances with gaps of 60 days or less from the most recent "approved" observation. Something like:
proc sort data=have out=have_sorted;
by ssn dob name collection_date;
run;
data want (drop=_:);
** set have -- replace this statement with next line **;
set have_sorted;
by ssn dob name;
retain _maxdate;
if first.name then call missing(_maxdate);
if collection_date-60<=_maxdate then delete;
_maxdate=collection_date;
run;
Hi mkeintz,
Thank you for the quick reply. I tried the code but it removed the observations which had missing values for the collection_date such as Minnie Mouse obs. What is needed is as follows:
Remove the 4th obs with collection_date 5/11/2021 > 60 days than the first collection_date of 12/21/2019.
SSN | DOB | Disease | Name | Collection_Date |
123456789 | 1/1/1900 | Flu | John Doe | 4/4/2019 |
123456789 | 1/1/1900 | Flu | John Doe | 4/5/2019 |
223344556 | 2/2/1901 | Flu | Mickey Mouse | 12/21/2019 |
765432199 | 2/3/1903 | Flu | Minnie Mouse |
Thank you
So now you need to refine the selection rules to accommodate the possibility of missing collection date.
How do you want to treat instances of missing value in collection_date? You've already implied that you want to keep them. Should they enter into your subsequent evaluation of date gaps?
And to get the most helpful suggestions, it would be strategic to provide sample data in the form of a working DATA step, as well as the expected output. By doing this, you might be more likely to notice a missing value in collection_date, and provide guidance as to how you want to process it. That was not previously apparent.
The other benefit to a sample data step is that respondents can actually test any suggested code. That's what I try to do.
Also, your sample data seems to be already ordered by collection_date (except when collection_date is missing). Is this the actual condition of your data?
Additional questions:
Also, when reading your original post I thought you wanted to remove instances with gaps of LESS THAN 60 days. But your example suggests the opposite. Do you want to remove instances of MORE THAN 60 days? So what is the rule when you have two separate date ranges of less than 60 days. I.e., what if your have a given individuals with collection dates on DAYS 1, 5 and DAYS 101,105. These dates have two gaps of 4 days and 1 gap of 96 days.
Are you just saying to keep the earliest date plus all dates within 60 days of it (plus missing dates)?
Hi mkeintz,
You have a very eye-opening discussion. I really appreciate it. Here are my answers to the questions you asked:
How do you want to treat instances of missing value in collection_date? You've already implied that you want to keep them. Should they enter into your subsequent evaluation of date gaps?
If a collection date is missing I'm gonna use Episode_date or result_date instead. If all data are missing, the case should be deleted.
Also, your sample data seems to be already ordered by collection_date (except when collection_date is missing). Is this the actual condition of your data?
No, that is not the actual condition of the data.
Also, when reading your original post I thought you wanted to remove instances with gaps of LESS THAN 60 days. But your example suggests the opposite. Do you want to remove instances of MORE THAN 60 days?
Are you just saying to keep the earliest date plus all dates within 60 days of it (plus missing dates)?
Sorry, for the confusion. My fault.
The rules for removing data are:
So what is the rule when you have two separate date ranges of less than 60 days. I.e., what if your have a given individuals with collection dates on DAYS 1, 5 and DAYS 101,105. These dates have two gaps of 4 days and 1 gap of 96 da.ys.
In this case, only events on days 1 and 105 will be retained.
I've attached an excel sheet with a sample of the data. Not sure if this is what you asked for!
Thank you
When you provide a csv file, you must also provide the code to read it.
I notice in the sample table you provided that you have "Mickey Mouse" followed by "Mickie Mouse". Is that a typographical error on your part? Or is this to be considered a "near match" subject to the date range constraints? If so, then you'll need to provide matching rules for SSN/DOB/Name that would have to precede consideration of the 60 day rule.
Hi mkeintz,
Yes, it is a typographical error on my part. But, thinking of what you've mentioned about having matching rules, it definitely should be considered in my second step of merging these data with other sources of data.
For the data sample that I've attached, I noticed that the dates are all formatted in a weird way. I tried entering data using datalines but I had similar issues as well! Is there a specific way to enter dates in datalines?
Thank you,
Razina
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.