<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Duplicate by matching 2 or more variables with a condition on another variable. in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Duplicate-by-matching-2-or-more-variables-with-a-condition-on/m-p/811067#M319905</link>
    <description>&lt;P&gt;Hi mkeintz,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you for the quick reply. I tried the code but it removed the observations which had missing values for the collection_date such as Minnie Mouse obs. What is needed is as follows:&amp;nbsp;&lt;/P&gt;&lt;P&gt;Remove the 4th obs with collection_date 5/11/2021 &amp;gt; 60 days than the first&amp;nbsp; collection_date of 12/21/2019.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;SSN&lt;/TD&gt;&lt;TD&gt;DOB&lt;/TD&gt;&lt;TD&gt;Disease&lt;/TD&gt;&lt;TD&gt;Name&lt;/TD&gt;&lt;TD&gt;Collection_Date&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;123456789&lt;/TD&gt;&lt;TD&gt;1/1/1900&lt;/TD&gt;&lt;TD&gt;Flu&lt;/TD&gt;&lt;TD&gt;John Doe&lt;/TD&gt;&lt;TD&gt;4/4/2019&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;123456789&lt;/TD&gt;&lt;TD&gt;1/1/1900&lt;/TD&gt;&lt;TD&gt;Flu&lt;/TD&gt;&lt;TD&gt;John Doe&lt;/TD&gt;&lt;TD&gt;4/5/2019&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;223344556&lt;/TD&gt;&lt;TD&gt;2/2/1901&lt;/TD&gt;&lt;TD&gt;Flu&lt;/TD&gt;&lt;TD&gt;Mickey Mouse&lt;/TD&gt;&lt;TD&gt;12/21/2019&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;STRIKE&gt;223344556&lt;/STRIKE&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRIKE&gt;2/2/1901&lt;/STRIKE&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRIKE&gt;Flu&lt;/STRIKE&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRIKE&gt;Mickie Mouse&lt;/STRIKE&gt;&lt;/TD&gt;&lt;TD&gt;&lt;FONT color="#000000"&gt;&lt;STRIKE&gt;5/11/2021&lt;/STRIKE&gt;&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;765432199&lt;/TD&gt;&lt;TD&gt;2/3/1903&lt;/TD&gt;&lt;TD&gt;Flu&lt;/TD&gt;&lt;TD&gt;Minnie Mouse&lt;/TD&gt;&lt;TD&gt;&amp;nbsp;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you&lt;/P&gt;</description>
    <pubDate>Mon, 02 May 2022 13:56:07 GMT</pubDate>
    <dc:creator>mayasak</dc:creator>
    <dc:date>2022-05-02T13:56:07Z</dc:date>
    <item>
      <title>Duplicate by matching 2 or more variables with a condition on another variable.</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Duplicate-by-matching-2-or-more-variables-with-a-condition-on/m-p/810986#M319854</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I need to deduplicate a dataset by two or more variables: "SSN", "name", "DOB".... with a condition on another variable, "specimen_collection_date" with a difference between the two dates of more than 60 days.&lt;/P&gt;&lt;P&gt;I'm using proc summary, but I still have to add the condition on the&amp;nbsp;"specimen_collection_date" variable.&lt;/P&gt;&lt;P&gt;proc summary data=want&amp;nbsp;;&lt;/P&gt;&lt;P&gt;class SSN DOB Name;&lt;/P&gt;&lt;P&gt;id SSN DOB Name;&lt;BR /&gt;output out=want_without_DupRecs;&lt;BR /&gt;run ;&lt;/P&gt;&lt;P&gt;Please, advise.&lt;/P&gt;&lt;P&gt;Thank you&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 02 May 2022 02:30:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Duplicate-by-matching-2-or-more-variables-with-a-condition-on/m-p/810986#M319854</guid>
      <dc:creator>mayasak</dc:creator>
      <dc:date>2022-05-02T02:30:29Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate by matching 2 or more variables with a condition on another variable.</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Duplicate-by-matching-2-or-more-variables-with-a-condition-on/m-p/810990#M319857</link>
      <description>&lt;P&gt;You could sort by SSN/DOB/NAME/collection_date.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Then you could run a data step looking for matches on SSN/DOB/NAME and delete all instances with gaps of 60 days or less from the most recent "approved" observation.&amp;nbsp; Something like:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=have out=have_sorted;
  by ssn dob name collection_date;
run;

data want (drop=_:);
  ** set have   -- replace this statement with next line **;
  set have_sorted;
  by ssn dob name;

  retain _maxdate;
  if first.name then call missing(_maxdate);

  if collection_date-60&amp;lt;=_maxdate then delete;
  _maxdate=collection_date;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 02 May 2022 15:01:14 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Duplicate-by-matching-2-or-more-variables-with-a-condition-on/m-p/810990#M319857</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2022-05-02T15:01:14Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate by matching 2 or more variables with a condition on another variable.</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Duplicate-by-matching-2-or-more-variables-with-a-condition-on/m-p/811067#M319905</link>
      <description>&lt;P&gt;Hi mkeintz,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you for the quick reply. I tried the code but it removed the observations which had missing values for the collection_date such as Minnie Mouse obs. What is needed is as follows:&amp;nbsp;&lt;/P&gt;&lt;P&gt;Remove the 4th obs with collection_date 5/11/2021 &amp;gt; 60 days than the first&amp;nbsp; collection_date of 12/21/2019.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;SSN&lt;/TD&gt;&lt;TD&gt;DOB&lt;/TD&gt;&lt;TD&gt;Disease&lt;/TD&gt;&lt;TD&gt;Name&lt;/TD&gt;&lt;TD&gt;Collection_Date&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;123456789&lt;/TD&gt;&lt;TD&gt;1/1/1900&lt;/TD&gt;&lt;TD&gt;Flu&lt;/TD&gt;&lt;TD&gt;John Doe&lt;/TD&gt;&lt;TD&gt;4/4/2019&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;123456789&lt;/TD&gt;&lt;TD&gt;1/1/1900&lt;/TD&gt;&lt;TD&gt;Flu&lt;/TD&gt;&lt;TD&gt;John Doe&lt;/TD&gt;&lt;TD&gt;4/5/2019&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;223344556&lt;/TD&gt;&lt;TD&gt;2/2/1901&lt;/TD&gt;&lt;TD&gt;Flu&lt;/TD&gt;&lt;TD&gt;Mickey Mouse&lt;/TD&gt;&lt;TD&gt;12/21/2019&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;STRIKE&gt;223344556&lt;/STRIKE&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRIKE&gt;2/2/1901&lt;/STRIKE&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRIKE&gt;Flu&lt;/STRIKE&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRIKE&gt;Mickie Mouse&lt;/STRIKE&gt;&lt;/TD&gt;&lt;TD&gt;&lt;FONT color="#000000"&gt;&lt;STRIKE&gt;5/11/2021&lt;/STRIKE&gt;&lt;/FONT&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;765432199&lt;/TD&gt;&lt;TD&gt;2/3/1903&lt;/TD&gt;&lt;TD&gt;Flu&lt;/TD&gt;&lt;TD&gt;Minnie Mouse&lt;/TD&gt;&lt;TD&gt;&amp;nbsp;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you&lt;/P&gt;</description>
      <pubDate>Mon, 02 May 2022 13:56:07 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Duplicate-by-matching-2-or-more-variables-with-a-condition-on/m-p/811067#M319905</guid>
      <dc:creator>mayasak</dc:creator>
      <dc:date>2022-05-02T13:56:07Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate by matching 2 or more variables with a condition on another variable.</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Duplicate-by-matching-2-or-more-variables-with-a-condition-on/m-p/811077#M319912</link>
      <description>&lt;P&gt;So now you need to refine the selection rules to accommodate the possibility of missing collection date.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;How do you want to treat instances of missing value in collection_date?&amp;nbsp; You've already implied that you want to keep them.&amp;nbsp; Should they enter into your subsequent evaluation of date gaps?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;And to get the most helpful suggestions, it would be strategic to provide sample data in the form of a working DATA step, as well as the expected output.&amp;nbsp; By doing this, you might be more likely to notice a missing value in collection_date, and provide guidance as to how you want to process it.&amp;nbsp; That was not previously apparent.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The other benefit to a sample data step is that respondents can actually test any suggested code.&amp;nbsp; That's what I try to do.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Also, your sample data seems to be already ordered by collection_date (except when collection_date is missing).&amp;nbsp; Is this the actual condition of your data?&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Additional questions:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Also, when reading your original post I thought you wanted to remove instances with gaps of LESS THAN 60 days.&amp;nbsp; But your example suggests the opposite.&amp;nbsp; Do you want to remove instances of MORE THAN 60 days?&amp;nbsp; So what is the rule when you have two separate date ranges of less than 60 days.&amp;nbsp; I.e., what if your have a given individuals with collection dates&amp;nbsp; &amp;nbsp;on DAYS 1, 5&amp;nbsp; and DAYS 101,105. These dates have two gaps of 4 days and 1 gap of 96 days.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Are you just saying to keep the earliest date plus all dates within 60 days of it (plus missing dates)?&lt;/P&gt;</description>
      <pubDate>Mon, 02 May 2022 14:59:25 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Duplicate-by-matching-2-or-more-variables-with-a-condition-on/m-p/811077#M319912</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2022-05-02T14:59:25Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate by matching 2 or more variables with a condition on another variable.</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Duplicate-by-matching-2-or-more-variables-with-a-condition-on/m-p/811123#M319927</link>
      <description>&lt;P&gt;I notice in the sample table you provided that you have "Mickey Mouse" followed by "Mickie Mouse".&amp;nbsp; Is that a typographical error on your part?&amp;nbsp; Or is this to be considered a "near match" subject to the date range constraints?&amp;nbsp; If so, then you'll need to provide matching rules for SSN/DOB/Name that would have to precede consideration of the 60 day rule.&lt;/P&gt;</description>
      <pubDate>Mon, 02 May 2022 19:10:10 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Duplicate-by-matching-2-or-more-variables-with-a-condition-on/m-p/811123#M319927</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2022-05-02T19:10:10Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate by matching 2 or more variables with a condition on another variable.</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Duplicate-by-matching-2-or-more-variables-with-a-condition-on/m-p/811598#M320133</link>
      <description>&lt;P&gt;Hi mkeintz,&lt;/P&gt;&lt;P&gt;You have a very eye-opening discussion. I really appreciate it. Here are my answers to the questions you asked:&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;How do you want to treat instances of missing value in collection_date?&amp;nbsp; You've already implied that you want to keep them.&amp;nbsp; Should they enter into your subsequent evaluation of date gaps?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#FF0000"&gt;&lt;SPAN&gt;If a collection date is missing I'm gonna use Episode_date or result_date instead. If all data are missing, the case should be deleted.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#000000"&gt;&lt;SPAN&gt;Also, your sample data seems to be already ordered by collection_date (except when collection_date is missing).&amp;nbsp; Is this the actual condition of your data?&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#FF0000"&gt;&lt;SPAN&gt;No, that is not the actual condition of the data.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;Also, when reading your original post I thought you wanted to remove instances with gaps of LESS THAN 60 days.&amp;nbsp; But your example suggests the opposite.&amp;nbsp; Do you want to remove instances of MORE THAN 60 days?&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Are you just saying to keep the earliest date plus all dates within 60 days of it (plus missing dates)?&lt;/P&gt;&lt;P&gt;&lt;FONT color="#FF0000"&gt;&lt;SPAN&gt;Sorry, for the confusion. My fault.&amp;nbsp;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#FF0000"&gt;&lt;SPAN&gt;The rules for removing data are:&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;FONT color="#FF0000"&gt;&lt;SPAN&gt;&lt;FONT color="#FF0000"&gt;The same patient with two different diseases -no matter what collection_date is, both events will be retained.&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT color="#FF0000"&gt;&lt;SPAN&gt;&lt;FONT color="#FF0000"&gt;Same patient with the same disease and different collection_date, the first event will be retained according to the date difference between the two "collection_date" and the "disease":&amp;nbsp;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;FONT color="#FF0000"&gt;&lt;SPAN&gt;&lt;FONT color="#FF0000"&gt;For example, &lt;/FONT&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;FONT color="#FF0000"&gt;&lt;SPAN&gt;&lt;FONT color="#FF0000"&gt;&lt;U&gt;C-Auris&lt;/U&gt; is one event for the whole life, so for any difference, &lt;U&gt;only&lt;/U&gt; the first case would be retained. For&lt;U&gt; Acinetobacter&lt;/U&gt;, on the other hand, if the difference is more than 60 days, both events will be retained (if less the second event will be deleted). For &lt;U&gt;CRE&lt;/U&gt;, the difference is more than 90 days instead of 60.&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT color="#FF0000"&gt;&lt;SPAN&gt;&lt;FONT color="#FF0000"&gt;If a patient has 3 collection dates, the first event will be retained. The second event will be retained according to the disease rule above. The third event will be retained if its collection_date difference from the second is more than 45 or 60 days, if less, it will be removed.&amp;nbsp;&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;FONT color="#000000"&gt;&lt;SPAN&gt;So what is the rule when you have two separate date ranges of less than 60 days.&amp;nbsp; I.e., what if your have a given individuals with collection dates&amp;nbsp; &amp;nbsp;on DAYS 1, 5&amp;nbsp; and DAYS 101,105. These dates have two gaps of 4 days and 1 gap of 96 da&lt;/SPAN&gt;&lt;/FONT&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;SPAN&gt;ys.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#FF0000"&gt;&lt;SPAN&gt;In this case, only events on days 1 and 105 will be retained.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#000000"&gt;&lt;SPAN&gt;I've attached an excel sheet with a sample of the data. Not sure if this is what you asked for!&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#000000"&gt;&lt;SPAN&gt;Thank you&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 05 May 2022 02:21:05 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Duplicate-by-matching-2-or-more-variables-with-a-condition-on/m-p/811598#M320133</guid>
      <dc:creator>mayasak</dc:creator>
      <dc:date>2022-05-05T02:21:05Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate by matching 2 or more variables with a condition on another variable.</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Duplicate-by-matching-2-or-more-variables-with-a-condition-on/m-p/811604#M320138</link>
      <description>&lt;P&gt;When you provide a csv file, you must also provide the code to read it.&lt;/P&gt;</description>
      <pubDate>Thu, 05 May 2022 05:11:25 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Duplicate-by-matching-2-or-more-variables-with-a-condition-on/m-p/811604#M320138</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2022-05-05T05:11:25Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate by matching 2 or more variables with a condition on another variable.</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Duplicate-by-matching-2-or-more-variables-with-a-condition-on/m-p/811945#M320319</link>
      <description>&lt;P&gt;Hi mkeintz,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Yes, it is a&amp;nbsp;&lt;SPAN&gt;typographical error on my part. But, thinking of what you've mentioned about having matching rules, it definitely should be considered in my second step of merging these data with other sources of data.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;For the data sample that I've attached, I noticed that the dates are all formatted in a weird way. I tried entering data using datalines but I had similar issues as well! Is there a specific way to enter dates in datalines?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Thank you,&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Razina&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 06 May 2022 22:00:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Duplicate-by-matching-2-or-more-variables-with-a-condition-on/m-p/811945#M320319</guid>
      <dc:creator>mayasak</dc:creator>
      <dc:date>2022-05-06T22:00:19Z</dc:date>
    </item>
  </channel>
</rss>

