<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Remove duplicate observations based on multiple criteria in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Remove-duplicate-observations-based-on-multiple-criteria/m-p/637875#M189658</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/289103"&gt;@RoddyJ&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You can sort the input dataset using an ORDER BY clause with PROC SQL (into a view) and then use BY-group processing in a DATA step to select one observation per ID. The ORDER BY clause supports a wide range of sorting rules. Here's an example implementing a hierarchical rule as suggested by your description:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc format;
invalue nulmiss
'NUL' = .;
run;

data have;
input ID var1 var2 var3 :nulmiss.;
cards;
1 1 3 11
1 2 4 12
1 2 5 NUL
2 1 5 13
2 2 5 NUL
3 1 5 NUL
3 1 4 NUL
;

proc sql;
create view _tmp as
select * from have
order by id, not missing(var3), var2=5, var1=2;
quit;

data want;
set _tmp;
by id;
if last.id;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;(I assumed that the "NUL" values are missing values in your real data.)&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Similarly, you could compute a numeric score and sort by this score -- all within the ORDER BY clause.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Edit:&lt;/P&gt;
&lt;P&gt;Explanation of the "hierarchical rule": The first criterion, "var3 not missing," has top priority (within the ID BY group), i.e., it will be met for the selected observation &lt;EM&gt;whenever possible&lt;/EM&gt;&amp;nbsp;-- even if there was another observation satisfying &lt;EM&gt;both&lt;/EM&gt; of the other two criteria, but not the first. The next criterion, "var2=5," has second priority, i.e., it will be regarded only if the first criterion alone does not determine a unique observation to be selected from a particular ID BY group. Of course, in this situation a record satisfying it would be preferred in the selection. If there are still two or more "selection candidates" in a BY group with the same status regarding the first two criteria, the third criterion, "var1=2," will be used in the decision. It's possible (but doesn't occur in your sample data) that these three criteria do not uniquely determine an observation in a BY group. The suggested code would then select an arbitrary observation from those satisfying the three criteria equally well.&amp;nbsp;(Edit 3: Needless to say, the last three items in the ORDER BY clause are Boolean expressions. Their sort order is always 0, 1 [that is: FALSE, TRUE], which in conjunction with the subsetting "&lt;FONT face="courier new,courier"&gt;if last.id;&lt;/FONT&gt;" in the DATA step implements the selection&amp;nbsp;rule as described.)&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Depending on the meaning of the data it may be appropriate to include, for example, the absolute differences &lt;FONT face="courier new,courier"&gt;abs(var2-5)&lt;/FONT&gt; and &lt;FONT face="courier new,courier"&gt;abs(var1-2)&lt;/FONT&gt;&amp;nbsp;(or squared differences, ...) in the sorting rule so that, say,&amp;nbsp;&lt;FONT face="courier new,courier"&gt;var2=5.1&lt;/FONT&gt; would be regarded as a better match than &lt;FONT face="courier new,courier"&gt;var2=9.0&lt;/FONT&gt;, everything else being the same. You would need to decide about the priorities, e.g., if &lt;FONT face="courier new,courier"&gt;var1=2&lt;/FONT&gt; (exact match in the third-priority criterion) should trump a close, but not exact match in &lt;FONT face="courier new,courier"&gt;var2&lt;/FONT&gt; (&lt;FONT face="courier new,courier"&gt;5.03&lt;/FONT&gt;, say). The implementation in PROC SQL would be fairly easy.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Edit 2: Another way to understand the hierarchical rule is to consider the "ranking" of the eight possible cases that can be distinguished by the three binary criteria. Each observation of a BY group falls into one of eight categories and an observation with the "highest" available priority (1=top, ..., 8=lowest priority) is selected.&lt;/P&gt;
&lt;PRE&gt;priority var1 var2 var3
    1      2    5  non-missing
    2     ≠2    5  non-missing
    3      2   ≠5  non-missing
    4     ≠2   ≠5  non-missing
    5      2    5  missing
    6     ≠2    5  missing
    7      2   ≠5  missing
    8     ≠2   ≠5  missing&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Mon, 06 Apr 2020 19:31:44 GMT</pubDate>
    <dc:creator>FreelanceReinh</dc:creator>
    <dc:date>2020-04-06T19:31:44Z</dc:date>
    <item>
      <title>Remove duplicate observations based on multiple criteria</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Remove-duplicate-observations-based-on-multiple-criteria/m-p/637865#M189653</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a dataset of the form:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;ID &amp;nbsp; &amp;nbsp;var1 &amp;nbsp; &amp;nbsp;var2 &amp;nbsp; &amp;nbsp;var3&lt;/P&gt;&lt;P&gt;1 &amp;nbsp; &amp;nbsp; &amp;nbsp; 1 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 3 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;11&lt;/P&gt;&lt;P&gt;1 &amp;nbsp; &amp;nbsp; &amp;nbsp; 2 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 4 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;12&lt;/P&gt;&lt;P&gt;1 &amp;nbsp; &amp;nbsp; &amp;nbsp; 2 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 5 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;NUL&lt;/P&gt;&lt;P&gt;2 &amp;nbsp; &amp;nbsp; &amp;nbsp;1 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;5 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;13&lt;/P&gt;&lt;P&gt;2 &amp;nbsp; &amp;nbsp; &amp;nbsp;2 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;5 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; NUL&lt;/P&gt;&lt;P&gt;3 &amp;nbsp; &amp;nbsp; &amp;nbsp;1 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;5 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; NUL&lt;/P&gt;&lt;P&gt;3 &amp;nbsp; &amp;nbsp; &amp;nbsp;1 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;4 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; NUL&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'd like to remove duplicate observations of ID based on multiple criteria.&lt;/P&gt;&lt;P&gt;For example: var3 should be non-null, var2 should be 5 and var1 should be 2. But if those criteria can't be met I want to keep the observation that best fits those criteria. e.g in the table given, I'd want to keep the 2nd observation with ID = 1 because var1 = 2 and var is non-null. I'd want to keep the first observation of ID=2 because var2 = 5 and var3 is non-null and I'd want to keep the first observation of ID=3 because var2 = 5.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is this possible to solve with a SAS program, or is the original dataset too messy?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks in advance for your help!&lt;/P&gt;</description>
      <pubDate>Mon, 06 Apr 2020 15:16:41 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Remove-duplicate-observations-based-on-multiple-criteria/m-p/637865#M189653</guid>
      <dc:creator>RoddyJ</dc:creator>
      <dc:date>2020-04-06T15:16:41Z</dc:date>
    </item>
    <item>
      <title>Re: Remove duplicate observations based on multiple criteria</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Remove-duplicate-observations-based-on-multiple-criteria/m-p/637875#M189658</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/289103"&gt;@RoddyJ&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You can sort the input dataset using an ORDER BY clause with PROC SQL (into a view) and then use BY-group processing in a DATA step to select one observation per ID. The ORDER BY clause supports a wide range of sorting rules. Here's an example implementing a hierarchical rule as suggested by your description:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc format;
invalue nulmiss
'NUL' = .;
run;

data have;
input ID var1 var2 var3 :nulmiss.;
cards;
1 1 3 11
1 2 4 12
1 2 5 NUL
2 1 5 13
2 2 5 NUL
3 1 5 NUL
3 1 4 NUL
;

proc sql;
create view _tmp as
select * from have
order by id, not missing(var3), var2=5, var1=2;
quit;

data want;
set _tmp;
by id;
if last.id;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;(I assumed that the "NUL" values are missing values in your real data.)&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Similarly, you could compute a numeric score and sort by this score -- all within the ORDER BY clause.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Edit:&lt;/P&gt;
&lt;P&gt;Explanation of the "hierarchical rule": The first criterion, "var3 not missing," has top priority (within the ID BY group), i.e., it will be met for the selected observation &lt;EM&gt;whenever possible&lt;/EM&gt;&amp;nbsp;-- even if there was another observation satisfying &lt;EM&gt;both&lt;/EM&gt; of the other two criteria, but not the first. The next criterion, "var2=5," has second priority, i.e., it will be regarded only if the first criterion alone does not determine a unique observation to be selected from a particular ID BY group. Of course, in this situation a record satisfying it would be preferred in the selection. If there are still two or more "selection candidates" in a BY group with the same status regarding the first two criteria, the third criterion, "var1=2," will be used in the decision. It's possible (but doesn't occur in your sample data) that these three criteria do not uniquely determine an observation in a BY group. The suggested code would then select an arbitrary observation from those satisfying the three criteria equally well.&amp;nbsp;(Edit 3: Needless to say, the last three items in the ORDER BY clause are Boolean expressions. Their sort order is always 0, 1 [that is: FALSE, TRUE], which in conjunction with the subsetting "&lt;FONT face="courier new,courier"&gt;if last.id;&lt;/FONT&gt;" in the DATA step implements the selection&amp;nbsp;rule as described.)&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Depending on the meaning of the data it may be appropriate to include, for example, the absolute differences &lt;FONT face="courier new,courier"&gt;abs(var2-5)&lt;/FONT&gt; and &lt;FONT face="courier new,courier"&gt;abs(var1-2)&lt;/FONT&gt;&amp;nbsp;(or squared differences, ...) in the sorting rule so that, say,&amp;nbsp;&lt;FONT face="courier new,courier"&gt;var2=5.1&lt;/FONT&gt; would be regarded as a better match than &lt;FONT face="courier new,courier"&gt;var2=9.0&lt;/FONT&gt;, everything else being the same. You would need to decide about the priorities, e.g., if &lt;FONT face="courier new,courier"&gt;var1=2&lt;/FONT&gt; (exact match in the third-priority criterion) should trump a close, but not exact match in &lt;FONT face="courier new,courier"&gt;var2&lt;/FONT&gt; (&lt;FONT face="courier new,courier"&gt;5.03&lt;/FONT&gt;, say). The implementation in PROC SQL would be fairly easy.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Edit 2: Another way to understand the hierarchical rule is to consider the "ranking" of the eight possible cases that can be distinguished by the three binary criteria. Each observation of a BY group falls into one of eight categories and an observation with the "highest" available priority (1=top, ..., 8=lowest priority) is selected.&lt;/P&gt;
&lt;PRE&gt;priority var1 var2 var3
    1      2    5  non-missing
    2     ≠2    5  non-missing
    3      2   ≠5  non-missing
    4     ≠2   ≠5  non-missing
    5      2    5  missing
    6     ≠2    5  missing
    7      2   ≠5  missing
    8     ≠2   ≠5  missing&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 06 Apr 2020 19:31:44 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Remove-duplicate-observations-based-on-multiple-criteria/m-p/637875#M189658</guid>
      <dc:creator>FreelanceReinh</dc:creator>
      <dc:date>2020-04-06T19:31:44Z</dc:date>
    </item>
    <item>
      <title>Re: Remove duplicate observations based on multiple criteria</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Remove-duplicate-observations-based-on-multiple-criteria/m-p/637915#M189672</link>
      <description>Thanks for this reply, really well explained!</description>
      <pubDate>Mon, 06 Apr 2020 18:38:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Remove-duplicate-observations-based-on-multiple-criteria/m-p/637915#M189672</guid>
      <dc:creator>RoddyJ</dc:creator>
      <dc:date>2020-04-06T18:38:53Z</dc:date>
    </item>
  </channel>
</rss>

