<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Duplicate data with different dates - keep the newest in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Duplicate-data-with-different-dates-keep-the-newest/m-p/747962#M234845</link>
    <description>&lt;P&gt;In real life, I would expect that your data set contains additional variables.&amp;nbsp; If you want to keep the most recent value for all variables, it is possible that the most recent observation for a variable contains a missing value while an earlier observation contains a nonmissing value.&amp;nbsp; To keep the most recent non-missing value (possibly piecing together the final observation from several original observations), use:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=have;
    by customerID createdate;
run;
data want;
    update have (obs=0) have;
    by customerID;
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
    <pubDate>Mon, 14 Jun 2021 20:49:31 GMT</pubDate>
    <dc:creator>Astounding</dc:creator>
    <dc:date>2021-06-14T20:49:31Z</dc:date>
    <item>
      <title>Duplicate data with different dates - keep the newest</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Duplicate-data-with-different-dates-keep-the-newest/m-p/747934#M234832</link>
      <description>&lt;P&gt;Assume I have data of the sort:&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;TABLE width="223"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="86"&gt;&lt;STRONG&gt;Customer ID&amp;nbsp;&lt;/STRONG&gt;&lt;/TD&gt;
&lt;TD width="57"&gt;&lt;STRONG&gt;Account&lt;/STRONG&gt;&lt;/TD&gt;
&lt;TD width="80"&gt;&lt;STRONG&gt;Create Date&lt;/STRONG&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;1&lt;/TD&gt;
&lt;TD&gt;123&lt;/TD&gt;
&lt;TD&gt;01-jan-20&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;1&lt;/TD&gt;
&lt;TD&gt;345&lt;/TD&gt;
&lt;TD&gt;02-jan-20&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;2&lt;/TD&gt;
&lt;TD&gt;300&lt;/TD&gt;
&lt;TD&gt;10-jan-20&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;&lt;STRONG&gt;&lt;EM&gt;1&lt;/EM&gt;&lt;/STRONG&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;STRONG&gt;&lt;EM&gt;200&lt;/EM&gt;&lt;/STRONG&gt;&lt;/TD&gt;
&lt;TD&gt;&lt;STRONG&gt;&lt;EM&gt;05-jan-20&lt;/EM&gt;&lt;/STRONG&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Notice how I have three duplicates in terms of Customer_ID, I would only like to keep the last observation among these three duplicates since it is the newest observation. For example, 05-jan-20 is "younger" than (01-jan-20 and&amp;nbsp;02-jan-20) and therefore I want to keep the observation from 05-jan-20.&lt;BR /&gt;&lt;BR /&gt;Any advide on how I could do this?&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;Thanks.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 14 Jun 2021 19:32:37 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Duplicate-data-with-different-dates-keep-the-newest/m-p/747934#M234832</guid>
      <dc:creator>SasStatistics</dc:creator>
      <dc:date>2021-06-14T19:32:37Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate data with different dates - keep the newest</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Duplicate-data-with-different-dates-keep-the-newest/m-p/747938#M234834</link>
      <description>&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=have;
    by customerID createdate;
run;
data want;
    set have;
    by customerID;
    if last.customerID;
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Mon, 14 Jun 2021 19:44:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Duplicate-data-with-different-dates-keep-the-newest/m-p/747938#M234834</guid>
      <dc:creator>PaigeMiller</dc:creator>
      <dc:date>2021-06-14T19:44:29Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate data with different dates - keep the newest</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Duplicate-data-with-different-dates-keep-the-newest/m-p/747941#M234836</link>
      <description>&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql;&lt;BR /&gt;create&amp;nbsp;table&amp;nbsp;want&amp;nbsp;as&amp;nbsp;select&amp;nbsp;*&amp;nbsp;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;from&amp;nbsp;yourtable&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;group&amp;nbsp;by&amp;nbsp;customerID&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;having&amp;nbsp;createdate=max(createdate);&lt;BR /&gt;quit;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Mon, 14 Jun 2021 19:53:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Duplicate-data-with-different-dates-keep-the-newest/m-p/747941#M234836</guid>
      <dc:creator>tarheel13</dc:creator>
      <dc:date>2021-06-14T19:53:34Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate data with different dates - keep the newest</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Duplicate-data-with-different-dates-keep-the-newest/m-p/747962#M234845</link>
      <description>&lt;P&gt;In real life, I would expect that your data set contains additional variables.&amp;nbsp; If you want to keep the most recent value for all variables, it is possible that the most recent observation for a variable contains a missing value while an earlier observation contains a nonmissing value.&amp;nbsp; To keep the most recent non-missing value (possibly piecing together the final observation from several original observations), use:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=have;
    by customerID createdate;
run;
data want;
    update have (obs=0) have;
    by customerID;
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Mon, 14 Jun 2021 20:49:31 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Duplicate-data-with-different-dates-keep-the-newest/m-p/747962#M234845</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2021-06-14T20:49:31Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate data with different dates - keep the newest</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Duplicate-data-with-different-dates-keep-the-newest/m-p/747992#M234863</link>
      <description>&lt;P&gt;If sorting is expensive, here's an alternative:&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc summary data=have nway;
  class customer_id;
  var create_date;
  output out=want (drop=_:) 
    max(create_date)=create_date 
    maxid(create_date(_all_))=;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Not only is pre-sorting unnecessary, but you could also ask for 1-way, 2-way, n-way classifications.&amp;nbsp; This would be handy if you need something like the most recent date not only for each customer, but also for (say) each product (which might otherwise require multiple pre-sorts).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The "&lt;EM&gt;&lt;STRONG&gt;maxid(create_date(_all_))=&lt;/STRONG&gt;&lt;/EM&gt;" expression tells proc summary to output the entire record (all the variables) containing the corresponding most-recent-date.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Edited comment:&amp;nbsp; &amp;nbsp;The program above will also generate these messages:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;WARNING: Variable customer_id already exists on file WORK.WANT.
WARNING: Variable create_date already exists on file WORK.WANT.
WARNING: The duplicate variables will not be included in the output data set of the output statement number 1.
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This is due to the fact that the &lt;EM&gt;&lt;STRONG&gt;CLASS&lt;/STRONG&gt;&lt;/EM&gt; and &lt;EM&gt;&lt;STRONG&gt;VAR&lt;/STRONG&gt;&lt;/EM&gt; statements name variables (customer_id and create_date, respectively) that will be in the output dataset.&amp;nbsp; However, they are also implied in the "&lt;EM&gt;&lt;STRONG&gt;(_all_)&lt;/STRONG&gt;&lt;/EM&gt;" subparameter of the &lt;EM&gt;&lt;STRONG&gt;maxid&lt;/STRONG&gt;&lt;/EM&gt; parameter of the &lt;EM&gt;&lt;STRONG&gt;OUTPUT&lt;/STRONG&gt;&lt;/EM&gt; statement.&amp;nbsp; Now you could eliminate the warning for the create_date variable via:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc summary data=have nway;
  class customer_id;
  var create_date;
  output out=want (drop=_: ) 
    max(create_date)=_max_create_date 
    maxid(create_date(_all_))=;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Since the record for maxid(create_date) has the actual maximum value for create_date, renaming the generated maximum value statistic to "_max_create_date" (subsequently dropped) simply eliminates a duplicate value as well as the superfluous warning.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;There may be a way to get rid of the remaining warning for the class variable, while using the "&lt;EM&gt;&lt;STRONG&gt;_all_&lt;/STRONG&gt;&lt;/EM&gt;" parameter, but it eludes me at the moment.&lt;/P&gt;</description>
      <pubDate>Tue, 15 Jun 2021 18:42:18 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Duplicate-data-with-different-dates-keep-the-newest/m-p/747992#M234863</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2021-06-15T18:42:18Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate data with different dates - keep the newest</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Duplicate-data-with-different-dates-keep-the-newest/m-p/748037#M234890</link>
      <description>&lt;P&gt;To add two more ways to solve the problem:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* 1) double sort */
proc sort data=have;
   by CustomerID descending CreateDate;
run;

proc sort data=have out=want nodupkey;
   by CustomerID;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;If proc sort seems to be to simple, you could use a hash-object:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_;
   if 0 then set have;
   
   if _n_ = 1 then do;
      declare hash h();
      h.defineKey('CustomerID');
      h.defineData('CustomerID', 'Account', 'CreateDate');
      h.defineDone();
   end;
   
   set have(rename=(Account = acc CreateDate = cd)) end=jobDone;
   
   if h.find() ^= 0 then do;
      Account = acc;
      CreateDate = cd;
      h.add();
   end;
   else do;
      if CreateDate &amp;lt; cd then do;
         Account = acc;
         CreateDate = cd;
         h.update();
      end;
   end;
   
   if jobDone then do;
      h.output(dataset: 'work.want2');
   end;
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 15 Jun 2021 06:21:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Duplicate-data-with-different-dates-keep-the-newest/m-p/748037#M234890</guid>
      <dc:creator>andreas_lds</dc:creator>
      <dc:date>2021-06-15T06:21:09Z</dc:date>
    </item>
    <item>
      <title>Re: Duplicate data with different dates - keep the newest</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Duplicate-data-with-different-dates-keep-the-newest/m-p/748040#M234892</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/381436"&gt;@SasStatistics&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You've got already solutions how to keep the row with the most recent date per customer id.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Assuming your sample data somehow represents your actual data: Isn't it possible that a customer can have more than one account? If so then wouldn't you need to keep the most recent row per customer_id AND account_id? ...and potentially also include some other column in your logic with an expiration date or account status?&lt;/P&gt;</description>
      <pubDate>Tue, 15 Jun 2021 06:48:50 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Duplicate-data-with-different-dates-keep-the-newest/m-p/748040#M234892</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2021-06-15T06:48:50Z</dc:date>
    </item>
  </channel>
</rss>

