<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Deleting ALL duplicate records in SAS Procedures</title>
    <link>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43358#M11308</link>
    <description>Hi there&lt;BR /&gt;
&lt;BR /&gt;
Firstly apologies if this question is in the wrong place.&lt;BR /&gt;
&lt;BR /&gt;
Secondly, more apologies for being such a newbie and having to ask about such basic things!&lt;BR /&gt;
&lt;BR /&gt;
and finally.....&lt;BR /&gt;
A bit of a theoretical question for you all.&lt;BR /&gt;
How would you go about deleting &lt;U&gt;&lt;I&gt;&lt;B&gt;all&lt;/B&gt;&lt;/I&gt;&lt;/U&gt; duplicate observations from a dataset??&lt;BR /&gt;
&lt;BR /&gt;
eg: I have two datasets containing the same information, downloaded a month apart. The latest dataset contains all the observations which appear in the earlier set plus a few dozen new observations. &lt;BR /&gt;
&lt;BR /&gt;
I know NODUP and NODUPKEY only write the first occurrence of a duplicate to the set you’re creating. That’s great, but I want to remove &lt;I&gt;&lt;B&gt;both&lt;/B&gt;&lt;/I&gt; occurrences of the duplicates, leaving me with a dataset that contains &lt;I&gt;&lt;B&gt;only the new observations.&lt;/B&gt;&lt;/I&gt;&lt;BR /&gt;
&lt;BR /&gt;
Any useful tips would be much appreciated.&lt;BR /&gt;
&lt;BR /&gt;
Thanks in advance</description>
    <pubDate>Mon, 04 Apr 2011 04:58:02 GMT</pubDate>
    <dc:creator>deleted_user</dc:creator>
    <dc:date>2011-04-04T04:58:02Z</dc:date>
    <item>
      <title>Deleting ALL duplicate records</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43358#M11308</link>
      <description>Hi there&lt;BR /&gt;
&lt;BR /&gt;
Firstly apologies if this question is in the wrong place.&lt;BR /&gt;
&lt;BR /&gt;
Secondly, more apologies for being such a newbie and having to ask about such basic things!&lt;BR /&gt;
&lt;BR /&gt;
and finally.....&lt;BR /&gt;
A bit of a theoretical question for you all.&lt;BR /&gt;
How would you go about deleting &lt;U&gt;&lt;I&gt;&lt;B&gt;all&lt;/B&gt;&lt;/I&gt;&lt;/U&gt; duplicate observations from a dataset??&lt;BR /&gt;
&lt;BR /&gt;
eg: I have two datasets containing the same information, downloaded a month apart. The latest dataset contains all the observations which appear in the earlier set plus a few dozen new observations. &lt;BR /&gt;
&lt;BR /&gt;
I know NODUP and NODUPKEY only write the first occurrence of a duplicate to the set you’re creating. That’s great, but I want to remove &lt;I&gt;&lt;B&gt;both&lt;/B&gt;&lt;/I&gt; occurrences of the duplicates, leaving me with a dataset that contains &lt;I&gt;&lt;B&gt;only the new observations.&lt;/B&gt;&lt;/I&gt;&lt;BR /&gt;
&lt;BR /&gt;
Any useful tips would be much appreciated.&lt;BR /&gt;
&lt;BR /&gt;
Thanks in advance</description>
      <pubDate>Mon, 04 Apr 2011 04:58:02 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43358#M11308</guid>
      <dc:creator>deleted_user</dc:creator>
      <dc:date>2011-04-04T04:58:02Z</dc:date>
    </item>
    <item>
      <title>Re: Deleting ALL duplicate records</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43359#M11309</link>
      <description>proc sql;&lt;BR /&gt;
create table tabB as&lt;BR /&gt;
select obs from tabA&lt;BR /&gt;
group by obs having count(*)&amp;gt;1;&lt;BR /&gt;
&lt;BR /&gt;
delete from tabA where obs in (select obs from tabB);&lt;BR /&gt;
quit;</description>
      <pubDate>Mon, 04 Apr 2011 11:50:31 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43359#M11309</guid>
      <dc:creator>DBailey</dc:creator>
      <dc:date>2011-04-04T11:50:31Z</dc:date>
    </item>
    <item>
      <title>Re: Deleting ALL duplicate records</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43360#M11310</link>
      <description>I wonder if it achieves the intended goal. Adre the remaing obs all new ones? I'd like to know how to delete extra dup using SQL. Thanks in advance.</description>
      <pubDate>Tue, 05 Apr 2011 16:14:47 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43360#M11310</guid>
      <dc:creator>buckeye</dc:creator>
      <dc:date>2011-04-05T16:14:47Z</dc:date>
    </item>
    <item>
      <title>Re: Deleting ALL duplicate records</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43361#M11311</link>
      <description>How would you distinguish "..only the new observation"?  If there is some event-date/time variable, you can sort the file with your "base" BY variable list along with DESCENDING &lt;VARNAME&gt; -- but don't code NODUPKEY in the first sort.&lt;BR /&gt;
&lt;BR /&gt;
Then issue a second SORT using NODUPKEY along with the EQUALS parameter while only providing your "base" BY variable list.&lt;BR /&gt;
&lt;BR /&gt;
The other alternative using a DATA step is to ensure that your input file gets sorted into the desired sequence, then use a DATA step with a SET and a BY statement with your "base" BY variable list.  And in an IF &lt;EXPRESSION&gt;/THEN OUTPUT; statement, use the LAST.&lt;BREAK_VARIABLE&gt;  technique so that only the last-occurrence of any "by variable group" of observations (presumably the newest obs?) will be output.&lt;BR /&gt;
&lt;BR /&gt;
Scott Barry&lt;BR /&gt;
SBBWorks, Inc.&lt;BR /&gt;
&lt;BR /&gt;
Suggested Google advanced search arguments, this topic / post:&lt;BR /&gt;
&lt;BR /&gt;
proc sort nodupkey site:sas.com&lt;BR /&gt;
&lt;BR /&gt;
data step by group processing site:sas.com&lt;BR /&gt;
&lt;BR /&gt;
data step remove duplicates site:sas.com&lt;/BREAK_VARIABLE&gt;&lt;/EXPRESSION&gt;&lt;/VARNAME&gt;</description>
      <pubDate>Tue, 05 Apr 2011 18:11:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43361#M11311</guid>
      <dc:creator>sbb</dc:creator>
      <dc:date>2011-04-05T18:11:09Z</dc:date>
    </item>
    <item>
      <title>Re: Deleting ALL duplicate records</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43362#M11312</link>
      <description>I think people might be overthinking this.  How about:&lt;BR /&gt;
&lt;BR /&gt;
[pre]&lt;BR /&gt;
proc sql;&lt;BR /&gt;
    select * from new&lt;BR /&gt;
    except&lt;BR /&gt;
    select * from old;&lt;BR /&gt;
[/pre]</description>
      <pubDate>Tue, 05 Apr 2011 18:39:14 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43362#M11312</guid>
      <dc:creator>RickM</dc:creator>
      <dc:date>2011-04-05T18:39:14Z</dc:date>
    </item>
    <item>
      <title>Re: Deleting ALL duplicate records</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43363#M11313</link>
      <description>To RickM:  How would the PROC SQL example address the OP's stated objective with removing all duplicates within a given SAS file, while only retaining the "most recent" observation for a given "by variable list"?&lt;BR /&gt;
&lt;BR /&gt;
Scott Barry&lt;BR /&gt;
SBBWorks, Inc.</description>
      <pubDate>Tue, 05 Apr 2011 21:33:38 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43363#M11313</guid>
      <dc:creator>sbb</dc:creator>
      <dc:date>2011-04-05T21:33:38Z</dc:date>
    </item>
    <item>
      <title>Re: Deleting ALL duplicate records</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43364#M11314</link>
      <description>[pre]&lt;BR /&gt;
proc sort data=class ;&lt;BR /&gt;
 by name sex weight height;&lt;BR /&gt;
run;&lt;BR /&gt;
data class;&lt;BR /&gt;
 set class;&lt;BR /&gt;
 by name sex weight height;&lt;BR /&gt;
 if not (first.height and last.height) then delete;&lt;BR /&gt;
run;&lt;BR /&gt;
[/pre]&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
Ksharp</description>
      <pubDate>Wed, 06 Apr 2011 08:39:05 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43364#M11314</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2011-04-06T08:39:05Z</dc:date>
    </item>
    <item>
      <title>Re: Deleting ALL duplicate records</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43365#M11315</link>
      <description>sbb,&lt;BR /&gt;
&lt;BR /&gt;
From the OP:&lt;BR /&gt;
"eg: I have two datasets containing the same information, downloaded a month apart. The latest dataset contains all the observations which appear in the earlier set plus a few dozen new observations. "&lt;BR /&gt;
&lt;BR /&gt;
So the problem, as I interpreted it, is that they are starting out with two data sets.  The other solutions assume the data is already combined and removing duplicates within one dataset is the only way to solve the problem.  Why not find the new values before the two datasets are put together.</description>
      <pubDate>Wed, 06 Apr 2011 13:35:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43365#M11315</guid>
      <dc:creator>RickM</dc:creator>
      <dc:date>2011-04-06T13:35:06Z</dc:date>
    </item>
    <item>
      <title>Re: Deleting ALL duplicate records</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43366#M11316</link>
      <description>Is there a unique identifier in each record?  If so....&lt;BR /&gt;
&lt;BR /&gt;
proc sort data=one;by ID;run;&lt;BR /&gt;
proc sort data=two;by ID;run;&lt;BR /&gt;
&lt;BR /&gt;
data match;&lt;BR /&gt;
merge one (in=a) two (in=b);&lt;BR /&gt;
by ID;&lt;BR /&gt;
if a and b then from='M';&lt;BR /&gt;
if a and not b then from='A';&lt;BR /&gt;
if not a and b then from='B';&lt;BR /&gt;
run;&lt;BR /&gt;
proc freq data=match;&lt;BR /&gt;
table from;&lt;BR /&gt;
run;&lt;BR /&gt;
                 */the observations where from='B' are in the 2nd set only*/&lt;BR /&gt;
data new;&lt;BR /&gt;
set match;&lt;BR /&gt;
if from ne 'B' then delete;&lt;BR /&gt;
run;</description>
      <pubDate>Wed, 06 Apr 2011 14:05:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43366#M11316</guid>
      <dc:creator>GreggB</dc:creator>
      <dc:date>2011-04-06T14:05:53Z</dc:date>
    </item>
    <item>
      <title>Re: Deleting ALL duplicate records</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43367#M11317</link>
      <description>Hi there and thanks for all your help, &lt;BR /&gt;
&lt;BR /&gt;
I probably didn't explain things as clearly as I could, so to clear things up, my problem was two data sets (&lt;I&gt;'this month'&lt;/I&gt; and &lt;I&gt;'last month'&lt;/I&gt;) with identical variables (although the values of one variable 'annualpremium' may differ from month to month). &lt;I&gt;'last month'&lt;/I&gt; has an unidentified amount of observations which have been dropped and are not contained in &lt;I&gt;'this month'&lt;/I&gt;.&lt;BR /&gt;
&lt;BR /&gt;
I needed to produce one dataset containing only the dropped observations, ie: those that appear in &lt;I&gt;'last month&lt;/I&gt;' that don't appear in &lt;I&gt;'this month'&lt;/I&gt;. No other criteria at all.&lt;BR /&gt;
&lt;BR /&gt;
This is how I managed it in the end, probably not the most efficient method but it seems to have generated the right results........&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
proc sql;&lt;BR /&gt;
create table combined as&lt;BR /&gt;
      select      t1.PolicyNumber, &lt;BR /&gt;
                  t1.OriginalStartDate,&lt;BR /&gt;
                  t1.AnnualPremium,&lt;BR /&gt;
                  t2.PolicyNumber as PolicyNumber2&lt;BR /&gt;
      from  thismonth as t1      &lt;BR /&gt;
                        left join&lt;BR /&gt;
                        lastmonth as t2&lt;BR /&gt;
      on          t1.PolicyNumber = t2.PolicyNumber;&lt;BR /&gt;
quit;&lt;BR /&gt;
&lt;BR /&gt;
data droppedpolicy (keep = PolicyNumber OriginalStartDate AnnualPremium);&lt;BR /&gt;
      set combined;&lt;BR /&gt;
      if PolicyNumber = PolicyNumber2 then delete;&lt;BR /&gt;
run;</description>
      <pubDate>Wed, 06 Apr 2011 21:04:01 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43367#M11317</guid>
      <dc:creator>deleted_user</dc:creator>
      <dc:date>2011-04-06T21:04:01Z</dc:date>
    </item>
    <item>
      <title>Re: Deleting ALL duplicate records</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43368#M11318</link>
      <description>how about:&lt;BR /&gt;
proc sql;&lt;BR /&gt;
create table DroppedRecs as&lt;BR /&gt;
select&lt;BR /&gt;
  t1.*&lt;BR /&gt;
from&lt;BR /&gt;
  lastmonth t1&lt;BR /&gt;
  left join thismonth t2&lt;BR /&gt;
     on t1.policynumber=t2.policynumber&lt;BR /&gt;
where&lt;BR /&gt;
  t2.policynumber is null;&lt;BR /&gt;
quit;</description>
      <pubDate>Thu, 07 Apr 2011 02:05:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43368#M11318</guid>
      <dc:creator>DBailey</dc:creator>
      <dc:date>2011-04-07T02:05:23Z</dc:date>
    </item>
    <item>
      <title>Re: Deleting ALL duplicate records</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43369#M11319</link>
      <description>Hi&lt;BR /&gt;
&lt;BR /&gt;
Proc sort data=datasetname noduprec out=newdatasetname;&lt;BR /&gt;
by varname1 varname2;&lt;BR /&gt;
run;</description>
      <pubDate>Fri, 08 Apr 2011 15:50:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43369#M11319</guid>
      <dc:creator>sss</dc:creator>
      <dc:date>2011-04-08T15:50:53Z</dc:date>
    </item>
    <item>
      <title>Re: Deleting ALL duplicate records</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43370#M11320</link>
      <description>Hi,&lt;BR /&gt;
&lt;BR /&gt;
My suggestion is to watch Cynthia's video on the very same problem where she suggests using NODUPKEY &lt;BR /&gt;
this is the URL:&lt;BR /&gt;
&lt;BR /&gt;
&lt;A href="http://www.youtube.com/watch?v=Srs7VhEqBts" target="_blank"&gt;http://www.youtube.com/watch?v=Srs7VhEqBts&lt;/A&gt;</description>
      <pubDate>Fri, 08 Apr 2011 16:03:08 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/43370#M11320</guid>
      <dc:creator>R_A_G_</dc:creator>
      <dc:date>2011-04-08T16:03:08Z</dc:date>
    </item>
    <item>
      <title>Re: Deleting ALL duplicate records</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/347978#M63677</link>
      <description>&lt;P&gt;Her video was on the basic proc sort. It did not address what the OP was asking.&lt;/P&gt;</description>
      <pubDate>Fri, 07 Apr 2017 03:34:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/347978#M63677</guid>
      <dc:creator>cadams47</dc:creator>
      <dc:date>2017-04-07T03:34:39Z</dc:date>
    </item>
    <item>
      <title>Re: Deleting ALL duplicate records</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/431130#M68658</link>
      <description>&lt;P&gt;I realize this is an old thread and the op already came up with a solution, but for future reference and my own development would the "distinct" keyword not have worked to remove all observations that were identical across all variables in proc sql? For instance,&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;proc sql;&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;create table x as&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;select DISTINCT (variable1),&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; variable2,&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; variable3,&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; variable4,&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; etc....&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; from old_table;&lt;/P&gt;&lt;P&gt;quit;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 25 Jan 2018 23:35:16 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Deleting-ALL-duplicate-records/m-p/431130#M68658</guid>
      <dc:creator>morgbri</dc:creator>
      <dc:date>2018-01-25T23:35:16Z</dc:date>
    </item>
  </channel>
</rss>

