<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Identifying Duplicates datasets in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Identifying-Duplicates-datasets/m-p/360021#M84691</link>
    <description>&lt;P&gt;Are all datasets of same format? - same variable names?&lt;/P&gt;
&lt;P&gt;Are there variables that exist only in part of the datasets?&lt;/P&gt;
&lt;P&gt;What variables (IDs) are need to check matching?&lt;/P&gt;
&lt;P&gt;Can it be that matchong observations by ID have different attributes in the other variables ?&lt;/P&gt;
&lt;P&gt;Why or what created the different datasets? What created the duplicates ?&lt;/P&gt;
&lt;P&gt;What do you want to do with "duplicate" datasets, as you defined ? or with the matching observations ?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Would you like to merge or create one dataset of all unique IDs ?&lt;/P&gt;
&lt;P&gt;If yes - what would be the criteria for saving variables from one dataset and drop from the other ?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Fri, 19 May 2017 15:28:31 GMT</pubDate>
    <dc:creator>Shmuel</dc:creator>
    <dc:date>2017-05-19T15:28:31Z</dc:date>
    <item>
      <title>Identifying Duplicates datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Identifying-Duplicates-datasets/m-p/360015#M84688</link>
      <description>&lt;P&gt;Hi Guys,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a situation where i have thousands of sas datasets.In which there are duplicates too.&lt;/P&gt;&lt;P&gt;I need to find out the top 10 duplicate datasets &amp;nbsp;which are taking most of my sas memory out of them.&lt;/P&gt;&lt;P&gt;When i say duplicates if there is a 75-80% match with other datasets, i consider them as duplicates.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any help to solve this situation would be appreciated.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 19 May 2017 15:12:45 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Identifying-Duplicates-datasets/m-p/360015#M84688</guid>
      <dc:creator>User12</dc:creator>
      <dc:date>2017-05-19T15:12:45Z</dc:date>
    </item>
    <item>
      <title>Re: Identifying Duplicates datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Identifying-Duplicates-datasets/m-p/360021#M84691</link>
      <description>&lt;P&gt;Are all datasets of same format? - same variable names?&lt;/P&gt;
&lt;P&gt;Are there variables that exist only in part of the datasets?&lt;/P&gt;
&lt;P&gt;What variables (IDs) are need to check matching?&lt;/P&gt;
&lt;P&gt;Can it be that matchong observations by ID have different attributes in the other variables ?&lt;/P&gt;
&lt;P&gt;Why or what created the different datasets? What created the duplicates ?&lt;/P&gt;
&lt;P&gt;What do you want to do with "duplicate" datasets, as you defined ? or with the matching observations ?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Would you like to merge or create one dataset of all unique IDs ?&lt;/P&gt;
&lt;P&gt;If yes - what would be the criteria for saving variables from one dataset and drop from the other ?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 19 May 2017 15:28:31 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Identifying-Duplicates-datasets/m-p/360021#M84691</guid>
      <dc:creator>Shmuel</dc:creator>
      <dc:date>2017-05-19T15:28:31Z</dc:date>
    </item>
    <item>
      <title>Re: Identifying Duplicates datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Identifying-Duplicates-datasets/m-p/360432#M84865</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/88384"&gt;@Shmuel&lt;/a&gt;&amp;nbsp;the answer is a big yes to first three questions there can be all three situations and the id variable will be a unique identifier.&lt;/P&gt;&lt;P&gt;i am not clear with the 4th question and fifth question multiple users pulling same dataset from rdbms for different purposes or maybe same purposes has created the duplicates and if i can find the duplicates exactly i just want to delete them&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Mon, 22 May 2017 13:44:49 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Identifying-Duplicates-datasets/m-p/360432#M84865</guid>
      <dc:creator>User12</dc:creator>
      <dc:date>2017-05-22T13:44:49Z</dc:date>
    </item>
    <item>
      <title>Re: Identifying Duplicates datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Identifying-Duplicates-datasets/m-p/360663#M84967</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I hope you realize you will have to make a comparison of all datasets together.&lt;/P&gt;&lt;P&gt;For 1000 datasets&amp;nbsp;that means C(1000,2)=1000!/(2!(1000-2)!)=499500 comparisons...&lt;/P&gt;&lt;P&gt;Not sure how long it will take.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Try this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;%* Create test data;
data a b c d e;
   do i=1 to 10;
   uid=i;
   if i &amp;gt; 4 then output a b c d e;
   else if i &amp;gt; 3 then output a b c d;
   else if i&amp;gt;2 then output a b c;
   else if i&amp;gt;1 then output a b;
   else output a;
   end; 
   drop i;
run;

%* Determine Datasets to check;
PROC SQL;
   CREATE VIEW v_all as
   SELECT memname,nobs 
   from sashelp.vtable
   where libname eq 'WORK' /*adapt*/
   and memname not in ('__COMPARISON', 'V_ALL')
   ;
   CREATE TABLE __COMPARISON AS
   select a.memname as current,a.nobs as currentnobs, b.memname as next, . as similarity format=percent7.1
   from v_all a,v_all b
   where b.memname&amp;gt;a.memname
   order by a.memname,b.memname
   ;
   DROP VIEW v_all;
QUIT;

%* Prepare comparison macro;
%MACRO compare(uniqueIDVar,current,next);
   PROC SQL;
      update __COMPARISON SET similarity=(
         (select count(*) from &amp;amp;current. where &amp;amp;uniqueIDVar. in (select &amp;amp;uniqueIDVar. from &amp;amp;next.))/currentnobs
      ) where current eq "&amp;amp;current." and next eq "&amp;amp;next."
      ;  
   QUIT;
%MEND compare;

%* Run comparisons;
DATA _NULL_;
   set __COMPARISON;
   call execute('%nrstr(%compare(uid,'||strip(current)||','||strip(next)||'))');
RUN;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Cheers,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 23 May 2017 09:53:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Identifying-Duplicates-datasets/m-p/360663#M84967</guid>
      <dc:creator>Oligolas</dc:creator>
      <dc:date>2017-05-23T09:53:23Z</dc:date>
    </item>
    <item>
      <title>Re: Identifying Duplicates datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Identifying-Duplicates-datasets/m-p/360691#M84980</link>
      <description>&lt;P&gt;If all datasets are selections from a &lt;STRONG&gt;dbms&lt;/STRONG&gt;, why not replace them&lt;/P&gt;
&lt;P&gt;by one dataset which is a "mirror" of the dbms, and delete the all selections.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Then each user will create its own &lt;STRONG&gt;temporary&lt;/STRONG&gt; &lt;STRONG&gt;selection&lt;/STRONG&gt; according to his/her needs,&lt;/P&gt;
&lt;P&gt;or even create views onle of that "mirror".&lt;/P&gt;</description>
      <pubDate>Tue, 23 May 2017 11:54:26 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Identifying-Duplicates-datasets/m-p/360691#M84980</guid>
      <dc:creator>Shmuel</dc:creator>
      <dc:date>2017-05-23T11:54:26Z</dc:date>
    </item>
    <item>
      <title>Re: Identifying Duplicates datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Identifying-Duplicates-datasets/m-p/361755#M85362</link>
      <description>&lt;P&gt;Hi &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/77163"&gt;@Oligolas&lt;/a&gt; that was an excellent poc that helped me out. But it is comparing table c with table d and e only but i want to compare table c with table a and table b too .&lt;/P&gt;&lt;P&gt;And also i have nearly 100 variables in each dataset to compare can you help me out there&lt;/P&gt;&lt;P&gt;And another concern is what if there is no common variable in two datasets.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Thanks very much&lt;/P&gt;</description>
      <pubDate>Thu, 25 May 2017 20:10:44 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Identifying-Duplicates-datasets/m-p/361755#M85362</guid>
      <dc:creator>User12</dc:creator>
      <dc:date>2017-05-25T20:10:44Z</dc:date>
    </item>
    <item>
      <title>Re: Identifying Duplicates datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Identifying-Duplicates-datasets/m-p/361760#M85363</link>
      <description>Hi &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/88384"&gt;@Shmuel&lt;/a&gt; they are from dbms but they have been pulled into sas and stored in sas already so i need to eliminate the unnecessary ones&lt;BR /&gt;</description>
      <pubDate>Thu, 25 May 2017 20:03:48 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Identifying-Duplicates-datasets/m-p/361760#M85363</guid>
      <dc:creator>User12</dc:creator>
      <dc:date>2017-05-25T20:03:48Z</dc:date>
    </item>
    <item>
      <title>Re: Identifying Duplicates datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Identifying-Duplicates-datasets/m-p/366081#M87036</link>
      <description>&lt;P&gt;Hi,&lt;BR /&gt;&lt;BR /&gt;if you already compared a with c, you do not need to compare c with a.&lt;BR /&gt;Regarding the 100 variables... is this an issue you still require assistance on? if so, are these&amp;nbsp;100 variables&amp;nbsp;always available in all datasets? if you have those 100 variables in a and let's say only 85 of them in b (15 variables of 'a' are'nt available in 'b'), what should be the result of the comparison?&lt;BR /&gt;&lt;BR /&gt;Can you perhaps provide 3-4 datasets for testing?&lt;/P&gt;</description>
      <pubDate>Mon, 12 Jun 2017 08:42:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Identifying-Duplicates-datasets/m-p/366081#M87036</guid>
      <dc:creator>Oligolas</dc:creator>
      <dc:date>2017-06-12T08:42:13Z</dc:date>
    </item>
  </channel>
</rss>

