topic Re: Identifying Duplicates datasets in SAS Programming

Identifying Duplicates datasets

User12 — Fri, 19 May 2017 15:12:45 GMT

Hi Guys,

I have a situation where i have thousands of sas datasets.In which there are duplicates too.

I need to find out the top 10 duplicate datasets which are taking most of my sas memory out of them.

When i say duplicates if there is a 75-80% match with other datasets, i consider them as duplicates.

Any help to solve this situation would be appreciated.

Thanks

Re: Identifying Duplicates datasets

Shmuel — Fri, 19 May 2017 15:28:31 GMT

Are all datasets of same format? - same variable names?

Are there variables that exist only in part of the datasets?

What variables (IDs) are need to check matching?

Can it be that matchong observations by ID have different attributes in the other variables ?

Why or what created the different datasets? What created the duplicates ?

What do you want to do with "duplicate" datasets, as you defined ? or with the matching observations ?

Would you like to merge or create one dataset of all unique IDs ?

If yes - what would be the criteria for saving variables from one dataset and drop from the other ?

Re: Identifying Duplicates datasets

User12 — Mon, 22 May 2017 13:44:49 GMT

Hi @Shmuel the answer is a big yes to first three questions there can be all three situations and the id variable will be a unique identifier.

i am not clear with the 4th question and fifth question multiple users pulling same dataset from rdbms for different purposes or maybe same purposes has created the duplicates and if i can find the duplicates exactly i just want to delete them

Thanks

Re: Identifying Duplicates datasets

Oligolas — Tue, 23 May 2017 09:53:23 GMT

Hi,

I hope you realize you will have to make a comparison of all datasets together.

For 1000 datasets that means C(1000,2)=1000!/(2!(1000-2)!)=499500 comparisons...

Not sure how long it will take.

Try this:

%* Create test data;
data a b c d e;
   do i=1 to 10;
   uid=i;
   if i > 4 then output a b c d e;
   else if i > 3 then output a b c d;
   else if i>2 then output a b c;
   else if i>1 then output a b;
   else output a;
   end; 
   drop i;
run;

%* Determine Datasets to check;
PROC SQL;
   CREATE VIEW v_all as
   SELECT memname,nobs 
   from sashelp.vtable
   where libname eq 'WORK' /*adapt*/
   and memname not in ('__COMPARISON', 'V_ALL')
   ;
   CREATE TABLE __COMPARISON AS
   select a.memname as current,a.nobs as currentnobs, b.memname as next, . as similarity format=percent7.1
   from v_all a,v_all b
   where b.memname>a.memname
   order by a.memname,b.memname
   ;
   DROP VIEW v_all;
QUIT;

%* Prepare comparison macro;
%MACRO compare(uniqueIDVar,current,next);
   PROC SQL;
      update __COMPARISON SET similarity=(
         (select count(*) from &current. where &uniqueIDVar. in (select &uniqueIDVar. from &next.))/currentnobs
      ) where current eq "&current." and next eq "&next."
      ;  
   QUIT;
%MEND compare;

%* Run comparisons;
DATA _NULL_;
   set __COMPARISON;
   call execute('%nrstr(%compare(uid,'||strip(current)||','||strip(next)||'))');
RUN;

Cheers,

Re: Identifying Duplicates datasets

Shmuel — Tue, 23 May 2017 11:54:26 GMT

If all datasets are selections from a dbms, why not replace them

by one dataset which is a "mirror" of the dbms, and delete the all selections.

Then each user will create its own temporary selection according to his/her needs,

or even create views onle of that "mirror".

Re: Identifying Duplicates datasets

User12 — Thu, 25 May 2017 20:10:44 GMT

Hi @Oligolas that was an excellent poc that helped me out. But it is comparing table c with table d and e only but i want to compare table c with table a and table b too .

And also i have nearly 100 variables in each dataset to compare can you help me out there

And another concern is what if there is no common variable in two datasets.

Thanks very much

Re: Identifying Duplicates datasets

User12 — Thu, 25 May 2017 20:03:48 GMT

Hi @Shmuel they are from dbms but they have been pulled into sas and stored in sas already so i need to eliminate the unnecessary ones

Re: Identifying Duplicates datasets

Oligolas — Mon, 12 Jun 2017 08:42:13 GMT

Hi,

if you already compared a with c, you do not need to compare c with a.
Regarding the 100 variables... is this an issue you still require assistance on? if so, are these 100 variables always available in all datasets? if you have those 100 variables in a and let's say only 85 of them in b (15 variables of 'a' are'nt available in 'b'), what should be the result of the comparison?

Can you perhaps provide 3-4 datasets for testing?