DATA Step, Macro, Functions and more

Identifying Duplicates datasets

Accepted Solution Solved
Reply
Contributor
Posts: 27
Accepted Solution

Identifying Duplicates datasets

Hi Guys,

 

I have a situation where i have thousands of sas datasets.In which there are duplicates too.

I need to find out the top 10 duplicate datasets  which are taking most of my sas memory out of them.

When i say duplicates if there is a 75-80% match with other datasets, i consider them as duplicates.

 

Any help to solve this situation would be appreciated.

 

Thanks

 


Accepted Solutions
Solution
‎05-25-2017 03:46 PM
Frequent Contributor
Posts: 129

Re: Identifying Duplicates datasets

Hi,

 

I hope you realize you will have to make a comparison of all datasets together.

For 1000 datasets that means C(1000,2)=1000!/(2!(1000-2)!)=499500 comparisons...

Not sure how long it will take.

 

Try this:

%* Create test data;
data a b c d e;
   do i=1 to 10;
   uid=i;
   if i > 4 then output a b c d e;
   else if i > 3 then output a b c d;
   else if i>2 then output a b c;
   else if i>1 then output a b;
   else output a;
   end; 
   drop i;
run;

%* Determine Datasets to check;
PROC SQL;
   CREATE VIEW v_all as
   SELECT memname,nobs 
   from sashelp.vtable
   where libname eq 'WORK' /*adapt*/
   and memname not in ('__COMPARISON', 'V_ALL')
   ;
   CREATE TABLE __COMPARISON AS
   select a.memname as current,a.nobs as currentnobs, b.memname as next, . as similarity format=percent7.1
   from v_all a,v_all b
   where b.memname>a.memname
   order by a.memname,b.memname
   ;
   DROP VIEW v_all;
QUIT;

%* Prepare comparison macro;
%MACRO compare(uniqueIDVar,current,next);
   PROC SQL;
      update __COMPARISON SET similarity=(
         (select count(*) from &current. where &uniqueIDVar. in (select &uniqueIDVar. from &next.))/currentnobs
      ) where current eq "&current." and next eq "&next."
      ;  
   QUIT;
%MEND compare;

%* Run comparisons;
DATA _NULL_;
   set __COMPARISON;
   call execute('%nrstr(%compare(uid,'||strip(current)||','||strip(next)||'))');
RUN;

Cheers,

 

________________________

- Cheers -

View solution in original post


All Replies
Trusted Advisor
Posts: 1,584

Re: Identifying Duplicates datasets

Are all datasets of same format? - same variable names?

Are there variables that exist only in part of the datasets?

What variables (IDs) are need to check matching?

Can it be that matchong observations by ID have different attributes in the other variables ?

Why or what created the different datasets? What created the duplicates ?

What do you want to do with "duplicate" datasets, as you defined ? or with the matching observations ?

 

Would you like to merge or create one dataset of all unique IDs ?

If yes - what would be the criteria for saving variables from one dataset and drop from the other ?

 

 

 

Contributor
Posts: 27

Re: Identifying Duplicates datasets

Hi @Shmuel the answer is a big yes to first three questions there can be all three situations and the id variable will be a unique identifier.

i am not clear with the 4th question and fifth question multiple users pulling same dataset from rdbms for different purposes or maybe same purposes has created the duplicates and if i can find the duplicates exactly i just want to delete them 

 

 

Thanks

Solution
‎05-25-2017 03:46 PM
Frequent Contributor
Posts: 129

Re: Identifying Duplicates datasets

Hi,

 

I hope you realize you will have to make a comparison of all datasets together.

For 1000 datasets that means C(1000,2)=1000!/(2!(1000-2)!)=499500 comparisons...

Not sure how long it will take.

 

Try this:

%* Create test data;
data a b c d e;
   do i=1 to 10;
   uid=i;
   if i > 4 then output a b c d e;
   else if i > 3 then output a b c d;
   else if i>2 then output a b c;
   else if i>1 then output a b;
   else output a;
   end; 
   drop i;
run;

%* Determine Datasets to check;
PROC SQL;
   CREATE VIEW v_all as
   SELECT memname,nobs 
   from sashelp.vtable
   where libname eq 'WORK' /*adapt*/
   and memname not in ('__COMPARISON', 'V_ALL')
   ;
   CREATE TABLE __COMPARISON AS
   select a.memname as current,a.nobs as currentnobs, b.memname as next, . as similarity format=percent7.1
   from v_all a,v_all b
   where b.memname>a.memname
   order by a.memname,b.memname
   ;
   DROP VIEW v_all;
QUIT;

%* Prepare comparison macro;
%MACRO compare(uniqueIDVar,current,next);
   PROC SQL;
      update __COMPARISON SET similarity=(
         (select count(*) from &current. where &uniqueIDVar. in (select &uniqueIDVar. from &next.))/currentnobs
      ) where current eq "&current." and next eq "&next."
      ;  
   QUIT;
%MEND compare;

%* Run comparisons;
DATA _NULL_;
   set __COMPARISON;
   call execute('%nrstr(%compare(uid,'||strip(current)||','||strip(next)||'))');
RUN;

Cheers,

 

________________________

- Cheers -

Contributor
Posts: 27

Re: Identifying Duplicates datasets

[ Edited ]

Hi @Oligolas that was an excellent poc that helped me out. But it is comparing table c with table d and e only but i want to compare table c with table a and table b too .

And also i have nearly 100 variables in each dataset to compare can you help me out there

And another concern is what if there is no common variable in two datasets.


Thanks very much

Frequent Contributor
Posts: 129

Re: Identifying Duplicates datasets

[ Edited ]

Hi,

if you already compared a with c, you do not need to compare c with a.
Regarding the 100 variables... is this an issue you still require assistance on? if so, are these 100 variables always available in all datasets? if you have those 100 variables in a and let's say only 85 of them in b (15 variables of 'a' are'nt available in 'b'), what should be the result of the comparison?

Can you perhaps provide 3-4 datasets for testing?

________________________

- Cheers -

Trusted Advisor
Posts: 1,584

Re: Identifying Duplicates datasets

If all datasets are selections from a dbms, why not replace them

by one dataset which is a "mirror" of the dbms, and delete the all selections.

 

Then each user will create its own temporary selection according to his/her needs,

or even create views onle of that "mirror".

Contributor
Posts: 27

Re: Identifying Duplicates datasets

Hi @Shmuel they are from dbms but they have been pulled into sas and stored in sas already so i need to eliminate the unnecessary ones
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 204 views
  • 1 like
  • 3 in conversation