BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
User12
Obsidian | Level 7

Hi Guys,

 

I have a situation where i have thousands of sas datasets.In which there are duplicates too.

I need to find out the top 10 duplicate datasets  which are taking most of my sas memory out of them.

When i say duplicates if there is a 75-80% match with other datasets, i consider them as duplicates.

 

Any help to solve this situation would be appreciated.

 

Thanks

 

1 ACCEPTED SOLUTION

Accepted Solutions
Oligolas
Barite | Level 11

Hi,

 

I hope you realize you will have to make a comparison of all datasets together.

For 1000 datasets that means C(1000,2)=1000!/(2!(1000-2)!)=499500 comparisons...

Not sure how long it will take.

 

Try this:

%* Create test data;
data a b c d e;
   do i=1 to 10;
   uid=i;
   if i > 4 then output a b c d e;
   else if i > 3 then output a b c d;
   else if i>2 then output a b c;
   else if i>1 then output a b;
   else output a;
   end; 
   drop i;
run;

%* Determine Datasets to check;
PROC SQL;
   CREATE VIEW v_all as
   SELECT memname,nobs 
   from sashelp.vtable
   where libname eq 'WORK' /*adapt*/
   and memname not in ('__COMPARISON', 'V_ALL')
   ;
   CREATE TABLE __COMPARISON AS
   select a.memname as current,a.nobs as currentnobs, b.memname as next, . as similarity format=percent7.1
   from v_all a,v_all b
   where b.memname>a.memname
   order by a.memname,b.memname
   ;
   DROP VIEW v_all;
QUIT;

%* Prepare comparison macro;
%MACRO compare(uniqueIDVar,current,next);
   PROC SQL;
      update __COMPARISON SET similarity=(
         (select count(*) from &current. where &uniqueIDVar. in (select &uniqueIDVar. from &next.))/currentnobs
      ) where current eq "&current." and next eq "&next."
      ;  
   QUIT;
%MEND compare;

%* Run comparisons;
DATA _NULL_;
   set __COMPARISON;
   call execute('%nrstr(%compare(uid,'||strip(current)||','||strip(next)||'))');
RUN;

Cheers,

 

________________________

- Cheers -

View solution in original post

7 REPLIES 7
Shmuel
Garnet | Level 18

Are all datasets of same format? - same variable names?

Are there variables that exist only in part of the datasets?

What variables (IDs) are need to check matching?

Can it be that matchong observations by ID have different attributes in the other variables ?

Why or what created the different datasets? What created the duplicates ?

What do you want to do with "duplicate" datasets, as you defined ? or with the matching observations ?

 

Would you like to merge or create one dataset of all unique IDs ?

If yes - what would be the criteria for saving variables from one dataset and drop from the other ?

 

 

 

User12
Obsidian | Level 7

Hi @Shmuel the answer is a big yes to first three questions there can be all three situations and the id variable will be a unique identifier.

i am not clear with the 4th question and fifth question multiple users pulling same dataset from rdbms for different purposes or maybe same purposes has created the duplicates and if i can find the duplicates exactly i just want to delete them 

 

 

Thanks

Oligolas
Barite | Level 11

Hi,

 

I hope you realize you will have to make a comparison of all datasets together.

For 1000 datasets that means C(1000,2)=1000!/(2!(1000-2)!)=499500 comparisons...

Not sure how long it will take.

 

Try this:

%* Create test data;
data a b c d e;
   do i=1 to 10;
   uid=i;
   if i > 4 then output a b c d e;
   else if i > 3 then output a b c d;
   else if i>2 then output a b c;
   else if i>1 then output a b;
   else output a;
   end; 
   drop i;
run;

%* Determine Datasets to check;
PROC SQL;
   CREATE VIEW v_all as
   SELECT memname,nobs 
   from sashelp.vtable
   where libname eq 'WORK' /*adapt*/
   and memname not in ('__COMPARISON', 'V_ALL')
   ;
   CREATE TABLE __COMPARISON AS
   select a.memname as current,a.nobs as currentnobs, b.memname as next, . as similarity format=percent7.1
   from v_all a,v_all b
   where b.memname>a.memname
   order by a.memname,b.memname
   ;
   DROP VIEW v_all;
QUIT;

%* Prepare comparison macro;
%MACRO compare(uniqueIDVar,current,next);
   PROC SQL;
      update __COMPARISON SET similarity=(
         (select count(*) from &current. where &uniqueIDVar. in (select &uniqueIDVar. from &next.))/currentnobs
      ) where current eq "&current." and next eq "&next."
      ;  
   QUIT;
%MEND compare;

%* Run comparisons;
DATA _NULL_;
   set __COMPARISON;
   call execute('%nrstr(%compare(uid,'||strip(current)||','||strip(next)||'))');
RUN;

Cheers,

 

________________________

- Cheers -

User12
Obsidian | Level 7

Hi @Oligolas that was an excellent poc that helped me out. But it is comparing table c with table d and e only but i want to compare table c with table a and table b too .

And also i have nearly 100 variables in each dataset to compare can you help me out there

And another concern is what if there is no common variable in two datasets.


Thanks very much

Oligolas
Barite | Level 11

Hi,

if you already compared a with c, you do not need to compare c with a.
Regarding the 100 variables... is this an issue you still require assistance on? if so, are these 100 variables always available in all datasets? if you have those 100 variables in a and let's say only 85 of them in b (15 variables of 'a' are'nt available in 'b'), what should be the result of the comparison?

Can you perhaps provide 3-4 datasets for testing?

________________________

- Cheers -

Shmuel
Garnet | Level 18

If all datasets are selections from a dbms, why not replace them

by one dataset which is a "mirror" of the dbms, and delete the all selections.

 

Then each user will create its own temporary selection according to his/her needs,

or even create views onle of that "mirror".

User12
Obsidian | Level 7
Hi @Shmuel they are from dbms but they have been pulled into sas and stored in sas already so i need to eliminate the unnecessary ones

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1071 views
  • 1 like
  • 3 in conversation