Hi Guys,
I have a situation where i have thousands of sas datasets.In which there are duplicates too.
I need to find out the top 10 duplicate datasets which are taking most of my sas memory out of them.
When i say duplicates if there is a 75-80% match with other datasets, i consider them as duplicates.
Any help to solve this situation would be appreciated.
Thanks
Hi,
I hope you realize you will have to make a comparison of all datasets together.
For 1000 datasets that means C(1000,2)=1000!/(2!(1000-2)!)=499500 comparisons...
Not sure how long it will take.
Try this:
%* Create test data;
data a b c d e;
do i=1 to 10;
uid=i;
if i > 4 then output a b c d e;
else if i > 3 then output a b c d;
else if i>2 then output a b c;
else if i>1 then output a b;
else output a;
end;
drop i;
run;
%* Determine Datasets to check;
PROC SQL;
CREATE VIEW v_all as
SELECT memname,nobs
from sashelp.vtable
where libname eq 'WORK' /*adapt*/
and memname not in ('__COMPARISON', 'V_ALL')
;
CREATE TABLE __COMPARISON AS
select a.memname as current,a.nobs as currentnobs, b.memname as next, . as similarity format=percent7.1
from v_all a,v_all b
where b.memname>a.memname
order by a.memname,b.memname
;
DROP VIEW v_all;
QUIT;
%* Prepare comparison macro;
%MACRO compare(uniqueIDVar,current,next);
PROC SQL;
update __COMPARISON SET similarity=(
(select count(*) from ¤t. where &uniqueIDVar. in (select &uniqueIDVar. from &next.))/currentnobs
) where current eq "¤t." and next eq "&next."
;
QUIT;
%MEND compare;
%* Run comparisons;
DATA _NULL_;
set __COMPARISON;
call execute('%nrstr(%compare(uid,'||strip(current)||','||strip(next)||'))');
RUN;
Cheers,
- Cheers -
Are all datasets of same format? - same variable names?
Are there variables that exist only in part of the datasets?
What variables (IDs) are need to check matching?
Can it be that matchong observations by ID have different attributes in the other variables ?
Why or what created the different datasets? What created the duplicates ?
What do you want to do with "duplicate" datasets, as you defined ? or with the matching observations ?
Would you like to merge or create one dataset of all unique IDs ?
If yes - what would be the criteria for saving variables from one dataset and drop from the other ?
Hi @Shmuel the answer is a big yes to first three questions there can be all three situations and the id variable will be a unique identifier.
i am not clear with the 4th question and fifth question multiple users pulling same dataset from rdbms for different purposes or maybe same purposes has created the duplicates and if i can find the duplicates exactly i just want to delete them
Thanks
Hi,
I hope you realize you will have to make a comparison of all datasets together.
For 1000 datasets that means C(1000,2)=1000!/(2!(1000-2)!)=499500 comparisons...
Not sure how long it will take.
Try this:
%* Create test data;
data a b c d e;
do i=1 to 10;
uid=i;
if i > 4 then output a b c d e;
else if i > 3 then output a b c d;
else if i>2 then output a b c;
else if i>1 then output a b;
else output a;
end;
drop i;
run;
%* Determine Datasets to check;
PROC SQL;
CREATE VIEW v_all as
SELECT memname,nobs
from sashelp.vtable
where libname eq 'WORK' /*adapt*/
and memname not in ('__COMPARISON', 'V_ALL')
;
CREATE TABLE __COMPARISON AS
select a.memname as current,a.nobs as currentnobs, b.memname as next, . as similarity format=percent7.1
from v_all a,v_all b
where b.memname>a.memname
order by a.memname,b.memname
;
DROP VIEW v_all;
QUIT;
%* Prepare comparison macro;
%MACRO compare(uniqueIDVar,current,next);
PROC SQL;
update __COMPARISON SET similarity=(
(select count(*) from ¤t. where &uniqueIDVar. in (select &uniqueIDVar. from &next.))/currentnobs
) where current eq "¤t." and next eq "&next."
;
QUIT;
%MEND compare;
%* Run comparisons;
DATA _NULL_;
set __COMPARISON;
call execute('%nrstr(%compare(uid,'||strip(current)||','||strip(next)||'))');
RUN;
Cheers,
- Cheers -
Hi @Oligolas that was an excellent poc that helped me out. But it is comparing table c with table d and e only but i want to compare table c with table a and table b too .
And also i have nearly 100 variables in each dataset to compare can you help me out there
And another concern is what if there is no common variable in two datasets.
Thanks very much
Hi,
if you already compared a with c, you do not need to compare c with a.
Regarding the 100 variables... is this an issue you still require assistance on? if so, are these 100 variables always available in all datasets? if you have those 100 variables in a and let's say only 85 of them in b (15 variables of 'a' are'nt available in 'b'), what should be the result of the comparison?
Can you perhaps provide 3-4 datasets for testing?
- Cheers -
If all datasets are selections from a dbms, why not replace them
by one dataset which is a "mirror" of the dbms, and delete the all selections.
Then each user will create its own temporary selection according to his/her needs,
or even create views onle of that "mirror".
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.