Re: Delete Dataset if a Duplicate Exists

MiniRadde · Posted 02-09-2017 05:19 AM

Hello.

Is there somebody who knows how to delete a dataset from a library if another dataset has exactly the same content?

I have a long, long, list of datasets representing each calender day of several years. At some dates the dataset itself is not updated, and I'd like to keep only those that tracks any change. The amount of duplicates may vary.

Thank you in advance!

PeterClemmensen · Posted 02-09-2017 05:36 AM

This blog post will probably be of help:

http://blogs.sas.com/content/iml/2011/06/27/check-that-a-sas-data-set-exists-and-other-data-maintena...

The DATA to DATA Step Macro
Blog: SASnrd

MiniRadde · Posted 02-09-2017 07:27 AM

Thank you @PeterClemmensen. This solutions is however too simple since I have about 500 datasets, and I do not know which of them that are equal. They contain about 5000 observations each so I cannot check it either. What I am looking for is some commands that can perform something like

for dataset 1 to 500

if dataset_i = dataset_:

drop dataset

end

ballardw · Posted 02-09-2017 11:45 AM

You can use the information that SAS keeps track of to reduce the likely candidates.

Proc sql;
   create table possibledup as
   select memname, b.memname as possibledup
   from (select * dictionary.tables where libname='LIB') as a left join 
        (select * dictionary.tables where libname='LIB') as b
       on a.nobs=b.nobs and a.nvar=b.nvar and
       a.num_character=b.num_character and a.num_numeric=b.num_numeric
   where a.memname ne b.memname and a.memname<b.memname
   order by a.nobs,a.nvar;
quit;

will match up data sets in the library, replace LIB with the name of your library in UPPER case, that have the same number of variables of the same types and the same number of observations.

After you have the likely candidates you could use proc sql to look at dictionary.columns to see if all of the the variable names match.

If so then you will get into some manual steps as comparison of values, possibly best done by proc compare will require sorting each pair of candidate data sets by the same variables for the comparison to have much of a chance. The brief summaries from Proc Compare are likely sufficient.

Note that I would also examine your current processes that allow possible creations of duplications.

Another approach might be to combine like structured data sets, which you sort of imply is common, and then sort a combined data set with NODUP and then have a few datasets with all of the data. Append as needed going forward.

DanielSantos · Posted 02-09-2017 12:20 PM

Hi.

The following macro will move duplicate datasets within the first library to a second library.

By duplicate datasets I assume, EXACTLY the same layout (variable names, types and lenghts) and EXACTLY the same data.

The code is not simple easy, but not from the other world also. I've placed some comments at every step.

First it will compare layouts... Then, for the ones with the same layout it will run a PROC COMPARE in pairs to check if data is exactly the same... Finally it will move one of them to the second LIB.

It will create some work tables in the WORK library, so obviously do not use WORK as INLIB/OUTIB.

%macro move_dup_dataset_from_lib(INLIB,OUTLIB);

proc contents data=&INLIB.._ALL_ out=WORK._ALL memtype=DATA noprint; 
run; * retrieve all contents in INLIB;

proc sql noprint;
create table WORK._DUPS as
select A1.MEMNAME, A1.NAME, A1.TYPE, A1.LENGTH, A1.NOBS, A2.TOTVAR from _ALL as A1
left join 
(select MEMNAME, count(*) as TOTVAR from _ALL group by MEMNAME) as A2
on A1.MEMNAME = A2.MEMNAME
where A1.MEMNAME not in 
(select distinct MEMNAME from _ALL as B1,
(select NAME, TYPE, LENGTH, NOBS, count(*) as COUNT 
from _ALL group by NAME, TYPE, LENGTH having COUNT = 1) as B2
where B1.NAME = B2.NAME and B1.TYPE = B2.TYPE and B1.LENGTH = B2.LENGTH);
quit; * retrieve duplicated layouts (same name, type and length);

proc sort; 
by NAME TYPE LENGTH NOBS TOTVAR; 
run;

data WORK._PAIRS;
     set WORK._DUPS;
	 retain BASE;
	 keep BASE DATA;
	 by NAME TYPE LENGTH NOBS TOTVAR;
	 DATA=MEMNAME;
	 if first.TOTVAR + last.TOTVAR lt 2;
     if first.TOTVAR then BASE=MEMNAME;
     else output;
proc sort noduprecs;
by BASE DATA;
run; * eliminate datasets with layouts/data within other and create pairs;

%let _PAIRS=;
proc sql noprint;
select BASE, DATA into :_BASES separated by ' ', :_DATAS separated by ' ' from WORK._PAIRS;
quit; * load pairs lists into macro variables for cycling;

* cycle through pairs and move if the same;
%do _I=1 %to %sysfunc(countw(&_BASES,%str( )));
%let _BASE=&INLIB..%scan(&_BASES,&_I); * get one base;
%let _DATA=&INLIB..%scan(&_DATAS,&_I); * get one data;
%if %sysfunc(exist(&_DATA)) %then %do;
proc contents data=&_BASE out=WORK._CONTENT memtype=DATA noprint; 
run; * get contents;
proc sql noprint;
select NAME into :_NAMES separated by ' ' from WORK._CONTENT;
quit; * get var names;

proc sort data=&_BASE; by &_NAMES; * sort everything;
proc sort data=&_DATA; by &_NAMES; * sort everything;
run; * order;
proc compare base=&_BASE compare=&_DATA
             out=WORK._DIF outnoequal noprint;
run; * compare both; 

data _null_;
     call symput('_DIF','0');
run; * assume they are not equal;
proc sql noprint;
select count(*) eq 0 into :_DIF from WORK._DIF;
quit; * check if any diffs;
%if &_DIF %then %do;
%put *** &_BASE same as &_DATA, moving the latest to &OUTLIB library ***;
proc datasets lib=&INLIB nolist;
copy out=&OUTLIB move;
select %scan(&_DATA,2,%str(.));
quit; * move it;
%end;
%end;
%end;

proc datasets lib=WORK nolist;
delete _ALL _CONTENT _DIF _DUPS _PAIRS;
quit; * delete work tables;

%mend move_dup_dataset_from_lib;

%move_dup_dataset_from_lib(MYLIB,DUPS);

Hope it helps.

Daniel Santos @ www.cgd.pt

ballardw · Posted 02-09-2017 01:47 PM

@DanielSantos

I start encountering errors at this line:

proc contents data=&_BASE out=_CONTENT noprint; 
run; * get contents;

The error is :

ERROR: File USER.TEST1.DATA does not exist.

I suspect others that don't use a USER library would get WORK.TEST1. TEST1 is a set in my MYLIB library created to test this.

I think the library is getting left out of the _BASES and _DATAS macro variables.

DanielSantos · Posted 02-10-2017 03:38 AM

@ballardw

You are right!.. Seems I've posted an incorrect version of this macro.

I have edited the post with the correct code, previous one would also fail with false positive for any subset of a table.

Thank you very much for noticing!

Daniel Santos @ www.cgd.pt

ballardw · Posted 02-10-2017 11:25 AM

@DanielSantos I may be misunderstanding the OP requirement to include equal content and not just data set structure/size with his requirement "delete a dataset from a library if another dataset has exactly the same content"

I created 4 data sets to test your macro:

data mylib.test1;
   input x y;
datalines;
1 2
3 4
5 6
;
run;
data mylib.test2;
   set mylib.test1;
run;
data mylib.test3;
   input y x ;
datalines;  
1 2
3 4
5 6
;
run;
data mylib.test4;
   input y x ;
datalines;  
21 2
43 4
65 6
;
run;

And the result has only TEST1 remaining in the MYLIB. If I were worried about the content of the data sets then TEST2 would be the only one removed.

As a minor aside, I have a minor issue with coding this

proc datasets lib=WORK nolist;

delete _ALL _CONTENT _DIF _DUPS _PAIRS;

quit;

when you did not explicitly place the _ALL _CONTENT _DIF _DUPS _PAIRS in the work library. I am one of those odd ducks that use a USER library and the temporary sets go to USER not WORK.

Though I do appreciate attempts to clean up unneeded sets.

DanielSantos · Posted 02-10-2017 11:56 AM

@ballardw

Again you are absolutely right, thank you for noticing. Didn't had much time to a propper testing.

I had mixed compile/runtime macro handling inside the loop, which compromised the value of _DIF var.

I've corrected the code, It now works properly with your example (only test2 is moved).

Oh, and yes... That's a bad coding habit of mine, the implicit specification of the WORK library for auxiliary data.

Thank you very much for the feedback.

Daniel Santos @ www.cgd.pt

Sven111 · Posted 02-09-2017 06:42 PM

I don't know if it's helpful or not, but another way to go about this would be to just use a hash function (SHA1, MD5, etc) on the datasets and see which ones have the same hash value. This could be coded in SAS using the X command or using a shell script or from the terminal. This assumes the datasets are identical other than the name of the file though and that your in a Linux environment (it's possibly in windows too I assume, but not as straightforward).

MiniRadde · Posted 02-16-2017 09:05 AM

Thank you so much @DanielSantos and @ballardw for the great effort. I chose to set all the datasets in a new one, although the file became enourmous, and then I used

proc sort data= HAVE nodupkey out = WANT;
by Var1 Var2 Var3... ; * All variables that must be equal if the observation should be deleted ;
run;

to delete the undesired observations. This felt most safe since I could understand everything in that code.

Again, thank you for your effort.

atzamis · Posted 03-09-2017 04:07 PM

Hi,

how abot using return codes from proc compare

data a1;
x=5;
run;

proc sort data=a1;

by ...;

run;

data a2;
x=5;
run;

proc sort data=a2;

by ...;

run;

proc compare base=a1 compare=a2;
run;

%let rc=&sysinfo;
%put 'RC' &rc;

data _null_;

if &rc=0 then

    call execute('
     proc datasets lib=work;
         delete a2;
      run;

');
run;

You can combine the above proc sql/contents to get library/member names.

http://support.sas.com/documentation/cdl/en/proc/69850/HTML/default/viewer.htm#n1jbbrf1tztya8n1tju77...

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away