02-09-2017 05:19 AM
Is there somebody who knows how to delete a dataset from a library if another dataset has exactly the same content?
I have a long, long, list of datasets representing each calender day of several years. At some dates the dataset itself is not updated, and I'd like to keep only those that tracks any change. The amount of duplicates may vary.
Thank you in advance!
02-09-2017 07:27 AM
Thank you @draycut. This solutions is however too simple since I have about 500 datasets, and I do not know which of them that are equal. They contain about 5000 observations each so I cannot check it either. What I am looking for is some commands that can perform something like
for dataset 1 to 500
if dataset_i = dataset_:
02-09-2017 11:45 AM
You can use the information that SAS keeps track of to reduce the likely candidates.
Proc sql; create table possibledup as select memname, b.memname as possibledup from (select * dictionary.tables where libname='LIB') as a left join (select * dictionary.tables where libname='LIB') as b on a.nobs=b.nobs and a.nvar=b.nvar and a.num_character=b.num_character and a.num_numeric=b.num_numeric where a.memname ne b.memname and a.memname<b.memname order by a.nobs,a.nvar; quit;
will match up data sets in the library, replace LIB with the name of your library in UPPER case, that have the same number of variables of the same types and the same number of observations.
After you have the likely candidates you could use proc sql to look at dictionary.columns to see if all of the the variable names match.
If so then you will get into some manual steps as comparison of values, possibly best done by proc compare will require sorting each pair of candidate data sets by the same variables for the comparison to have much of a chance. The brief summaries from Proc Compare are likely sufficient.
Note that I would also examine your current processes that allow possible creations of duplications.
Another approach might be to combine like structured data sets, which you sort of imply is common, and then sort a combined data set with NODUP and then have a few datasets with all of the data. Append as needed going forward.
02-09-2017 12:20 PM - edited 02-10-2017 12:01 PM
The following macro will move duplicate datasets within the first library to a second library.
By duplicate datasets I assume, EXACTLY the same layout (variable names, types and lenghts) and EXACTLY the same data.
The code is not simple easy, but not from the other world also. I've placed some comments at every step.
First it will compare layouts... Then, for the ones with the same layout it will run a PROC COMPARE in pairs to check if data is exactly the same... Finally it will move one of them to the second LIB.
It will create some work tables in the WORK library, so obviously do not use WORK as INLIB/OUTIB.
%macro move_dup_dataset_from_lib(INLIB,OUTLIB); proc contents data=&INLIB.._ALL_ out=WORK._ALL memtype=DATA noprint; run; * retrieve all contents in INLIB; proc sql noprint; create table WORK._DUPS as select A1.MEMNAME, A1.NAME, A1.TYPE, A1.LENGTH, A1.NOBS, A2.TOTVAR from _ALL as A1 left join (select MEMNAME, count(*) as TOTVAR from _ALL group by MEMNAME) as A2 on A1.MEMNAME = A2.MEMNAME where A1.MEMNAME not in (select distinct MEMNAME from _ALL as B1, (select NAME, TYPE, LENGTH, NOBS, count(*) as COUNT from _ALL group by NAME, TYPE, LENGTH having COUNT = 1) as B2 where B1.NAME = B2.NAME and B1.TYPE = B2.TYPE and B1.LENGTH = B2.LENGTH); quit; * retrieve duplicated layouts (same name, type and length); proc sort; by NAME TYPE LENGTH NOBS TOTVAR; run; data WORK._PAIRS; set WORK._DUPS; retain BASE; keep BASE DATA; by NAME TYPE LENGTH NOBS TOTVAR; DATA=MEMNAME; if first.TOTVAR + last.TOTVAR lt 2; if first.TOTVAR then BASE=MEMNAME; else output; proc sort noduprecs; by BASE DATA; run; * eliminate datasets with layouts/data within other and create pairs; %let _PAIRS=; proc sql noprint; select BASE, DATA into :_BASES separated by ' ', :_DATAS separated by ' ' from WORK._PAIRS; quit; * load pairs lists into macro variables for cycling; * cycle through pairs and move if the same; %do _I=1 %to %sysfunc(countw(&_BASES,%str( ))); %let _BASE=&INLIB..%scan(&_BASES,&_I); * get one base; %let _DATA=&INLIB..%scan(&_DATAS,&_I); * get one data; %if %sysfunc(exist(&_DATA)) %then %do; proc contents data=&_BASE out=WORK._CONTENT memtype=DATA noprint; run; * get contents; proc sql noprint; select NAME into :_NAMES separated by ' ' from WORK._CONTENT; quit; * get var names; proc sort data=&_BASE; by &_NAMES; * sort everything; proc sort data=&_DATA; by &_NAMES; * sort everything; run; * order; proc compare base=&_BASE compare=&_DATA out=WORK._DIF outnoequal noprint; run; * compare both; data _null_; call symput('_DIF','0'); run; * assume they are not equal; proc sql noprint; select count(*) eq 0 into :_DIF from WORK._DIF; quit; * check if any diffs; %if &_DIF %then %do; %put *** &_BASE same as &_DATA, moving the latest to &OUTLIB library ***; proc datasets lib=&INLIB nolist; copy out=&OUTLIB move; select %scan(&_DATA,2,%str(.)); quit; * move it; %end; %end; %end; proc datasets lib=WORK nolist; delete _ALL _CONTENT _DIF _DUPS _PAIRS; quit; * delete work tables; %mend move_dup_dataset_from_lib; %move_dup_dataset_from_lib(MYLIB,DUPS);
Hope it helps.
Daniel Santos @ www.cgd.pt
02-09-2017 01:47 PM
I start encountering errors at this line:
proc contents data=&_BASE out=_CONTENT noprint; run; * get contents;
The error is :
ERROR: File USER.TEST1.DATA does not exist.
I suspect others that don't use a USER library would get WORK.TEST1. TEST1 is a set in my MYLIB library created to test this.
I think the library is getting left out of the _BASES and _DATAS macro variables.
02-10-2017 03:38 AM
02-10-2017 11:25 AM
@DanielSantos I may be misunderstanding the OP requirement to include equal content and not just data set structure/size with his requirement "delete a dataset from a library if another dataset has exactly the same content"
I created 4 data sets to test your macro:
data mylib.test1; input x y; datalines; 1 2 3 4 5 6 ; run; data mylib.test2; set mylib.test1; run; data mylib.test3; input y x ; datalines; 1 2 3 4 5 6 ; run; data mylib.test4; input y x ; datalines; 21 2 43 4 65 6 ; run;
And the result has only TEST1 remaining in the MYLIB. If I were worried about the content of the data sets then TEST2 would be the only one removed.
As a minor aside, I have a minor issue with coding this
proc datasets lib=WORK nolist;
delete _ALL _CONTENT _DIF _DUPS _PAIRS;
when you did not explicitly place the _ALL _CONTENT _DIF _DUPS _PAIRS in the work library. I am one of those odd ducks that use a USER library and the temporary sets go to USER not WORK.
Though I do appreciate attempts to clean up unneeded sets.
02-10-2017 11:56 AM
Again you are absolutely right, thank you for noticing. Didn't had much time to a propper testing.
I had mixed compile/runtime macro handling inside the loop, which compromised the value of _DIF var.
I've corrected the code, It now works properly with your example (only test2 is moved).
Oh, and yes... That's a bad coding habit of mine, the implicit specification of the WORK library for auxiliary data.
Thank you very much for the feedback.
Daniel Santos @ www.cgd.pt
02-09-2017 06:42 PM
I don't know if it's helpful or not, but another way to go about this would be to just use a hash function (SHA1, MD5, etc) on the datasets and see which ones have the same hash value. This could be coded in SAS using the X command or using a shell script or from the terminal. This assumes the datasets are identical other than the name of the file though and that your in a Linux environment (it's possibly in windows too I assume, but not as straightforward).
02-16-2017 09:05 AM
proc sort data= HAVE nodupkey out = WANT;
by Var1 Var2 Var3... ; * All variables that must be equal if the observation should be deleted ;
to delete the undesired observations. This felt most safe since I could understand everything in that code.
Again, thank you for your effort.
03-09-2017 04:07 PM
how abot using return codes from proc compare
proc sort data=a1;
proc sort data=a2;
proc compare base=a1 compare=a2;
%put 'RC' &rc;
if &rc=0 then
proc datasets lib=work;
You can combine the above proc sql/contents to get library/member names.