Hi all,
We have 10tb+ of 10k+ sas7bdat files in a complex folder structure. What would be the fastest and most efficient way to check all of them for file damage (not repair), say due to download/transfer problems, corrupted sectors and what not?
Thank you,
A
Perhaps use a macro or some other file crawler to find them all, then run a job where you read each dataset in its own data _NULL_ step. That will force SAS to iterate through all the data. And the log should should show any errors for failing to read corrupted data.
That said, I don't have much experiences with corrupted SAS datasets (happily). It might be overkill to read through every record of every dataset. But if you pass that test, I think it would be good evidence that your datasets are ok.
Thank you! The _NULL_ part makes complete sense.
Use dirsAndFiles
macro basePlus package, documentation is here: https://github.com/SASPAC/baseplus/blob/main/baseplus.md#dirsandfiles-macro
To install BasePlus use SAS Package Framework: https://github.com/yabwon/SAS_PACKAGES
Do the testing like this:
*) first 3 lines = enapbe SPF (temporary setup)
*) next 2 lines = install basePlus and and load only dirsAndFiles
macro
*) call to the macro will create WORK.SASDatasets
dataset containing all sas dataset in /path/to/your/data/
path
*) create temporary file for log and start logging into _T_
*) turn off some options
*) for each observation from WORK.SASDatasets
set run call execute which will run small data set which tries to open sas dataset
*) turn off logging into _T_
*) read _T_ for errors
filename packages "%sysfunc(pathname(work))"; /* setup WORK as temporary directory for packages */
filename SPFinit url "https://raw.githubusercontent.com/yabwon/SAS_PACKAGES/main/SPF/SPFinit.sas";
%include SPFinit; /* enable the framework */
%installPackage(BasePlus) /* install package */
%loadPackage(BasePlus, cherryPick=dirsAndFiles)
%dirsAndFiles(
/path/to/your/data/
,ODS=WORK.SASDatasets
,details=0
,keepDirs=0
,keepFiles=1
,longFormat=1
,fileExt=sas7bdat
)
filename _T_ TEMP;
proc printto log =_T_;
run;
options NOFMTERR NOSOURCE NOSTIMER; /* !!! */
data _null_;
set WORK.SASDatasets;
by dname notsorted;
if first.dname then
call execute(catx(" ","libname test",quote(catx("/",root,dname)), ";"));
call execute( 'data _null_;' );
call execute( 'do ___point=1,ceil(___nobs/2),___nobs;' );
call execute( 'set test.' !! strip(scan(fn,1,".")) !! ' nobs=___nobs point=___point;');
call execute( 'end;stop;' );
call execute( 'run;' );
if last.dname then
call execute("libname test clear;");
run;
proc printto;
run;
data _null_;
run;
data _null_;
infile _T_;
input;
IF _INFILE_ =: "ERROR:" then
do;
PUT _N_=;
put _INFILE_ /;
end;
run;
/*
filename _T_ clear;
*/
Bart
Wow, thank you very much! Had that repo starred but didn't install it. Let me try this approach.
Hi,
Code works great! One detail though, could it be possible to output the folder or the path where the file is located?
I get this output
_N_=8159 ERROR: File TEST.NYAM_QT1983.DATA is damaged. I/O processing did not complete.
But there are several folders with the same filename so it's hard to know which file exactly is damaged.
Thank you!
First modify data step to name libraries with unique name:
data _null_;
set WORK.SASDatasets;
by dname notsorted;
if first.dname then
do;
lib+1;
call execute(catx(" ","libname t"!!strip(put(lib,best.)),quote(catx("/",root,dname)), ";"));
end;
call execute( 'data _null_;' );
call execute( 'do ___point=1,ceil(___nobs/2),___nobs;' );
call execute( 'set t' !! strip(put(lib,best.)) !! '.' !! strip(scan(fn,1,".")) !! ' nobs=___nobs point=___point;');
call execute( 'end;stop;' );
call execute( 'run;' );
if last.dname then
call execute("libname t" !! strip(put(lib,best.)) !! " clear;");
run;
so for every directory you will get unique library name string to search for.
And then search for the name:
data _null_;
infile _T_;
problematic_lib="T2"; /* <<<<<<<< */
input;
IF _INFILE_ =: "NOTE: Libref " !! upcase(problematic_lib) !! " was successfully" then
do;
PUT _N_=;
put _INFILE_ ;
input;
input;
put _INFILE_ / /;
end;
run;
Bart
Or even faster, without scanning log file:
data libraries( where = (library in ("T2" "T5")) );
set WORK.SASDatasets;
by dname notsorted;
if first.dname then
do;
lib+1;
library = cats("T",lib);
length path $ 2048;
path = catx("/",root,dname);
output;
end;
keep library path;
run;
Bart
Hi,
By the way, I wanted to say thank you for your efforts in developing the SAS packages framework! This should be recognized by SAS at the highest level.
About the list, I have 100s of files incomplete in what it seems 10s of different paths/libs with about 2 to 5 levels. Would it be possible for an error output to be like a list:?
z:\dir a\dir 1\ dir cc\xyz123.bdat
z:\dir f\dir 22\ dir dd\abc123.bdat
.....
This way i don't have to manually look for libs with the problematic files.
It also seems the errors are different some are not sas7bdat files at all and some are incomplete. Your original code by the way is very fast given the amount of data it looks at. It definitely doesn't read all of the files completely but can still identify problems.
Thank you very much!
Thanks for kind words, it's nice to hear that SPF is of value to someone 🙂
Lets back to the subject, try this (adjust paths for packages and search path):
filename packages "C:\SAS_WORK\SAS_PACKAGES"; /* directory for packages */
%include packages(SPFinit.sas);
%loadPackage(BasePlus, cherryPick=dirsAndFiles)
%dirsAndFiles(
C:\SAS_WORK /*/path/to/your/data/ */
,ODS=WORK.SASDatasets
,details=0
,keepDirs=0
,keepFiles=1
,longFormat=1
,fileExt=sas7bdat
)
filename _T_ TEMP;
proc printto log =_T_;
run;
options NOFMTERR NOSOURCE NOSTIMER LS=MAX PS=MAX; /* !!! */
data _null_;
set WORK.SASDatasets;
by dname notsorted;
if first.dname then
do;
lib+1;
call execute(catx(" ","libname t"!!strip(put(lib,best.)),quote(catx("/",root,dname)), ";"));
end;
call execute( 'data _null_;' );
call execute( 'do ___point=1,ceil(___nobs/2),___nobs;' );
call execute( 'set t' !! strip(put(lib,best.)) !! '.' !! strip(scan(fn,1,".")) !! ' nobs=___nobs point=___point;');
call execute( 'end;stop;' );
call execute( 'run;' );
if last.dname then
call execute("libname t" !! strip(put(lib,best.)) !! " clear;");
run;
proc printto;
run;
options FMTERR SOURCE STIMER;
data _null_;
run;
data ListOfErrors(where=(ErrTXT like '%.DATA%'));
infile _T_;
input;
length ErrTXT $ 256;
IF _INFILE_ =: "ERROR:" then
do;
put _N_ @10 ">>>" _INFILE_ /;
n = _N_;
ErrTXT = _infile_;
output;
end;
run;
data libraries (compress=char);
set WORK.SASDatasets;
by dname notsorted;
if first.dname then
do;
lib+1;
length library $ 8 path $ 2048;
library = cats("T",lib);
path = catx("/",root,dname);
output;
end;
keep library path;
run;
proc sort data=libraries;
by library;
run;
data listOfBrokenData;
set ListOfErrors;
p = PRXMATCH('/T(\d+)\.([1-9A-Z_]+)\.DATA/', ErrTXT);
length library $ 8 ds $ 51;
ds = scan(substr(ErrTXT,p),1," ");
library = scan(ds,1,".");
ds = lowcase(scan(ds,2,".")!!'.sas7bdat');
keep library ds;
run;
proc sort data=listOfBrokenData nodupkey;
by library ds;
run;
data listOfBrokenData;
merge listOfBrokenData(in=in) libraries;
by library;
if in;
path=catx('/',path,ds);
run;
proc sort data=listOfBrokenData sortseq=linguistic(numeric_colation=on);
by library ds;
run;
proc print;
run;
Bart
/*
You could try REPAIR statement in PROC DATASETS.
If there were some datasets corrupted ,you will find ERROR in log.
P.S. assuming you are using Windows and could run OS command by sas.
*/
%let path= c:\temp ;
options validvarname=any validmemname=extend;
filename x pipe %sysfunc(quote(dir "&path.\*.sas7bdat" /s /b));
data path;
infile x length=len;
input path $varying200. len;
lib=prxchange('s/[^\\]+$//',1,strip(path));
dsn=scan(path,-2,' .\');
run;
data _null_;
set path;
by lib notsorted;
if first.lib then call execute(catt('libname x v9 "',lib,'";%put NOTE:path=',lib,
';proc datasets library=x nolist nodetails memtype=data;repair '));
call execute(dsn);
if last.lib then call execute(';quit;libname x clear;');
run;;
Hi KSharp,
Thank you! Is there a way to make it identify the problems and not try to repair the dataset? I am not familiar with this proc function. In case there is an error, we will just redownload the file. We definitely don't want to repair them or alter them in any way at this stage.
Is there a way to change the 2nd block with proc datasets to make it just test the file without repairing?
@astrae_research Below some code that will list all SAS tables (.sas7bdat files) that SAS couldn't open successfully.
The os dir command will return all existing .sas7bdat files under &root_path recursively (I'd be using the find command for Linux).
The SAS open() function will then attempt to open these files.
I've opened as SAS file with a text editor and just deleted a few bytes. That's the one which gets captured in below example.
%let root_path=c:\temp;
filename filelist pipe "dir /B/S &root_path.\*.sas7bdat";
data work.failed_SASTable_open;
length pathAndSASFile $1000 msg $256;
keep pathAndSASFile msg;
infile filelist truncover;
input pathAndSASFile $1000.;
_did=open(strip(pathAndSASFile),,,'f');
if _did =0 then
do;
msg=sysmsg();
output;
end;
else
do;
_rc=close(_did);
end;
run;
filename filelist clear;
Given your use case you can of course enhance above to only include files create after some date. The basic idea is to create a list of *.sas7bdat files where you store the full path and filename including the suffix and then use the SAS open() function with the f parameter to test if you can open the file.
A different approach would be to create a hash (checksum) of your files in source, use the same algorithm in target to also create a hash and then compare the two.
If you're moving data from a recent SAS version in source then you could use SAS function HASHING_FILE() - else you would need some other tool for this.
@Patrick Great idea!! I wish I came up with
_did=open(strip(pathAndSASFile),,,'f');
myself. Definitely more elegant than my "log scanning". [jealous face emoji] 😉
One thing I would add is that I would wrapped it up inside DuSubL() function, so all that temporary "WC000000XXX" libraries, created when open() function works, won't be assigned in the main session:
filename packages "C:\SAS_WORK\SAS_PACKAGES"; /* directory for packages */
%include packages(SPFinit.sas);
%loadPackage(BasePlus, cherryPick=dirsAndFiles)
%dirsAndFiles(
C:\SAS_WORK /*/path/to/your/data/ */
,ODS=WORK.SASDatasets
,details=0
,keepDirs=0
,keepFiles=1
,longFormat=1
,fileExt=sas7bdat
)
data _null_;
rc= DoSubL("
filename _T_ DUMMY;
proc printto log =_T_;
run;
data work.failed_SASTable_open;
set WORK.SASDatasets;
length pathAndSASFile $ 2048;
pathAndSASFile = catx('/',root,dname,fn);
_did=open(strip(pathAndSASFile),,,'f');
if _did =0 then
do;
msg=sysmsg();
output;
end;
else
do;
_rc=close(_did);
end;
keep pathAndSASFile msg;
run;
proc printto;
run;
");
run;
Bart
P.S. @PaigeMiller - maybe you will like this "practical" use of DoSubL().
Hi Bart,
I wasn't even aware that the open() function creates all these libraries in the background and I'm a bit disappointed that the close() function doesn't clean them up.
I've done a test with a pre-existing wc000012 library and SAS recognized that there was already such a pre-existing libref and didn't overwrite it but just jumped to the next available number.
I consider using dosubl() makes the code unnecessarily complicated for this use case. Given that this code gets likely run isolated in its own session even all these wc... libraries are imho about cosmetics. If you really need to get rid of them then I'd go for a libname function to clean them out. Below code will do "the right thing" under the assumption that there is no pre-existing wc.... libref.
%let root_path=c:\temp;
filename filelist pipe "dir /B/S &root_path.\*.sas7bdat";
data work.failed_SASTable_open;
length pathAndSASFile $1000 msg $256;
keep pathAndSASFile msg;
infile filelist truncover;
input pathAndSASFile $1000.;
_did=open(strip(pathAndSASFile),,,'f');
if _did =0 then
do;
msg=sysmsg();
output;
end;
else
do;
_rc=close(_did);
_n+1;
_rc=libname(cats('wc',put(_n,z6.)));
end;
run;
filename filelist clear;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.