Re: The fastest/efficient way to check sas7bdat files damage?

astrae_research · Posted 04-25-2023 02:04 PM

Hi all,

We have 10tb+ of 10k+ sas7bdat files in a complex folder structure. What would be the fastest and most efficient way to check all of them for file damage (not repair), say due to download/transfer problems, corrupted sectors and what not?

Thank you,

A

Quentin · Posted 04-25-2023 03:30 PM

Perhaps use a macro or some other file crawler to find them all, then run a job where you read each dataset in its own data _NULL_ step. That will force SAS to iterate through all the data. And the log should should show any errors for failing to read corrupted data.

That said, I don't have much experiences with corrupted SAS datasets (happily). It might be overkill to read through every record of every dataset. But if you pass that test, I think it would be good evidence that your datasets are ok.

BASUG is hosting free webinars Next up: Mark Keintz presenting History Carried Forward, Future Carried Back: Mixing Time Series of Differing Frequencies on May 8. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

astrae_research · Posted 04-25-2023 07:53 PM

Thank you! The _NULL_ part makes complete sense.

yabwon · Posted 04-25-2023 04:12 PM

Use dirsAndFiles macro basePlus package, documentation is here: https://github.com/SASPAC/baseplus/blob/main/baseplus.md#dirsandfiles-macro

To install BasePlus use SAS Package Framework: https://github.com/yabwon/SAS_PACKAGES

Do the testing like this:

*) first 3 lines = enapbe SPF (temporary setup)

*) next 2 lines = install basePlus and and load only dirsAndFiles macro

*) call to the macro will create WORK.SASDatasets dataset containing all sas dataset in /path/to/your/data/ path

*) create temporary file for log and start logging into _T_

*) turn off some options

*) for each observation from WORK.SASDatasets set run call execute which will run small data set which tries to open sas dataset

*) turn off logging into _T_

*) read _T_ for errors

filename packages "%sysfunc(pathname(work))"; /* setup WORK as temporary directory for packages */
filename SPFinit url "https://raw.githubusercontent.com/yabwon/SAS_PACKAGES/main/SPF/SPFinit.sas";
%include SPFinit; /* enable the framework */

%installPackage(BasePlus) /* install package */
%loadPackage(BasePlus, cherryPick=dirsAndFiles) 


%dirsAndFiles(
  /path/to/your/data/
,ODS=WORK.SASDatasets
,details=0
,keepDirs=0
,keepFiles=1
,longFormat=1
,fileExt=sas7bdat
)


filename _T_ TEMP;
proc printto log =_T_;
run;

options NOFMTERR NOSOURCE NOSTIMER; /* !!! */

data _null_;
  set WORK.SASDatasets;
  by dname notsorted;

  if first.dname then
    call execute(catx(" ","libname test",quote(catx("/",root,dname)), ";"));

  call execute( 'data _null_;' );
  call execute( 'do ___point=1,ceil(___nobs/2),___nobs;' );
  call execute( 'set test.' !! strip(scan(fn,1,".")) !! ' nobs=___nobs point=___point;');
  call execute( 'end;stop;' );
  call execute( 'run;' );

  if last.dname then
    call execute("libname test clear;");

run;

proc printto;
run;
data _null_;
run;
 
data _null_;
  infile _T_;
  input;
  IF _INFILE_ =: "ERROR:" then
    do;
      PUT _N_=;
      put _INFILE_ /;
    end;
run;

/*
  filename _T_ clear;
*/

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

astrae_research · Posted 04-25-2023 07:53 PM

Wow, thank you very much! Had that repo starred but didn't install it. Let me try this approach.

astrae_research · Posted 04-25-2023 09:53 PM

Hi,

Code works great! One detail though, could it be possible to output the folder or the path where the file is located?

I get this output

_N_=8159
ERROR: File TEST.NYAM_QT1983.DATA is damaged. I/O processing did not complete.

But there are several folders with the same filename so it's hard to know which file exactly is damaged.

Thank you!

yabwon · Posted 04-26-2023 04:15 AM

First modify data step to name libraries with unique name:

data _null_;
  set WORK.SASDatasets;
  by dname notsorted;

  if first.dname then
    do;
      lib+1;
      call execute(catx(" ","libname t"!!strip(put(lib,best.)),quote(catx("/",root,dname)), ";"));
    end;

  call execute( 'data _null_;' );
  call execute( 'do ___point=1,ceil(___nobs/2),___nobs;' );
  call execute( 'set t' !! strip(put(lib,best.)) !! '.' !! strip(scan(fn,1,".")) !! ' nobs=___nobs point=___point;');
  call execute( 'end;stop;' );
  call execute( 'run;' );

  if last.dname then
    call execute("libname t" !! strip(put(lib,best.)) !! " clear;");
run;

so for every directory you will get unique library name string to search for.

And then search for the name:

data _null_;
  infile _T_;
  problematic_lib="T2"; /* <<<<<<<< */

  input;
  IF _INFILE_ =: "NOTE: Libref " !! upcase(problematic_lib) !! " was successfully" then
    do;
      PUT _N_=;
      put _INFILE_ ;
      input;
      input;
      put _INFILE_ / /;
    end;
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

yabwon · Posted 04-26-2023 04:21 AM

Or even faster, without scanning log file:

data libraries( where = (library in ("T2" "T5")) );
  set WORK.SASDatasets;
  by dname notsorted;

  if first.dname then
    do;
      lib+1;
      library = cats("T",lib);
      length path $ 2048;
      path = catx("/",root,dname);
      output;
    end;
  keep library path;
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

astrae_research · Posted 04-26-2023 07:56 PM

Hi,

By the way, I wanted to say thank you for your efforts in developing the SAS packages framework! This should be recognized by SAS at the highest level.

About the list, I have 100s of files incomplete in what it seems 10s of different paths/libs with about 2 to 5 levels. Would it be possible for an error output to be like a list:?

z:\dir a\dir 1\ dir cc\xyz123.bdat

z:\dir f\dir 22\ dir dd\abc123.bdat

.....

This way i don't have to manually look for libs with the problematic files.

It also seems the errors are different some are not sas7bdat files at all and some are incomplete. Your original code by the way is very fast given the amount of data it looks at. It definitely doesn't read all of the files completely but can still identify problems.

Thank you very much!

yabwon · Posted 04-27-2023 02:38 AM

Thanks for kind words, it's nice to hear that SPF is of value to someone 🙂

Lets back to the subject, try this (adjust paths for packages and search path):

filename packages "C:\SAS_WORK\SAS_PACKAGES"; /* directory for packages */
%include packages(SPFinit.sas);

%loadPackage(BasePlus, cherryPick=dirsAndFiles) 


%dirsAndFiles(
  C:\SAS_WORK /*/path/to/your/data/ */
,ODS=WORK.SASDatasets
,details=0
,keepDirs=0
,keepFiles=1
,longFormat=1
,fileExt=sas7bdat
)


filename _T_ TEMP;
proc printto log =_T_;
run;

options NOFMTERR NOSOURCE NOSTIMER LS=MAX PS=MAX; /* !!! */

data _null_;
  set WORK.SASDatasets;
  by dname notsorted;

  if first.dname then
    do;
      lib+1;
      call execute(catx(" ","libname t"!!strip(put(lib,best.)),quote(catx("/",root,dname)), ";"));
    end;

  call execute( 'data _null_;' );
  call execute( 'do ___point=1,ceil(___nobs/2),___nobs;' );
  call execute( 'set t' !! strip(put(lib,best.)) !! '.' !! strip(scan(fn,1,".")) !! ' nobs=___nobs point=___point;');
  call execute( 'end;stop;' );
  call execute( 'run;' );

  if last.dname then
    call execute("libname t" !! strip(put(lib,best.)) !! " clear;");
run;

proc printto;
run;

options FMTERR SOURCE STIMER;
data _null_;
run;
 
data ListOfErrors(where=(ErrTXT like '%.DATA%'));
  infile _T_;
  input;
  length ErrTXT $ 256;
  IF _INFILE_ =: "ERROR:" then
    do;
      put _N_ @10 ">>>" _INFILE_ /;
      n = _N_;
      ErrTXT = _infile_;
      output;
    end;
run;

data libraries (compress=char);
  set WORK.SASDatasets;
  by dname notsorted;

  if first.dname then
    do;
      lib+1;
      length library $ 8 path $ 2048;
      library = cats("T",lib);
      path = catx("/",root,dname);
      output;
    end;
  keep library path;
run;
proc sort data=libraries;
  by library;
run;


data listOfBrokenData;
  set ListOfErrors;
  p = PRXMATCH('/T(\d+)\.([1-9A-Z_]+)\.DATA/', ErrTXT);
  length library $ 8 ds $ 51;
  ds = scan(substr(ErrTXT,p),1," ");
  library = scan(ds,1,".");
  ds      = lowcase(scan(ds,2,".")!!'.sas7bdat');
  keep library ds;
run;
proc sort data=listOfBrokenData nodupkey;
  by library ds;
run;

data listOfBrokenData;
  merge listOfBrokenData(in=in) libraries;
  by library;
  if in;
  path=catx('/',path,ds);
run;
proc sort data=listOfBrokenData sortseq=linguistic(numeric_colation=on);
  by library ds;
run;
proc print;
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Ksharp · Posted 04-26-2023 07:48 AM

/*
You could try REPAIR statement in PROC DATASETS.
If there were some datasets corrupted ,you will find ERROR in log.
P.S. assuming you are using Windows and could run OS command by sas.
*/

%let path= c:\temp  ;


options validvarname=any validmemname=extend;
filename x pipe %sysfunc(quote(dir "&path.\*.sas7bdat" /s /b));
data path;
infile x length=len;
input path $varying200. len;
lib=prxchange('s/[^\\]+$//',1,strip(path));
dsn=scan(path,-2,' .\');
run;
data _null_;
 set path;
 by lib notsorted;
 if first.lib then call execute(catt('libname x v9 "',lib,'";%put NOTE:path=',lib,
 ';proc datasets library=x nolist nodetails memtype=data;repair '));
 call execute(dsn);
 if last.lib then call execute(';quit;libname x clear;');
run;;

astrae_research · Posted 04-26-2023 08:04 PM

Hi KSharp,

Thank you! Is there a way to make it identify the problems and not try to repair the dataset? I am not familiar with this proc function. In case there is an error, we will just redownload the file. We definitely don't want to repair them or alter them in any way at this stage.

Is there a way to change the 2nd block with proc datasets to make it just test the file without repairing?

Patrick · Posted 04-27-2023 07:32 AM

@astrae_research Below some code that will list all SAS tables (.sas7bdat files) that SAS couldn't open successfully.

The os dir command will return all existing .sas7bdat files under &root_path recursively (I'd be using the find command for Linux).

The SAS open() function will then attempt to open these files.

I've opened as SAS file with a text editor and just deleted a few bytes. That's the one which gets captured in below example.

%let root_path=c:\temp;
filename filelist pipe "dir /B/S &root_path.\*.sas7bdat";
data work.failed_SASTable_open;
  length pathAndSASFile $1000 msg $256;
  keep pathAndSASFile msg;

  infile filelist truncover;
  input pathAndSASFile $1000.;
  _did=open(strip(pathAndSASFile),,,'f');
  if _did =0 then 
    do;
      msg=sysmsg();
      output;
    end;
  else
    do;
      _rc=close(_did);
    end;
run;
filename filelist clear;

Given your use case you can of course enhance above to only include files create after some date. The basic idea is to create a list of *.sas7bdat files where you store the full path and filename including the suffix and then use the SAS open() function with the f parameter to test if you can open the file.

A different approach would be to create a hash (checksum) of your files in source, use the same algorithm in target to also create a hash and then compare the two.
If you're moving data from a recent SAS version in source then you could use SAS function HASHING_FILE() - else you would need some other tool for this.

yabwon · Posted 04-27-2023 08:13 AM

@Patrick Great idea!! I wish I came up with

_did=open(strip(pathAndSASFile),,,'f');

myself. Definitely more elegant than my "log scanning". [jealous face emoji] 😉

One thing I would add is that I would wrapped it up inside DuSubL() function, so all that temporary "WC000000XXX" libraries, created when open() function works, won't be assigned in the main session:

filename packages "C:\SAS_WORK\SAS_PACKAGES"; /* directory for packages */
%include packages(SPFinit.sas);

%loadPackage(BasePlus, cherryPick=dirsAndFiles) 

%dirsAndFiles(
  C:\SAS_WORK /*/path/to/your/data/ */
,ODS=WORK.SASDatasets
,details=0
,keepDirs=0
,keepFiles=1
,longFormat=1
,fileExt=sas7bdat
)

data _null_;
rc= DoSubL("
filename _T_ DUMMY;
proc printto log =_T_;
run;

data work.failed_SASTable_open;

  set WORK.SASDatasets;
  length pathAndSASFile $ 2048;
  pathAndSASFile = catx('/',root,dname,fn);

  _did=open(strip(pathAndSASFile),,,'f');
  if _did =0 then 
    do;
      msg=sysmsg();
      output;
    end;
  else
    do;
      _rc=close(_did);
    end;
  keep pathAndSASFile msg;
run;

proc printto;
run;
");
run;

Bart

P.S. @PaigeMiller - maybe you will like this "practical" use of DoSubL().

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Patrick · Posted 04-27-2023 08:32 AM

@yabwon

Hi Bart,

I wasn't even aware that the open() function creates all these libraries in the background and I'm a bit disappointed that the close() function doesn't clean them up.

I've done a test with a pre-existing wc000012 library and SAS recognized that there was already such a pre-existing libref and didn't overwrite it but just jumped to the next available number.

I consider using dosubl() makes the code unnecessarily complicated for this use case. Given that this code gets likely run isolated in its own session even all these wc... libraries are imho about cosmetics. If you really need to get rid of them then I'd go for a libname function to clean them out. Below code will do "the right thing" under the assumption that there is no pre-existing wc.... libref.

%let root_path=c:\temp;
filename filelist pipe "dir /B/S &root_path.\*.sas7bdat";
data work.failed_SASTable_open;
  length pathAndSASFile $1000 msg $256;
  keep pathAndSASFile msg;

  infile filelist truncover;
  input pathAndSASFile $1000.;
  _did=open(strip(pathAndSASFile),,,'f');
  if _did =0 then 
    do;
      msg=sysmsg();
      output;
    end;
  else
    do;
      _rc=close(_did);
      _n+1;
      _rc=libname(cats('wc',put(_n,z6.)));
    end;
run;
filename filelist clear;

Classroom Training Available!