    Directory: G:\FIN Credit Risk\Management
Mode                LastWriteTime         Length Name
----                -------------         ------ ----
da----         7/8/2020  11:30 AM                Auditorias
da----        2/22/2021  12:37 PM                Politicas y procedimientos

    Directory: G:\FIN Credit Risk\Management\Auditorias
Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----        4/19/2017  11:42 AM                autorizaciones
I have a file with the text above called management.dir.test.txt. I have the exact same file zipped in management.dir.test.txt.zip.
When I run:
filename tt zip "&franriv/management.dir.test.txt.zip" member="management.dir.test.txt";
data testing;
  infile tt;
  input;
  length L $500;
  L=_infile_;
  retain dir;
  aa=find(L, "D");
  if index(L, "Directory")=5 then dir=substr(L, 17);
  bb=find(L, 'Mode');
  if index(L, 'Mode') ne 1 and substr(L, 50, 4) ne '----';
run;
I get this:
Notice that (1) the output has weird characters and (2) the FIND function incorrectly located the first "D" in the 10th position.
But when I run:
data testing;
  infile "&franriv/management.dir.test.txt";
  input;
  length L $500;
  L=_infile_;
  retain dir;
  aa=find(L, "D");
  if index(L, "Directory")=5 then dir=substr(L, 17);
  bb=find(L, 'Mode');
  if index(L, 'Mode') ne 1 and substr(L, 50, 4) ne '----';
run;
I get:
Notice that it properly located the first 'D' in the 5th position.
I already tried combinations of TERMSTR=CRLF, RECFM=N, MISSOVER and TRUNCOVER, but I can't figure out why the first code does not work.
I constantly read in *.zip files in data steps through filename zip. I could read this specific file outside the zip, but I need to understand what's going on (maybe even fix it) if I am to trust filename zip in the future (or otherwise be prepared for the possibility that it could fail).
Thanks!
That is called the BOM (byte order mark).
Yes, the ZIP engine does NOT process the BOM.
You can either handle it yourself by skipping the first three bytes (perhaps after checking them first):
data _null_;
  infile 'c:\downloads\bom.zip' zip member='bom.txt' ;
  if _n_=1 then do;
    input @;
    if 'EFBBBF'x=substrn(_infile_,1,3) then _infile_=substrn(_infile_,4);
  end;
  input;
  list;
  put _infile_ $hex12.;
  stop;
run;
Results:
1487  data _null_;
1488    infile 'c:\downloads\bom.zip' zip member='bom.txt' encoding='any' ;
1489    if _n_=1 then do;
1490      input @;
1491      if 'EFBBBF'x=substr(_infile_,1,3) then _infile_=substr(_infile_,4);
1492    end;
1493    input;
1494    list;
1495    put _infile_ $hex12.;
1496    stop;
1497  run;

NOTE: The infile 'c:\downloads\bom.zip' is:
      Filename=c:\downloads\bom.zip,
      Member Name=bom.txt,Size=6,Compressed Size=6,
      CRC-32=970C5C25,Date/Time=01-26-2022 22:27:54

787878
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1         xxx 3
NOTE: 1 record was read from the infile 'c:\downloads\bom.zip'.
      The minimum record length was 6.
      The maximum record length was 6.
Or copy the file to a physical file and read that file. Then SAS will detect the BOM and not treat it as part of the real content of the file.
filename from zip 'c:\downloads\bom.zip' member='bom.txt' ;
filename to temp;
%let rc=%sysfunc(fcopy(from,to));
data _null_;
  infile to;
  input;
  put _infile_ $hex12.;
  list;
  stop;
run;
Results:
1510  filename from zip 'c:\downloads\bom.zip' member='bom.txt' ;
1511  filename to temp;
1512  %let rc=%sysfunc(fcopy(from,to));
1513
1514  data _null_;
1515    infile to;
1516    input;
1517    put _infile_ $hex12.;
1518    list;
1519    stop;
1520  run;

NOTE: A byte-order mark in the file "...\#LN00086" (for fileref "TO") indicates that the data is encoded in "utf-8". This encoding will be used to process the file.
NOTE: The infile TO is:
      Filename=...\#LN00086,
      RECFM=V,LRECL=131068,File Size (bytes)=8,
      Last Modified=26Jan2022:22:42:48,
      Create Time=26Jan2022:22:42:48

787878
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1         xxx 3
NOTE: 1 record was read from the infile TO.
      The minimum record length was 3.
      The maximum record length was 3.
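Both demonstrations stop after the first record. As a rough sketch only, the BOM check from the first approach could be folded into a step that reads the whole member, as below. It assumes the member is UTF-8 with a three-byte BOM ('EFBBBF'x) and reuses the tt fileref and &franriv macro variable from the question. It only removes the stray bytes at the start of record 1 and does not transcode anything else in the file.

```sas
filename tt zip "&franriv/management.dir.test.txt.zip" member="management.dir.test.txt";

data testing;
  infile tt;
  input;
  length L $500;
  L = _infile_;
  /* Assumption: a UTF-8 BOM, if present, occupies the first three bytes of record 1 */
  if _n_ = 1 and substr(L, 1, 3) = 'EFBBBF'x then L = substr(L, 4);
  /* Position of the first "D" once any BOM bytes have been dropped */
  aa = find(L, "D");
run;
```

If the file turns out to use a different encoding, the three-byte test simply never matches and L is left unchanged.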
@franriv wrote:
Thanks! I'll look further into the BOM.
The BOM explains the weird character in the first line read, but I still don't get why subsequent records have problems too.
Probably because you are treating them with the wrong encoding. If the file really is using UTF-8 encoding, there might be some "characters" in it that require more than one byte in the line. If your current SAS session is using a single-byte encoding, like WLATIN1 or LATIN1, then those will look like multiple characters instead of one.
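To illustrate the byte-versus-character point, here is a small sketch using a hex literal, so it does not depend on how the program file itself is encoded. The variable names are only for illustration. 'C3AD'x is the UTF-8 encoding of one character, U+00ED (i with an acute accent).

```sas
data _null_;
  /* Two bytes that together encode a single character in UTF-8 */
  utf8_bytes = 'C3AD'x;
  /* LENGTH counts bytes, so this is 2 regardless of session encoding */
  n_bytes = length(utf8_bytes);
  /* KLENGTH counts characters as the session encoding sees them, */
  /* so its result depends on whether the session is UTF-8 or single-byte */
  n_chars = klength(utf8_bytes);
  put n_bytes= n_chars=;
  put utf8_bytes= $hex4.;
run;
```

In a single-byte session the two bytes are treated as two separate characters, which is why byte-based positions from FIND or INDEX can shift on any line that contains such characters.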
Try changing the ENCODING= option on the INFILE statement. Most likely you want to set it to UTF-8. I'm not sure how that will affect the interpretation of the three-byte BOM. Try it and find out.
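A minimal way to run that experiment, reusing the bom.zip test member from the earlier examples, might look like the sketch below. The only change is encoding='utf-8' on the INFILE statement. What the hex dump shows for the BOM bytes is exactly the open question, so no expected output is given here.

```sas
data _null_;
  /* Same test member as before, now read with an explicit UTF-8 encoding */
  infile 'c:\downloads\bom.zip' zip member='bom.txt' encoding='utf-8';
  input;
  /* Inspect the record as SAS sees it after any transcoding */
  list;
  put _infile_ $hex12.;
  stop;
run;
```

Comparing this hex dump with the earlier runs should show whether the BOM survives, gets transcoded, or is dropped.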