franriv
Obsidian | Level 7
    Directory: G:\FIN Credit Risk\Management


Mode                LastWriteTime         Length Name                                                                                
----                -------------         ------ ----                                                                                
da----         7/8/2020  11:30 AM                Auditorias                                                                          
da----        2/22/2021  12:37 PM                Politicas y procedimientos                                                          


    Directory: G:\FIN Credit Risk\Management\Auditorias


Mode                LastWriteTime         Length Name                                                                                
----                -------------         ------ ----                                                                                
d-----        4/19/2017  11:42 AM                autorizaciones                                                                      

I have a file with the text above called management.dir.test.txt. I have the exact same file zipped in management.dir.test.txt.zip.

 

When I run:

filename tt zip "&franriv/management.dir.test.txt.zip" member="management.dir.test.txt";

data testing;
	infile tt;
	input;
	length L $500;
	L=_infile_;

	retain dir;

	aa=find(L, "D");
	if index(L, "Directory")=5
		then dir=substr(L, 17);
	
	bb=find(L, 'Mode');
	if index(L, 'Mode') ne 1 and substr(L, 50, 4) ne '----';
run;

I get this:

zip1_2601.png

Notice that (1) the text has weird characters and (2) the first "D" was incorrectly located in the 10th position.

 

But when I run:

data testing;
	infile "&franriv/management.dir.test.txt";
	input;
	length L $500;
	L=_infile_;

	retain dir;

	aa=find(L, "D");
	if index(L, "Directory")=5
		then dir=substr(L, 17);
	
	bb=find(L, 'Mode');
	if index(L, 'Mode') ne 1 and substr(L, 50, 4) ne '----';
run;

I get:

zip2_2601.png

Notice that it properly located the first 'D' in the 5th position.

 

I already tried combinations of TERMSTR=CRLF, RECFM=N, MISSOVER and TRUNCOVER, but I can't figure out why the first code does not work.

 

I constantly read in *.zip files in data steps through filename zip. I could read this specific file outside the zip, but I need to understand what's going on (maybe even fix it) if I am to trust filename zip in the future (or otherwise be prepared for the possibility that it could fail).

 

Thanks!

Accepted Solution

Tom
Super User

That is called the BOM (byte order mark).

Yes, the ZIP engine does NOT process the BOM.

You can either handle it yourself by skipping the first three bytes (perhaps checking them first):

data _null_;
  infile 'c:\downloads\bom.zip' zip member='bom.txt'  ;
  if _n_=1 then do;
     input @;
     if 'EFBBBF'x=substrn(_infile_,1,3) then _infile_=substrn(_infile_,4);
  end;
  input;
  list;
  put _infile_ $hex12.;
  stop;
run;

Results:

1487  data _null_;
1488    infile 'c:\downloads\bom.zip' zip member='bom.txt' encoding='any' ;
1489    if _n_=1 then do;
1490       input @;
1491       if 'EFBBBF'x=substr(_infile_,1,3) then _infile_=substr(_infile_,4);
1492    end;
1493    input;
1494    list;
1495    put _infile_ $hex12.;
1496    stop;
1497  run;

NOTE: The infile 'c:\downloads\bom.zip' is:
      Filename=c:\downloads\bom.zip,
      Member Name=bom.txt,Size=6,Compressed Size=6,
      CRC-32=970C5C25,Date/Time=01-26-2022 22:27:54

787878
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1         xxx 3
NOTE: 1 record was read from the infile 'c:\downloads\bom.zip'.
      The minimum record length was 6.
      The maximum record length was 6.

Or copy the file to a physical file and read that file. Then SAS will detect the BOM and not treat it as part of the real content of the file.

filename from zip 'c:\downloads\bom.zip' member='bom.txt' ;
filename to temp;
%let rc=%sysfunc(fcopy(from,to));

data _null_;
  infile to;
  input;
  put _infile_ $hex12.;
  list;
  stop;
run;

Results:

1510  filename from zip 'c:\downloads\bom.zip' member='bom.txt' ;
1511  filename to temp;
1512  %let rc=%sysfunc(fcopy(from,to));
1513
1514  data _null_;
1515    infile to;
1516    input;
1517    put _infile_ $hex12.;
1518    list;
1519    stop;
1520  run;

NOTE: A byte-order mark in the file "...\#LN00086"
      (for fileref "TO") indicates that the data is encoded in "utf-8".  This encoding will be used to process the file.
NOTE: The infile TO is:
      Filename=...\#LN00086,
      RECFM=V,LRECL=131068,File Size (bytes)=8,
      Last Modified=26Jan2022:22:42:48,
      Create Time=26Jan2022:22:42:48

787878
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1         xxx 3
NOTE: 1 record was read from the infile TO.
      The minimum record length was 3.
      The maximum record length was 3.
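
Applied to the file from the question, the copy approach might look like the sketch below. It is untested against the original data: the filerefs `src` and `tmp` are names picked for illustration, and `&franriv` is assumed to resolve to the folder that holds the zip, as in the question.

```sas
/* Assumption: &franriv points at the folder containing the zip, as in the question. */
filename src zip "&franriv/management.dir.test.txt.zip" member="management.dir.test.txt";
filename tmp temp;

/* FCOPY returns 0 on success; SYSMSG() returns the message text otherwise. */
%let rc=%sysfunc(fcopy(src,tmp));
%put NOTE: fcopy rc=&rc %sysfunc(sysmsg());

/* Same logic as the question, now reading the physical copy so SAS can detect the BOM. */
data testing;
	infile tmp;
	input;
	length L dir $500;
	retain dir;
	L=_infile_;

	aa=find(L, "D");
	if index(L, "Directory")=5
		then dir=substr(L, 17);

	if index(L, 'Mode') ne 1 and substr(L, 50, 4) ne '----';
run;

filename tmp clear;
filename src clear;
```

If the BOM is detected on the copy, the log should show the same "byte-order mark" NOTE as in the results above.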

 

franriv
Obsidian | Level 7
Thanks! I'll look further into the BOM.
The BOM explains the weird character in the first line read, but I still don't get why subsequent records have a problem too.
Tom
Super User

@franriv wrote:
Thanks! I'll look further into the BOM.
The BOM explains the weird character in the first line read, but I still don't get why subsequent records have a problem too.

Probably because you are reading them with the wrong encoding. If the file really is UTF-8 encoded, it may contain characters that take more than one byte. If your current SAS session uses a single-byte encoding, like WLATIN1 or LATIN1, then each of those will look like multiple characters instead of one.

 

Try changing the ENCODING= option on the INFILE statement. Most likely you want to set it to UTF-8. I'm not sure how that will affect the interpretation of the three-byte BOM. Try it and find out.
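
A minimal sketch of that experiment, assuming the fileref and paths from the question. It is untested: dumping the first record in hex is just one way to see what the three BOM bytes turn into once ENCODING= is set.

```sas
/* Assumption: same zip and member as in the question. */
filename tt zip "&franriv/management.dir.test.txt.zip" member="management.dir.test.txt";

data testing;
	infile tt encoding='utf-8';
	input;
	length L $500;
	L=_infile_;

	/* Show the leading bytes of record 1 in the log to see how the BOM was handled. */
	if _n_=1 then put L $hex12.;

	/* Position of the first "D", to compare with the unzipped read. */
	aa=find(L, "D");
run;
```

If the hex dump of the first record still shows bytes ahead of the expected content, the BOM check on the first record is still needed on top of ENCODING=.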
