I saw this post—Using FILENAME ZIP to unzip and read data files in SAS—that unzips a file inside a ZIP file as follows.
/* identify a temp folder in the WORK directory */
filename xl "%sysfunc(getoption(work))/sas_tech_talks_15.xlsx" ;
/* hat tip: "data _null_" on SAS-L */
data _null_;
/* using member syntax here */
infile inzip(sas_tech_talks_15.xlsx)
lrecl=256 recfm=F length=length eof=eof unbuf;
file xl lrecl=256 recfm=N;
input;
put _infile_ $varying256. length;
return;
eof:
stop;
run;
proc import datafile=xl dbms=xlsx out=confirmed replace;
sheet=confirmed;
run;
and I found that the following minimized code also works.
data _null_;
infile zipfile(member) recfm=f;
file target recfm=n;
input;
put _infile_;
run;
(1) I use an END= option sometimes, but don't use E:STOP; (with EOF=) and RETURN; in most cases. I found that the code above works as well without EOF, RETURN, STOP statements, but what are their functions here? And are they necessary to prevent potential errors?
(2) Also I hardly use RECFM=F and RECFM=N in other situations than unzipping, but I found that the code doesn't work without these RECFM statements. What are their exact functions in this context?
Thanks for your help.
Let's talk about the second issue first as it is the more useful. Normally SAS will read and write text files as varying length lines. Where the end of the line is marked by an end of line character. This normally a linefeed (aka LF aka '0A'x) on Unix and a carriage return (aka CR aka '0D'x) and linefeed pair on Windows. (See the TERMSTR= option on the INFILE or FILE statement).
But an XLSX file is not a text file and does not have lines. You do not want to interpret bytes that contain '0A'x as a linefeed or those that contain '0D'x as a carriage return. By telling SAS to use RECFM=F you are telling it to instead read/write the file as records (that do not have any end of line or "separator" characters between them). So if you ask it to read a file using RECFM=F and LRECL=256 it will read the first 256 bytes. then the next 256 bytes etc. The RECFM=N is more general in that it does not read a fixed number of bytes but it also does not attempt to interpret anything as being end of line or record separators. Which to use probably depends on combination of what makes sense for you file (was it really created using fixed length records?) and which performs better.
What is the difference between the END= and EOF= options? The END= option lets you name a variable that will set to true (1) when you have reached the end of file. The EOF= option lets you give it the name of statement label where program control should jump when you reach the end of the file. For most uses of the INFILE statement you don't need either. In that case what happens is when your INPUT statement tries to read past the end of the file the data step stops right there. You probably don't need to use either for this application since you don't want to do anything other than stop when you hit the end of the file.
What is a RETURN statement? It is like the end of a subroutine call. If there was no subroutine call then it will just end this iteration (sort of like returning to the top of the data step). Having the return statement in that data step prevents you from running the STOP statement on every iteration of the data step. The only way to get to the STOP statement is to jump to the label EOF. That is what the EOF=EOF option on the INFILE statement is telling SAS to do when the INPUT statement reads the end of the file.
Let's talk about the second issue first as it is the more useful. Normally SAS will read and write text files as varying length lines. Where the end of the line is marked by an end of line character. This normally a linefeed (aka LF aka '0A'x) on Unix and a carriage return (aka CR aka '0D'x) and linefeed pair on Windows. (See the TERMSTR= option on the INFILE or FILE statement).
But an XLSX file is not a text file and does not have lines. You do not want to interpret bytes that contain '0A'x as a linefeed or those that contain '0D'x as a carriage return. By telling SAS to use RECFM=F you are telling it to instead read/write the file as records (that do not have any end of line or "separator" characters between them). So if you ask it to read a file using RECFM=F and LRECL=256 it will read the first 256 bytes. then the next 256 bytes etc. The RECFM=N is more general in that it does not read a fixed number of bytes but it also does not attempt to interpret anything as being end of line or record separators. Which to use probably depends on combination of what makes sense for you file (was it really created using fixed length records?) and which performs better.
What is the difference between the END= and EOF= options? The END= option lets you name a variable that will set to true (1) when you have reached the end of file. The EOF= option lets you give it the name of statement label where program control should jump when you reach the end of the file. For most uses of the INFILE statement you don't need either. In that case what happens is when your INPUT statement tries to read past the end of the file the data step stops right there. You probably don't need to use either for this application since you don't want to do anything other than stop when you hit the end of the file.
What is a RETURN statement? It is like the end of a subroutine call. If there was no subroutine call then it will just end this iteration (sort of like returning to the top of the data step). Having the return statement in that data step prevents you from running the STOP statement on every iteration of the data step. The only way to get to the STOP statement is to jump to the label EOF. That is what the EOF=EOF option on the INFILE statement is telling SAS to do when the INPUT statement reads the end of the file.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.