I have code that works well for looping over text files. The full path to each file (/folder1/folder2/textfile.txt) is stored in a dataset (filepaths) in the variable filepath. All text files are imported into a single dataset with a single column, where each row corresponds to one line of a text file.
data want;
set filepaths;
infile dummy filevar=filepath termstr=lf truncover end=done;
count=0;
do until (done);
input myvar $500.;
count + 1;
if prxmatch("&some_condition", _infile_) then output;
end;
run;
The issue is that when I try to generalize the code to read files whether they have CR or LF line breaks, the loop breaks down. This is the INFILE substitution I use, thanks to @Tom's suggestion in an earlier post (it works well for single files):
infile dummy filevar=filepath recfm=n dlm='0A0D'x dsd end=done;
I think the issue is that END= is not detected reliably with this method. I get a Note in the log about "Unexpected end of file for binary input."
I've tried various approaches, including a version using CALL EXECUTE, but the code was getting rather complicated and generated multiple datasets that then had to be merged back into one. Is there a simple way to fix this loop, instead of resorting to a messier workaround with multiple unnecessary datasets?
You don't want to use DO UNTIL() since that will fail on empty files. Instead use DO WHILE(NOT ...) .
But I suspect using RECFM=N is messing with the ability of the data step to detect the end of the current file before it has already read past it.
So instead use an aggregate fileref that you can build from the list of files.
filename code temp;
data _null_;
set filepaths end=eof;
file code ;
if _n_=1 then put 'filename all (' ;
put ' ' filepath :$quote.;
if eof then put ');' ;
run;
%include code;
data want;
infile all recfm=n dlm='0D0A'x ;
count+1;
input myvar :$500.;
if prxmatch("&some_condition", _infile_) then output;
run;
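For reference, the DO WHILE(NOT ...) form mentioned above, applied to the original single-terminator loop from the question, might look like the sketch below. It is untested and only addresses the empty-file case; it does not by itself fix end-of-file detection with RECFM=N.

```sas
data want;
  set filepaths;
  /* Original INFILE options from the question (LF-terminated files only) */
  infile dummy filevar=filepath termstr=lf truncover end=done;
  count = 0;
  /* Test DONE before the first INPUT, so an empty file is skipped */
  do while (not done);
    input myvar $500.;
    count + 1;
    if prxmatch("&some_condition", _infile_) then output;
  end;
run;
```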
Thanks Tom. DO WHILE did not make a difference in this case.
I think I understand the logic behind that filename method (I think you meant to write INFILE CODE instead of ALL?).
Couple issues:
@Excelsius wrote:
Thanks Tom. DO WHILE did not make a difference in this case.
I think I understand the logic behind that filename method (I think you meant to write INFILE CODE instead of ALL?).
Absolutely NOT.
The file CODE contains the SAS code generated by the first data step to define the fileref ALL that the next step will read. The %INCLUDE statement then sources the contents of CODE so that the FILENAME statement is executed.
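To make this concrete, after the first data step runs, the temporary file CODE should contain something like the following generated statement. The first path is the example from the original question; the second is a hypothetical placeholder standing in for another row of FILEPATHS.

```sas
/* Generated text written to the CODE fileref (illustrative paths) */
filename all (
 "/folder1/folder2/textfile.txt"
 "/folder1/folder2/otherfile.txt"
);
```

Once %INCLUDE CODE has run this statement, ALL is an ordinary aggregate fileref, so INFILE ALL reads the listed files one after another.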
If you want to count per file then add the FILENAME= option to the INFILE statement and reset the count when you start a new file.
data want;
* Declare FNAME long enough to hold a full path ;
length fname $256;
infile all recfm=n dlm='0D0A'x filename=fname;
count+1;
input myvar :$500.;
if fname ne lag(fname) then count=1;
if prxmatch("&some_condition", _infile_) then output;
run;
I think this should work, even though it's essentially two data steps instead of one.
The number of steps required should not be of any concern at all. Look for a solution that solves the problem, takes a reasonable amount of time to execute and is maintainable.
@Excelsius wrote:
Generally I agree, but if the code is longer than it has to be, it can be more work to maintain. In this case, the filename option is also about 3 times slower than my original single step DO LOOP code. I'll continue looking for ways to optimize this. If anyone knows of a way to make my original loop code work, I'd still be curious to know. Another user suggested EOF= option for unbuffered data, but I could not find a good description of its usage in SAS documentation.
You might try the OPEN=DEFER option on the INFILE statement and see if that improves the speed.
The EOF= option appears to work.
The EOF= option on the INFILE statement is used to set the LABEL of the statement that control should be transferred to when you try to read past the end of the file. So you could use that to jump past the DO loop so that it proceeds to the next iteration of the data step and reads the next filename from the dataset with the list of filenames.
data want;
set filepaths;
infile dummy filevar=filepath dsd dlm='0D0A'x recfm=n eof=done;
do count=1 by 1 ;
input myvar ~:$char500.;
if prxmatch("&some_condition", _infile_) then output;
end;
done:
run;
The first line names the dataset being created. The second reads the dataset that holds the list of files to read.
The INFILE statement says to use the FILEPATH variable as the name of the file to be read. It reads the file using the RECFM=N option and parses the strings using either CR or LF as the delimiter. The DSD option makes it possible to detect empty lines. The EOF= option names the statement label to jump to when you read past the end of the current file.
The DO loop sets up an infinite loop that increments COUNT once every time through the loop. The EOF= option is what prevents the loop from actually being infinite. It will end when the current file has been read completely.
The INPUT statement reads the next "word" from the file. The : modifier makes sure to use LIST MODE input. The ~ modifier makes sure that quotes are NOT removed from around a "word". The $CHAR informat preserves the leading spaces in the "word". Since MYVAR was not previously defined, the 500 width on the $CHAR informat will force SAS to guess that MYVAR should be defined with length 500.
The line that starts with DONE and ends with a colon is the label for where to jump when the end of the file is reached. Since it is right before the end of the data step, the current iteration ends when the file has been read completely.
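The same EOF= pattern can be seen in isolation with a single ordinary text file. The sketch below is untested and rests on assumptions: the file name example.txt, the dataset name LINES, the label FINISHED and the width 200 are all hypothetical. Because there is no SET statement here to end the step, a STOP statement follows the label.

```sas
data lines;
  /* Jump to FINISHED when INPUT tries to read past the end of the file */
  infile 'example.txt' truncover eof=finished;
  do count = 1 by 1;          /* no upper bound: EOF= ends the loop */
    input line $char200.;
    output;
  end;
finished:
  stop;                       /* end the step once the file is exhausted */
run;
```

In the multi-file version above, STOP is not needed because each jump to the label simply ends the current iteration, and the SET statement ends the step when the list of files runs out.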
Thanks, this looks like it could work. I have to do some experimenting. My problem was that I could not figure out the correct usage for the EOF= option. A question: do you know where in the SAS documentation EOF= is actually explained, maybe with a couple of examples? I would like to read it. I had looked here for example, but the laconic explanation for EOF= is woefully insufficient there. I have not been able to find any other sources on this option.
To learn about using the EOF= option you should read about
GOTO statement (not sure why the page shows it as GO TO instead)
Also check out
When the delimiter is a string shouldn't you use DLMSTR= infile statement option?
If END= is not working, that suggests the file is unbuffered and the EOF= option should be used.
I haven't tried to test any of this.
DLMSTR= requires that the exact string be matched. The request was to match either CR or LF, which is how the DLM= option works.
The real question is why do you have some files with CR and others with LF (and none with CRLF???).
Each OS assumes a specific character ends a record. If the file is not from your OS then you have to specify the correct terminator. Which is not going to work if attempting to read multiple files with different terminators.
Perhaps use a general system tool to replace CR with LF (or vice versa) so all the files have the same record terminator before reading into SAS.
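If an external tool is not available, the same normalization could in principle be done inside SAS by copying the file one byte at a time. This is a swapped-in alternative to the system tool suggested above, not something tested in this thread. The file names in.txt and out.txt are hypothetical, and the sketch assumes the input uses CR alone as its terminator; a CRLF file would end up with doubled line breaks.

```sas
data _null_;
  /* RECFM=N treats both files as byte streams with no record boundaries */
  infile 'in.txt' recfm=n;
  file 'out.txt' recfm=n;
  input ch $char1.;
  /* Replace each carriage return with a line feed */
  if ch = '0D'x then ch = '0A'x;
  put ch $char1.;
run;
```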
If the goal is just to find matching lines why not just use an operating system command like grep?