☑ This topic is solved.
Excelsius
Obsidian | Level 7

I have code that works well for looping over text files: the full path to each file (/folder1/folder2/textfile.txt) is saved in a dataset (filepaths) under the variable filepath. All of the text files are read into a single dataset with a single column, where each row corresponds to one line of a text file.

data want;
	set filepaths;
	infile dummy filevar=filepath termstr=lf truncover end=done;
	count=0;
	do until (done);
		input myvar $500.; 
		count + 1;
		if prxmatch("&some_condition", _infile_) then output;
	end;
run;

The issue is that when I try to generalize the code to read files regardless of whether they use CR or LF line breaks, the loop breaks down. This is the INFILE substitution I use, thanks to @Tom's suggestion in an earlier post (it works well for single files):

	infile dummy filevar=filepath recfm=n dlm='0A0D'x dsd end=done;

I think the issue is that END= is not handled well with this method. I get a note in the log about "Unexpected end of file for binary input."

I've tried various approaches, including a version with CALL EXECUTE, but that code was getting rather complicated and generated multiple datasets that then had to be merged back into one. Is there a simple way to fix this loop instead of resorting to a messier workaround with multiple unnecessary datasets?

 

 


14 REPLIES
Tom
Super User

You don't want to use DO UNTIL(), since that will fail on empty files. Instead use DO WHILE(NOT ...).
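Applied to your posted step, that change would look roughly like this (same code, just the loop condition swapped):

data want;
  set filepaths;
  infile dummy filevar=filepath termstr=lf truncover end=done;
  count=0;
  do while (not done);   /* loop only while there is something left to read */
    input myvar $500.;
    count + 1;
    if prxmatch("&some_condition", _infile_) then output;
  end;
run;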

But I suspect that using RECFM=N interferes with the data step's ability to detect the end of the current file before it has already read past it.

 

So instead use an aggregate fileref that you can build from the list of files.

filename code temp;
data _null_;
  set filepaths end=eof;
  file code ;
  if _n_=1 then put 'filename all (' ;
  put ' ' filepath :$quote.; 
  if eof then put ');' ;
run;
%include code;

data want;
  infile all recfm=n dlm='0D0A'x ;
  count+1;
  input myvar :$500.; 
  if prxmatch("&some_condition", _infile_) then output;
run;
Excelsius
Obsidian | Level 7

Thanks Tom. DO WHILE did not make a difference in this case.

I think I understand the logic behind that filename method (I think you meant to write INFILE CODE instead of ALL?).

A couple of issues:

  1. I'm not getting any output when using the data _null_ step, but I can get the INFILE statement to work if I create the FILENAME statement manually by pasting in the individual paths. I'm wondering whether the step is generating the correct syntax. Is there a way to print the contents of CODE to check this? (My untested guess is sketched after this list.)
  2. When using the manual method, I can see that the line counts are off: the count keeps running across all the files instead of restarting at zero for each input file, as it does in my original code with the DO loop. Using something like "if eof then count=0" didn't work, probably for the same reason as before.
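For item 1, my untested guess at dumping the generated file to the log would be something like:

data _null_;
  infile code;     /* the temporary file built by the earlier data _null_ step */
  input;
  put _infile_;    /* echo each generated line to the log */
run;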
Tom
Super User

@Excelsius wrote:

Thanks Tom. DO WHILE did not make a difference in this case.

I think I understand the logic behind that filename method (I think you meant to write INFILE CODE instead of ALL?).

Absolutely NOT. 

The file CODE contains the SAS code generated by the first data step to define the fileref ALL that the next step will read. The %INCLUDE statement then sources the contents of CODE so that the FILENAME statement is executed.
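For example, with two made-up paths the generated CODE file would contain something like:

filename all (
 "/folder1/folder2/file1.txt"
 "/folder1/folder2/file2.txt"
);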

 

If you want to count per file then add the FILENAME= option to the INFILE statement and reset the count when you start a new file.

data want;
  infile all recfm=n dlm='0D0A'x filename=fname;
  count+1;
  input myvar :$500.; 
  if fname ne lag(fname) then count=1;
  if prxmatch("&some_condition", _infile_) then output;
run;
Excelsius
Obsidian | Level 7
After some tweaking, I got the numbers working. Thanks for that tip. I had to adjust some variable lengths that were preventing the file names from resolving correctly.
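Roughly, the kind of change I mean (a sketch only; I'm assuming it is the FILENAME= variable that needs an explicit length so the path is not truncated):

data want;
  length fname $ 500;                    /* assumption: give the FILENAME= variable room for long paths */
  infile all recfm=n dlm='0D0A'x filename=fname;
  count+1;
  input myvar :$500.;
  if fname ne lag(fname) then count=1;   /* restart the count when a new file starts */
  if prxmatch("&some_condition", _infile_) then output;
run;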

And I had missed your %INCLUDE statement; that's why it wasn't making sense.
I think this should work, even though it's essentially two data steps instead of one. Another interesting implementation; I really appreciate your tips and knowledge, Tom. If you aren't already running a blog, you should consider it.
Tom
Super User

@Excelsius wrote:

I think this should work, even though it's essentially two data steps instead of one.

The number of steps required should not be of any concern at all.  Look for a solution that solves the problem, takes a reasonable amount of time to execute and is maintainable.

 

Excelsius
Obsidian | Level 7
Generally I agree, but if the code is longer than it has to be, it can be more work to maintain. In this case, the FILENAME approach is also about 3 times slower than my original single-step DO-loop code. I'll continue looking for ways to optimize this. If anyone knows of a way to make my original loop code work, I'd still be curious to know. Another user suggested the EOF= option for unbuffered data, but I could not find a good description of its usage in the SAS documentation.
Tom
Super User

@Excelsius wrote:
Generally I agree, but if the code is longer than it has to be, it can be more work to maintain. In this case, the FILENAME approach is also about 3 times slower than my original single-step DO-loop code. I'll continue looking for ways to optimize this. If anyone knows of a way to make my original loop code work, I'd still be curious to know. Another user suggested the EOF= option for unbuffered data, but I could not find a good description of its usage in the SAS documentation.

You might try the OPEN=DEFER option on the INFILE statement and see if that improves the speed.
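For example (a sketch only; I have not checked how OPEN=DEFER interacts with the FILENAME= option, so the per-file count is left out here):

data want;
  infile all recfm=n dlm='0D0A'x open=defer;  /* defer opening each file until it is actually read */
  count+1;
  input myvar :$500.;
  if prxmatch("&some_condition", _infile_) then output;
run;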

 

The EOF= option appears to work. 

 

The EOF= option on the INFILE statement names the statement label that control is transferred to when you try to read past the end of the file. So you can use it to jump past the DO loop, so that the data step proceeds to its next iteration and reads the next filename from the dataset with the list of filenames.

data want;
  set filepaths;
  infile dummy filevar=filepath dsd dlm='0D0A'x recfm=n eof=done;
  do count=1 by 1 ;
    input myvar ~:$char500.; 
    if prxmatch("&some_condition", _infile_) then output;
  end;
done:
run;

The first line names the dataset being created. The SET statement reads the next observation from the dataset that holds the list of filenames.

The INFILE statement says to use the FILEPATH variable as the name of the file to read. The file is read with the RECFM=N option and the values are parsed using either CR or LF as the delimiter. The DSD option makes it possible to detect empty lines. The EOF= option names the statement label to jump to when you read past the end of the current file.

The DO loop sets up an infinite loop that increments COUNT once on every pass. The EOF= option is what prevents the loop from actually being infinite: it ends when the current file has been completely read.

The INPUT statement reads the next "word" from the file. The : modifier makes sure list-mode input is used. The ~ modifier makes sure that quotes are NOT removed from around a "word". The $CHAR informat preserves any leading spaces in the "word". Since MYVAR was not previously defined, the width of 500 on the $CHAR informat forces SAS to define MYVAR with a length of 500.

The line that starts with DONE and ends with a colon is the label to jump to when the end of the file is reached. Since it sits right before the end of the data step, the current iteration ends as soon as the file has been completely read.

Excelsius
Obsidian | Level 7

Thanks, this looks like it could possibly work. I have to do some experimenting. My problem was that I could not figure out the correct usage of the EOF= option. A question: do you know where in the SAS documentation EOF= is actually explained, maybe with a couple of examples? I would like to read it. I had looked here, for example, but the laconic explanation of EOF= there is woefully insufficient. I have not been able to find any other sources on this option.

 

data_null__
Jade | Level 19

When the delimiter is a string, shouldn't you use the DLMSTR= INFILE statement option?

If END= is not working, that suggests the file is unbuffered and the EOF= option should be used.

 

I haven't tried to test any of this.

 


@Tom wrote:

You don't want to use DO UNTIL() since that will fail on empty files.  Instead use DO WHILE(NOT ...) .

But I suspect using RECFM=N is messing with the ability of the data step to detect the end of the current file before it has already read past it.

 

So instead use an aggregate fileref that you can build from the list of files.

filename code temp;
data _null_;
  set filepaths end=eof;
  file code ;
  if _n_=1 then put 'filename all (' ;
  put ' ' filepath :$quote.; 
  if eof then put ');' ;
run;
%include code;

data want;
  infile all recfm=n dlm='0D0A'x ;
  count+1;
  input myvar :$500.; 
  if prxmatch("&some_condition", _infile_) then output;
run;

 

Tom
Super User

DLMSTR= requires that the exact string be matched. The request was to match either CR or LF, which is how the DLM= option works: with DLM='0D0A'x each of the two bytes acts as a delimiter on its own, so a lone CR or a lone LF ends a value, whereas DLMSTR='0D0A'x would only split on the exact two-byte CRLF sequence.

ballardw
Super User

The real question is why you have some files with CR and others with LF (and none with CRLF?).

Each OS assumes a specific character ends a record. If the file is not from your OS, then you have to specify the correct terminator, which is not going to work if you are attempting to read multiple files with different terminators.
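For a single file whose origin you know, that just means naming the right terminator (the path here is made up):

data one_file;
  /* TERMSTR= can be LF, CR, or CRLF depending on where the file came from */
  infile "/folder1/folder2/windows_file.txt" termstr=crlf truncover;
  input myvar $500.;
run;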

 

Perhaps use a general system tool to replace CR with LF (or vice versa) so all the files have the same record terminator before reading into SAS.

Excelsius
Obsidian | Level 7
No control over the data files. Some are archival. I agree that another option is to make the line breaks consistent; however, I did not find a simple way to do that within SAS. Such a conversion would still need an INFILE-based step (for example with TRANSLATE), so the solution implemented in this thread is already trying to achieve the same goal. Let me know if you're aware of a way to remove the line breaks within the same data step while the file is being read.
Tom
Super User

If the goal is just to find matching lines, why not just use an operating system command like grep?
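For instance, if XCMD is allowed in your session, something along these lines could do the search outside of SAS and read back only the matching lines (the pattern and path are placeholders):

filename hits pipe "grep -E 'some_pattern' /folder1/folder2/*.txt";  /* hypothetical pattern and path */

data want;
  infile hits truncover;
  input matched_line $500.;
run;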

