Solved: Read several text files into a data set in one data step

paul_e · Posted 01-18-2024 08:53 AM

Hello, I'm trying to read in a directory of text files into a data set that would hold all file names in one variable and the entire content of the respective text file in another variable. Let's consider this example:

1. A directory on my server /data/textfiles/ has these contents:

textfile1.txt

textfile2.txt

textfile3.txt

2. I would like to create a dataset that looks like this:

	fname	content
1	textfile1.txt	This is the entire text in this file. Line breaks might be deleted or replaced with special characters.
2	textfile2.txt	This is the entire text in this file. Line breaks might be deleted or replaced with special characters.
3	textfile3.txt	This is the entire text in this file. Line breaks might be deleted or replaced with special characters.

I've tried to program this with dread(), fread(), fget() and so on but haven't been successful.

%let directory=/data/textfiles/;
data files;
     error_dir = filename(fref,"&directory");
     dir_id = dopen(fref);
     do i = 1 to dnum(dir_id);  
       fname = dread(dir_id,i);
       fpath = cat("&directory./",fname);
       error_file = filename("thefile",fpath);
       file_id = fopen("thefile");
       fread_error = fread(file_id);
       fget_error = fget(file_id,content);
       fclose_error = fclose(file_id);
       output;
     end;
     dclose_error = dclose(dir_id);
     keep fname content;
run;

However, I only get the first few characters of each file. My impression is that it is always the first line, i.e. line breaks are treated as separators and fget() only takes the first column from each opened file. The documentation for fget() is pretty thin, and I don't see how to change the way data are written to the data set from the file.

ballardw · Posted 01-19-2024 11:59 AM

@paul_e wrote:

The use case here is that I'd like to handle code files within SAS, for instance isolate single data steps in the code, which is way more difficult if the code is stored in separate rows. But I can see that this is probably not feasible with the variable length limit anyway. Thanks for your answers!

Still not a clear description of what "the entire content of the respective text file" would be BUT it would be much harder to even determine what a single Data step or other proc would be in such a mess.

If your code is "reasonably structured", meaning that a data step starts on a line with Data and ends with a line consisting of Run; (or a label and run;), and a procedure starts with Proc and ends with Run;, then read the file line by line and add a line number variable. It would then be easy to use a data step to extract a data step or procedure, or to add a flag variable to indicate related lines.

For example (dummy code):

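data mycodefiles;
  length readfile fname $256;
  /* add other INFILE options, such as EOV=, as needed */
  infile "path/*.sas" filename=readfile truncover;
  input line $100.;   /* or what seems likely as your longest code line */
  fname = readfile;   /* the FILENAME= variable itself is not written to the data set */
  retain codegroup;
  /* INDEXW returns 1 when PROC starts in column 1, so test for >0 */
  if indexw(lowcase(line),'proc')>0 or strip(lowcase(line))=: 'data' then codegroup+1;
run;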

Details for handling comments and such are needed, and individual search terms may be required.

Caution: this sort of "find code" is likely inappropriate for MACRO definitions.

ballardw · Posted 01-18-2024 11:53 AM

What exactly do you mean by " entire content of the respective text file"? How much text do you actually expect in the entire content? SAS variables are limited in size.

Fread is going to treat file line delimiters, such as line feed or carriage return depending on the operating system, as end of record. Which operating system created the text files? You may be able to "trick" SAS into treating some line delimiters as not being delimiters, but that is very file dependent. What would be so wrong about having multiple observations for each file as long as all the text is there?

What exactly do you expect to do with the resulting data set? That much text in a single variable seems like you may be looking at something more like the SAS Enterprise Miner for text analysis than basic data step approaches.
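To illustrate the "multiple observations for each file" idea with the same file functions used in the question, here is a minimal sketch. It assumes a single file at the path from the example; the data set name file_lines and the variable lineno are made up for illustration. It also assumes that passing an explicit length to FGET returns the characters of the current record up to that length, rather than a single blank-delimited token, so it is worth checking against the FGET documentation for your SAS release.

```sas
data file_lines;
  length line $32767;
  /* assign a fileref to one file (path taken from the example in the question) */
  rc = filename("thefile", "/data/textfiles/textfile1.txt");
  /* open for input with an explicit record length of 32767 */
  file_id = fopen("thefile", "I", 32767);
  if file_id > 0 then do;
    /* FREAD returns 0 while another record could be loaded into the buffer */
    do while (fread(file_id) = 0);
      lineno + 1;
      call missing(line);                /* avoid carrying text over from the previous record */
      rc = fget(file_id, line, 32767);   /* request up to 32767 characters of the record */
      output;                            /* one observation per record */
    end;
    rc = fclose(file_id);
  end;
  rc = filename("thefile");              /* clear the fileref */
  keep lineno line;
run;
```

Each record of the file becomes one observation, so nothing is lost to the end-of-record handling; the same loop could be placed inside the DREAD loop from the question to cover every file in the directory.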

Tom · Posted 01-18-2024 01:12 PM

If you want to read the file as BINARY instead of TEXT then change the FILENAME() function call to set attributes. You might try RECFM=N.

error_file_rc = filename("thefile",fpath,,'RECFM=N');

Or perhaps RECFM=F and LRECL=32767 since that is the maximum number of bytes you can store in a single variable.

error_file_rc = filename("thefile",fpath,,'RECFM=F LRECL=32767');

Or change the FOPEN() function call:

file_id = fopen("thefile",'I',32767,'B');

Tom · Posted 01-18-2024 01:18 PM

If you want to read all of the files in a directory there is no need to get so complicated.

Basic INFILE/INPUT statements will do that.

Since you say they are text files then read them as LINES.

data want;
  length fileno 8 fname filename $200 line 8 content $32767;
  infile '/data/textfiles/*' filename=filename truncover;
  input @;
  fname = scan(filename,-1,'/');
  if fname ne lag(fname) then do; 
     fileno+1; line=0;
  end;
  line+1;
  input content $char32767. ;
run;
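If a single observation per file is still wanted afterwards, the lines can be collapsed in a second step. The following is only a sketch under stated assumptions: the data set lines and its contents are invented sample data in the same one-row-per-line shape, want_wide is a hypothetical output name, and line breaks are replaced by a single blank, as the question allows.

```sas
/* invented sample data: one observation per line of each file */
data lines;
  infile datalines truncover;
  length fileno 8 fname $200 text $100;
  input fileno fname text $char100.;
  datalines;
1 textfile1.txt first line of file one
1 textfile1.txt second line of file one
2 textfile2.txt only line of file two
;
run;

/* collapse to one observation per file */
data want_wide;
  length content $32767;
  retain content;
  set lines;
  by fileno;
  if first.fileno then content = ' ';   /* start fresh for each file */
  /* CATX trims leading and trailing blanks, so indentation is not preserved */
  content = catx(' ', content, text);
  if last.fileno then output;
  keep fileno fname content;
run;
```

The 32767-byte limit on character variables still applies to content, so this only works for files whose combined text fits within that length.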

SASKiwi · Posted 01-18-2024 02:06 PM

What are you intending to use this data set for? If we understood your complete use case perhaps we could suggest a better solution.

paul_e · Posted 01-19-2024 08:33 AM

The use case here is that I'd like to handle code files within SAS, for instance isolate single data steps in the code, which is way more difficult if the code is stored in separate rows. But I can see that this is probably not feasible with the variable length limit anyway. Thanks for your answers!

SASKiwi · Posted 01-20-2024 05:45 PM

I agree with @ballardw that your use case of "handle code files" and isolate DATA steps within SAS is still unclear. What would you do with an isolated DATA step? I find SAS macros a good way of isolating common functionality in SAS so it can be easily repeated. An example of this would be importing or exporting CSV files.

paul_e · Posted 01-23-2024 06:46 AM

The purpose of this is code analysis. And to process the code I first need it in a dataset. But I settled on reading the separate lines in separate observations which is way easier and actually has its benefits.
