Re: Append All TXT Files in a Folder to a Dataset

BCNAV · Posted 02-05-2018 01:06 PM

I'm thinking this is easy....but for me not so much

If I have a folder of .txt files that have the same variables and delimiters, I would like to append them all into one SAS dataset. Is there a way to pattern match (*.txt) and use proc append after reading in?

E.g.,

Text files:

1.txt

2.txt

3.txt...

n.txt

read all *.txt in and then append them all.

Cheers!

Reeza · Posted 02-05-2018 01:17 PM

https://communities.sas.com/t5/SAS-Communities-Library/How-do-I-write-a-macro-to-import-multiple-tex...

error_prone · Posted 02-06-2018 02:47 AM

You can use a wildcard in the infile-statement:

data all_files;
infile "folder/*.txt" further options;
Input ....;
Run;

BCNAV · Posted 02-06-2018 10:18 AM

I have this nearly figured out. I can get all the data in, the only issue is that every time a new txt file is read in (even the first one), a blank line is inserted. Here is the code thus far; I am not including the format coding for the variables nor all the variables as it would take too much space:

* Use wildcard in input;
	infile "H:\2007\*.txt"  FIRSTOBS=1 eov=eov missover filename=filename ENCODING="WLATIN1" DLM='09'x TRUNCOVER DSD;

	* Input first record and hold line;
	input@;
 
	* Check if this is the first record or the first record in a new file;
	* If it is, replace the filename with the new file name and move to next line;
	if _n_ eq 1 or eov then do;
		txt_file_name = scan(filename, -1, "\");
		eov=0;
	end;
	
	else input
		FlightId             : $CHAR25.
                AirlineCode      : $CHAR3.
                FlightNumber   : $CHAR7.
                SchedDepApt   : $CHAR4.
                SchedArrApt     : $CHAR4.

                many other variables
;
run;

If I can get rid of the blank line that starts each input of the txt files that would get it all!

thx

Reeza · Posted 02-06-2018 11:20 AM

Do your files have a header row? Where is the ‘empty line’ coming from?

BCNAV · Posted 02-06-2018 11:35 AM

The first line is a header line with the variable names in it. I had thought FIRSTOBS=2 would fix it, but it did not, hence I chose FIRSTOBS=1 to test to see if it made a difference. It does. FIRSTOBS=2 will cause it to read the second valid record, which is not good. Hence I am back to FIRSTOBS=1.

The first row in every text file is the header row with the name of the variables. It should be ignored. If I use SAS EG to read it in, it uses FIRSTOBS=2 and this ignores the header row properly. EG uses:

INFILE 'H:\2007\NavCanReport_20171002.txt'
        LRECL=32767
        FIRSTOBS=2
        ENCODING="WLATIN1"
        DLM='09'x
        MISSOVER
        DSD ;
    INPUT
        FlightId         : $CHAR25.

lots of vars

This works fine. Obviously I would not know LRECL when grabbing a bunch of text files. Seems like there is something wrong with the code I am using here as it seems to always add a blank line upon reading each file. There are no blank lines in the txt files. Likely my:

infile "H:\2007\*.txt"  FIRSTOBS=2 eov=eov missover filename=filename ENCODING="WLATIN1" DLM='09'x TRUNCOVER DSD;

	* Input first record and hold line;
	input@;
 
	* Check if this is the first record or the first record in a new file;
	* If it is, replace the filename with the new file name and move to next line;
	if _n_ eq 1 or eov then do;
		txt_file_name = scan(filename, -1, "\");
		eov=0;
	end;
	
	else input

needs some fixing, but this is just code from other posts here. Perhaps the "move to next line" stuff above? That implies to keep the first row, which I do not want to do as it is a header of variable names.

Reeza · Posted 02-06-2018 12:09 PM

No, the code is designed to account for a header row. Follow it exactly. Do you have the two INPUT statements?

Reeza · Posted 02-06-2018 12:11 PM

You have both MISSOVER and TRUNCOVER, pick one.

Remove the FIRSTOBS and other options and see if that works.

BCNAV · Posted 02-06-2018 12:35 PM

@Reeza

The other options I am using are needed as SAS will crash with error codes. They are options that EG adds when manually adding the txt files (we could have hundreds hence the need for some code to bring this in no matter the number). I also have two input statements just as listed before. The blank initial line is still there for some reason. All the data flows properly after that and for each file. Just a blank line is inserted before each new file is read.

error_prone · Posted 02-06-2018 12:28 PM

The option firstobs affects the first file only. Just remove it. Use EOV=newFile in infile statement and
if not newFile;
afterwards.

BCNAV · Posted 02-06-2018 12:38 PM

@error_prone

What exactly do you mean...I have:

data all_data;
     length
              specify length of all vars
     format
              specify format of all vars
     informat
              commands

*make sure variables to store file name are long enough;
	length filename txt_file_name $256;

	*keep file name from record to record;
	retain txt_file_name;

 	* Use wildcard in input;
	infile "H:\2007\*.txt" eov=eov filename=filename ENCODING="WLATIN1" DLM='09'x MISSOVER DSD;

	* Input first record and hold line;
	input@;
 
	* Check if this is the first record or the first record in a new file;
	* If it is, replace the filename with the new file name and move to next line;
	if _n_ eq 1 or eov then do;
		txt_file_name = scan(filename, -1, "\");
		eov=0;
	end;
	
	else input
		FlightId         : $CHAR25.
        AirlineCode      : $CHAR3.
        FlightNumber     : $CHAR7.
        SchedDepApt      : $CHAR4.

many vars to the end;

How would I implement your idea?

BCNAV · Posted 02-08-2018 12:56 PM

So...an update. None of the programming listed prior worked for me...always problems with the headers. What did work perfectly is:

infile "\\NCRFP4\TAF\Traffic Forecasting\OAG\Cancellation Data\NavCan\*.txt" eov=eov ENCODING="WLATIN1" DLM='09'x MISSOVER DSD firstobs=2;
	input @;
	if eov then input;
	
	input
              var1
              var 2, for many variables
;
eov=0;
run;

Not sure why the others did not.

Tom · Posted 02-08-2018 10:40 PM

Your problem is including the resetting of the EOV inside of the THEN clause. SAS sets it to 1 when it starts a new file, but it never sets it back to 0.

Note that the code that PROC IMPORT and EG Import function generates is not really that good. Your basic structure of first defining your variables using LENGTH statement is good, but :

Do not both attach permanent INFORMATs with an INFORMAT statement and also specify an informat in the INPUT statement. Unless you really really want to read this file using a different informat than you want attached to the variable.

SAS does NOT need to have either INFORMATs or FORMATS defined for most variables. SAS already knows how to read and write most numbers and character strings. Main exception are date/time/datetime values. You might also need to use COMMA informat if the text file is a report that includes commas and dollar signs in the values of numbers.

The main advantage of using the $CHAR informat (and format) is to preserve leading spaces in the values. Do really WANT to preserve leading spaces in the values of your character variables? If so I am not sure that you can if you are using the DSD option to read.

Classroom Training Available!