How can I simultaneously read in multiple text files each with a diffe...

USCSS_Nostromo · Posted 07-02-2019 01:22 PM

Dear SAS community,

I need to simultaneously read in multiple text files, each of which has a different number of columns, and create a single data set in SAS. Please see the attached example data. The data is space delimited.

All of the files in the folder have the same file ending and suffix. The following code allows me to import all the text files when the total number of columns is known. In the case of my example code, it is 16 columns. Unfortunately, my data no longer has a known number of columns.

I need to create columns and variables that include information included in the file name. I have been successful with this. Please have a look at the code here.

I am operating SAS version 9.4

data MyData;

length filename fname $254;
infile 'C:\RES999_BNR*.TXT' dlm=' ' filename=fname firstobs=2 ;

filename=scan(fname,-2,'./\');
ID =scan(filename,1,'_','m');
Night =scan(filename,2,'_');

length var1-var16 $ 16.;
input var1$ var2-var16;
spectra = dequote(var1);

drop var1;

run;

Reeza · Posted 07-02-2019 01:37 PM

What do you mean when they don't all have the same amount of columns? For the missing columns, are they all at the end or can the order vary as well. Are the columns missing entirely?

How do you read a single file right now?

Tom · Posted 07-02-2019 01:50 PM

The posted code will not work properly if the files have headers. You skip the header line on the first file but not the later files.

It also will not work properly if they don't have headers, since then skipping the first line misses one line of data.

You need to explain how the number of columns varies to see if there is a solution.

USCSS_Nostromo · Posted 07-03-2019 05:33 AM

Hi again,

The data are from EEG recordings of varying lengths. The length of the recording determines how many columns will be present in a given data set. There are always 321 rows of data. I don't need the header with the X prefixes. All subsequent rows and columns I need.

Best,

Ian

Tom · Posted 07-03-2019 08:55 AM

@USCSS_Nostromo wrote:

Hi again,

The data are from EEG recordings of varying lengths. The length of the recording determines how many columns will be present in a given data set. There are always 321 rows of data. I don't need the header with the X prefixes. All subsequent rows and columns I need.

Best,

Ian

You should have included that detail in the original posting. So instead of mixed number of totally different variables you have a varying number of repeats of the same variable. Personally I would read the data into a tall format.

Does the data contain any missing values in the middle of a line of values? If not you can just loop until you read a missing value.

So let's assume the files all have a header row (that you want to ignore) and that also have a first column with some information that identifies the row. So first read that row id value and then loop reading one value at a time from the row.

So something like this:

data want ;
   length fname filename $200 ;
   infile 'myfiles*.txt' dsd truncover filevar=fname ;
   input rowid $ @ ;
   if fname ne lag(fname) then delete;
   filename=fname;
   do col=1 by 1 until(value=.);
     input value @ ;
     if col=1 or value ne . then output;
   end;
run;

If there can be misisng values in the middle of the row then there are other ways to stop the DO loop. You could count the number of values in the row by looking at the automatic variable _INFILE_.

do col=1 to countw(_infile_,',')-1;

If the line length for any row is ever greater than 32K bytes then there are ways using the LENGTH= and COL= options on the INFILE statement to detect when you have reached the end of a line.

USCSS_Nostromo · Posted 07-04-2019 06:21 AM

Hi Tom,

Thank you for your reply.

I ran the code and received this error message. What is your opinion?

13
14 data want ;
15 length fname filename $200 ;
16 infile 'C:\RES*.TXT' dsd
16 ! truncover filevar=fname ;
ERROR: A Physical file reference (i.e. "PHYSICAL FILE REFERENCE" ) or an aggregate file storage
reference (i.e. AGGREGATE(MEMBER) ) reference cannot be used with the FILEVAR= option.
17 input rowid $ @ ;
18 if fname ne lag(fname) then delete;
19 filename=fname;
20 do col=1 by 1 until(value=.);
21 input value @ ;
22 if col=1 or value ne . then output;
23 end;
24 run;

NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 0
observations and 4 variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds

Tom · Posted 07-04-2019 10:10 AM

Need the FILENAME= option instead of the FILEVAR= option.

TomKari · Posted 07-02-2019 02:15 PM

Here's a code snippet that should help you with the issue of getting the variable number of columns into variables. Note that the first data step is just to create some text data to test on.

Tom

data _null_;
	file "C:\some file.txt";
	Rec = "abc 123 456 789 012 345 678 901 123 456 789 012 345 678 901 345"; put Rec;
	Rec = "def 123 456 789 012 345 678 901 123 456 789 012 345 678 901 345"; put Rec;
	Rec = "ghi 123 456 789 012 345 678 901 123 456 789 012 345 678 901 345"; put Rec;
	Rec = "jkl 123 456 789 012 345 678 901 123 456 789 012 345 678 901 345"; put Rec;
run;

data MyData(drop=_:);
	infile C:\some file.txt';
	length _InRec $32767;
	length var01-var50 $ 16.;
	array _RecVars{50} var01-var50;
	input;
	_InRec = _infile_;
	_VarCount = countw(_InRec, " ");

	do _i = 1 to _VarCount;
		_RecVars{_i} = scan(_InRec, _i, " ");
	end;
run;

USCSS_Nostromo · Posted 07-03-2019 05:24 AM

Hi Tom,

Thank you once again for your help.

Your code works for one file that I read in as a test. But I noticed a problem that I was previously unaware of. Some of my data utilizes scientific notation, such as the following 1.55042962215361e-05, whereas other other data in the same file, the majority of the data, are expressed purely numerically.

Once imported, cells are left empty, and some data is scrambled.

Sincerely,

Ian

Tom · Posted 07-03-2019 08:59 AM

SAS will read a value like 1.55042962215361e-05 as a number.

Show the code and an example line of data that it is having trouble with.

Patrick · Posted 07-03-2019 09:06 AM

@USCSS_Nostromo

You can likely get what you need by choosing suitable informats and formats.

data test;
  attrib myvar informat=best32. format=best32.;
  input myvar;
  datalines;
1.55042962215361e-05
1.5
2
;
proc print data=test;
run;

TomKari · Posted 07-04-2019 12:31 PM

I've modified it to input the number using the BEST format. See if this advances your progress.

Tom

data _null_;
	file "C:\Some file.txt";
	Rec = "abc 123 456 789 012 345 678 901 1.55042962215361e-05 456 789 012 345 678 901 345"; put Rec;
	Rec = "def 123 456 789 012 345 678 901 123 456 789 012 345 1.55042962215361e5 901 345"; put Rec;
	Rec = "ghi 123 456 789 -1.55042962215361e-05 345 678 901 123 456 789 012 345 678 901 345"; put Rec;
	Rec = "jkl 123 456 789 012 345 678 901 123 456 -1.55042962215361e05 012 345 678 901 345"; put Rec;
run;

data MyData(drop=_:);
	infile "C:\Some file.txt";
	length _InRec $32767;
	length var01-var50 8;
	array _RecVars{50} var01-var50;
	input;
	_InRec = _infile_;
	_VarCount = countw(_InRec, " ");

	do _i = 1 to _VarCount;
		_RecVars{_i} = input(scan(_InRec, _i, " "), best32.);
	end;
run;

USCSS_Nostromo · Posted 07-09-2019 07:29 AM

Thank you for this. I have a quick question regarding these lines of code:

length var01-var50 8;
array _RecVars{50} var01-var50;

Is it necessary that I specify a finite range of variables in that array? Because each of my files has a different number of variables. I increased to 2000 variables and received this error message:

NOTE: 1 record was read from the infile "P:\PTSD\4. Data\EEG\EEG
revision\power_spectra\txtfiles_with_scoring\Test\RES*.TXT".
The minimum record length was 16196.
The maximum record length was 16196.
NOTE: Mathematical operations could not be performed at the following places. The results of the
operations have been set to missing values.
Each place is given by: (Number of times) at (Line):(Column).
2001 at 63:24
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.MYDATA may be incomplete. When this step was stopped there were 0
observations and 2000 variables.
WARNING: Data set WORK.MYDATA was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.39 seconds
cpu time 0.36 seconds

Do you have any advice on how to proceed?

Tom · Posted 07-09-2019 08:13 AM

Not sure why you are making the array, but if you are concerned it will not be large enough then just make it larger than you expect to need.

You haven't shown enough of the log to see what error you are having.

Also you need to use one of the Insert Code buttons in the editor when pasting in log so that formatting is not removed.

In general notes about missing values will not cause a data step to stop.

I am a little concerned that your min and max record lengths are the same. That is not normal in a delimited text file, but perhaps it is just a side effect of stopping the data step early. Perhaps it only read in one line.

TomKari · Posted 07-09-2019 03:24 PM

If the problem is that there are more than 50 occurrences, a prior step to count the max number of occurrences should fix you up.

Tom

data _null_;
	file "C:\Some file.txt";
	Rec = "987 123 456 789 012 345 678 901 1.55042962215361e-05 456 789 012 345 678 901 345"; put Rec;
	Rec = "654 123 456 789 012 345 678 901 123 456 789 012 345 1.55042962215361e5 901 345"; put Rec;
	Rec = "321 123 456 789 -1.55042962215361e-05 345 678 901 123 456 789 012 345 678 901 345"; put Rec;
	Rec = "098 123 456 789 012 345 678 901 123 456 -1.55042962215361e05 012 345 678 901 345"; put Rec;
run;

data _null_;
	retain _MaxVarCount 0;
	infile "C:\Some file.txt" end=_LastRec;
	input;
	_VarCount = countw(_infile_, " ");

	if _VarCount > _MaxVarCount then
		_MaxVarCount = _VarCount;

	if _LastRec then
		call symput("VarCount", strip(put(_VarCount, best15.)));
run;

data MyData(drop=_:);
	infile "C:\Some file.txt";
	length _InRec $32767;
	length var01-var&VarCount. 8;
	array _RecVars{&VarCount.} var01-var&VarCount.;
	input;
	_InRec = _infile_;
	_VarCount = countw(_InRec, " ");

	do _i = 1 to _VarCount;
		_RecVars{_i} = input(scan(_InRec, _i, " "), best32.);
	end;
run;

How can I simultaneously read in multiple text files each with a different number of columns?

Re: How can I simultaneously read in multiple text files each with a different number of columns?

Re: How can I simultaneously read in multiple text files each with a different number of columns?

Re: How can I simultaneously read in multiple text files each with a different number of columns?

Re: How can I simultaneously read in multiple text files each with a different number of columns?

Re: How can I simultaneously read in multiple text files each with a different number of columns?

Re: How can I simultaneously read in multiple text files each with a different number of columns?

Re: How can I simultaneously read in multiple text files each with a different number of columns?

Re: How can I simultaneously read in multiple text files each with a different number of columns?

Re: How can I simultaneously read in multiple text files each with a different number of columns?

Re: How can I simultaneously read in multiple text files each with a different number of columns?

Re: How can I simultaneously read in multiple text files each with a different number of columns?

Re: How can I simultaneously read in multiple text files each with a different number of columns?

Re: How can I simultaneously read in multiple text files each with a different number of columns?

Re: How can I simultaneously read in multiple text files each with a different number of columns?

Registration is open