Solved: Reading external unstructured .txt file into sas: how to read specific...

A_Kh · Posted 01-31-2023 03:23 PM

Dear Community,

I'm wondering if there is a way to read a specific part of a text file into SAS. Below is an example of the data, where I need to read only rows 1 to 26 under P(#). The remaining part is unnecessary. I could read it using PROC IMPORT and then keep the first 26 obs in the next data step, but when there are hundreds of files and the number of observations to read changes with each run, this requires a lot of manual effort. I'm looking for a better technique that reads the data based on patterns in the data. Any ideas or tips would be appreciated. Thank you!

P(#)         Est          SE        Grad
     1        0.30        0.05        0.02
     2        1.05        0.07       -0.01
     3        2.88        0.11        0.01
     4        1.23        0.11       -0.01
     5       -0.05        0.04        0.02
     6        0.84        0.06       -0.01
     7        0.94        0.05        0.02
     8        1.04        0.07       -0.01
     9        0.33        0.04        0.02
    10        0.85        0.06       -0.01
    11        2.09        0.09        0.02
    12        1.34        0.10       -0.02
    13        4.29        0.21        0.01
    14        1.90        0.17       -0.02
    15        1.51        0.06        0.02
    16        0.97        0.08       -0.01
    17        3.35        0.16        0.02
    18        1.89        0.15       -0.02
    19        2.40        0.09        0.02
    20        1.32        0.10       -0.02
    21       -0.26        0.04        0.02
    22        0.87        0.06       -0.01
    23        0.15        0.05        0.02
    24        1.01        0.07       -0.01
    25        0.00          --
    26        1.00          --
-2*log(LL) = 33408.05
 #Cycles    A-time    E-time    D-time    M-time    S-time     Total
      26      0.00      0.10      0.00      0.00      0.12      0.22

Parameter Segments
 Segment 1:
  Items= 1
  Parms= 1 2
 Segment 2:
  Items= 2
  Parms= 3 4

Tom · Posted 01-31-2023 03:32 PM

There is no reason to use PROC IMPORT to read a file with only 4 variables. You can write the data step in less code than it would take to write the PROC IMPORT code. And you then have complete control over how it is read.

First let's convert your example back into a physical file:

options parmcards=sample;
filename sample temp;

parmcards4;
P(#)         Est          SE        Grad
     1        0.30        0.05        0.02
     2        1.05        0.07       -0.01
     3        2.88        0.11        0.01
     4        1.23        0.11       -0.01
     5       -0.05        0.04        0.02
     6        0.84        0.06       -0.01
     7        0.94        0.05        0.02
     8        1.04        0.07       -0.01
     9        0.33        0.04        0.02
    10        0.85        0.06       -0.01
    11        2.09        0.09        0.02
    12        1.34        0.10       -0.02
    13        4.29        0.21        0.01
    14        1.90        0.17       -0.02
    15        1.51        0.06        0.02
    16        0.97        0.08       -0.01
    17        3.35        0.16        0.02
    18        1.89        0.15       -0.02
    19        2.40        0.09        0.02
    20        1.32        0.10       -0.02
    21       -0.26        0.04        0.02
    22        0.87        0.06       -0.01
    23        0.15        0.05        0.02
    24        1.01        0.07       -0.01
    25        0.00          --
    26        1.00          --
-2*log(LL) = 33408.05
 #Cycles    A-time    E-time    D-time    M-time    S-time     Total
      26      0.00      0.10      0.00      0.00      0.12      0.22

Parameter Segments
 Segment 1:
  Items= 1
  Parms= 1 2
 Segment 2:
  Items= 2
  Parms= 3 4
;;;;

If you know the file always has at least 26 lines of data and you only want the first 26 then you can use the OBS= option:

data want;
  infile sample firstobs=2 obs=27 truncover;
  input p1-p4;
run;

If the number of lines varies then you could probably decide when to stop or what observations to write out based on the value read for the first column.
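One way to sketch that idea: read the first column as text, stop as soon as it is blank or not all digits, and only then read the rest of the line. The inline data below is a shortened copy of the sample so the step runs on its own; the variable name p_char is made up for illustration, and with a real file you would point INFILE at your own fileref instead of DATALINES.

```sas
data want;
  infile datalines firstobs=2 truncover;   /* assumption: replace DATALINES with your fileref */
  length p_char $32;
  input p_char @;                          /* first token as text; hold the line */
  /* stop at the first line whose first column is blank or not all digits */
  if missing(p_char) or notdigit(strip(p_char)) then stop;
  p = input(p_char, 32.);                  /* convert the row number to numeric */
  input (est se grad) (??);                /* ?? silences notes for the "--" cells */
  drop p_char;
datalines;
P(#)         Est          SE        Grad
     1        0.30        0.05        0.02
     2        1.05        0.07       -0.01
    25        0.00          --
-2*log(LL) = 33408.05
 #Cycles    A-time    E-time    D-time    M-time    S-time     Total
;
```

With this shortened data the step keeps the three numeric rows and stops at the -2*log(LL) line.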

A_Kh · Posted 01-31-2023 04:19 PM

Hi @Tom ,

Thank you for your answer, it is very helpful! As I have never used the INFILE statement to read data before, I don't know enough about its power. What would be the code for reading an existing file (C:\Users\Files\dbg.text) in your example?
The number of observations varies per file, but the layout doesn't (e.g. roughly the first 25-50 obs are numeric values in 4 columns, followed by obs containing unorganized lines of text). It would also be ideal to know how to stop reading once the numeric values end.

Kurt_Bremser · Posted 01-31-2023 04:20 PM

Read until you find a trigger to stop:

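data want;
  /* skip the header line; TRUNCOVER keeps short rows from reading into the next line */
  infile "path to your file" firstobs=2 truncover;
  input @;                            /* load the line and hold it */
  if index(_infile_,"=") then stop;   /* first line containing "=" ends the table */
  input p_ est se grad;
run;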

The step will skip the header line and terminate when an equal sign is detected in the infile.

A_Kh · Posted 01-31-2023 05:09 PM

Thank you, @Kurt_Bremser !

This gives the desired output, but with multiple _ERROR_ notes in the log caused by non-numeric data. Starting from row 27, the SE column contains only -- (double dash), which causes the error. Could this be avoided by using any INFILE statement options?

P(#) Est SE Grad
1 -0.91 0.04 0.00
2 -1.01 0.04 0.00
3 -0.96 0.04 0.00
.
.
25 0.98 0.85 0.00
26 -1.30 1.32 0.00
27 0.00 --
28 0.00 --
29 0.00 --
30 0.00 --
.
.
89 0.00 --
90 0.00 --
91 1.00 --
-2*log(LL) = 85741.39
#Cycles A-time E-time D-time M-time S-time Total
535 0.00 2.69 0.00 0.48 0.61 3.78

Below is the SAS code and its log.

%macro import;
	%do i= 1 %to 31;
		filename dbg_new "&newfolder.\&&file&i...txt";
		data new_%scan(&&file&i, -1, -)&i;
			infile dbg_new firstobs=2 truncover;
			input @;
			if index(_infile_,"=") then stop;
			input p_ est se grad;
		run;
	%end;
%mend; 

%import;

NOTE: Invalid data for se in line 28 29-30.
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+--
28            27        0.00          -- 30
p_=27 est=0 se=. grad=. _ERROR_=1 _INFILE_=27        0.00          -- _N_=27
NOTE: Invalid data for se in line 29 29-30.
29            28        0.00          -- 30
p_=28 est=0 se=. grad=. _ERROR_=1 _INFILE_=28        0.00          -- _N_=28
NOTE: Invalid data for se in line 30 29-30.
30            29        0.00          -- 30
p_=29 est=0 se=. grad=. _ERROR_=1 _INFILE_=29        0.00          -- _N_=29
NOTE: Invalid data for se in line 31 29-30.
31            30        0.00          -- 30
p_=30 est=0 se=. grad=. _ERROR_=1 _INFILE_=30        0.00          -- _N_=30
NOTE: Invalid data for se in line 32 29-30.
32            31        0.00          -- 30
p_=31 est=0 se=. grad=. _ERROR_=1 _INFILE_=31        0.00          -- _N_=31
NOTE: Invalid data for se in line 33 29-30.
33            32        0.00          -- 30
p_=32 est=0 se=. grad=. _ERROR_=1 _INFILE_=32        0.00          -- _N_=32
NOTE: Invalid data for se in line 34 29-30.
34            33        0.00          -- 30
p_=33 est=0 se=. grad=. _ERROR_=1 _INFILE_=33        0.00          -- _N_=33
NOTE: Invalid data for se in line 35 29-30.
35            34        0.00          -- 30
p_=34 est=0 se=. grad=. _ERROR_=1 _INFILE_=34        0.00          -- _N_=34
NOTE: Invalid data for se in line 36 29-30.
36            35        0.00          -- 30
p_=35 est=0 se=. grad=. _ERROR_=1 _INFILE_=35        0.00          -- _N_=35
NOTE: Invalid data for se in line 37 29-30.
37            36        0.00          -- 30
p_=36 est=0 se=. grad=. _ERROR_=1 _INFILE_=36        0.00          -- _N_=36
NOTE: Invalid data for se in line 38 29-30.
38            37        0.00          -- 30
p_=37 est=0 se=. grad=. _ERROR_=1 _INFILE_=37        0.00          -- _N_=37
NOTE: Invalid data for se in line 39 29-30.
39            38        0.00          -- 30
p_=38 est=0 se=. grad=. _ERROR_=1 _INFILE_=38        0.00          -- _N_=38
NOTE: Invalid data for se in line 40 29-30.
40            39        0.00          -- 30
p_=39 est=0 se=. grad=. _ERROR_=1 _INFILE_=39        0.00          -- _N_=39
NOTE: Invalid data for se in line 41 29-30.
41            40        0.00          -- 30
p_=40 est=0 se=. grad=. _ERROR_=1 _INFILE_=40        0.00          -- _N_=40
NOTE: Invalid data for se in line 42 29-30.
42            41        0.00          -- 30
p_=41 est=0 se=. grad=. _ERROR_=1 _INFILE_=41        0.00          -- _N_=41
NOTE: Invalid data for se in line 43 29-30.
43            42        0.00          -- 30
p_=42 est=0 se=. grad=. _ERROR_=1 _INFILE_=42        0.00          -- _N_=42
NOTE: Invalid data for se in line 44 29-30.
44            43        0.00          -- 30
p_=43 est=0 se=. grad=. _ERROR_=1 _INFILE_=43        0.00          -- _N_=43
NOTE: Invalid data for se in line 45 29-30.
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+--
45            44        0.00          -- 30
p_=44 est=0 se=. grad=. _ERROR_=1 _INFILE_=44        0.00          -- _N_=44
NOTE: Invalid data for se in line 46 29-30.
46            45        0.00          -- 30
p_=45 est=0 se=. grad=. _ERROR_=1 _INFILE_=45        0.00          -- _N_=45
NOTE: Invalid data for se in line 47 29-30.
WARNING: Limit set by ERRORS= option reached.  Further errors of this type will not be
         printed.
47            46        0.00          -- 30
p_=46 est=0 se=. grad=. _ERROR_=1 _INFILE_=46        0.00          -- _N_=46
NOTE: 92 records were read from the infile DBG_NEW.
      The minimum record length was 21.
      The maximum record length was 42.
NOTE: The data set WORK.NEW_DBG31 has 91 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

..

Tom · Posted 01-31-2023 05:31 PM

You can use the ?? input modifier to suppress the messages about invalid data.

data want;
  infile sample firstobs=2 truncover;
  input (P Est SE Grad) (??);
  if missing(p) then stop;
run;
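
To see the effect of ?? in isolation, here is a small self-contained sketch that reads two of the "--" rows from this thread as inline data, once without and once with the modifier. The dataset names with_notes and quiet are made up for illustration.

```sas
/* Without a modifier: each "--" produces an "Invalid data" note and sets _ERROR_=1 */
data with_notes;
  infile datalines truncover;
  input p est se grad;
datalines;
25 0.00 --
26 1.00 --
;

/* With ??: SE is still set to missing, but no note is written and _ERROR_ stays 0 */
data quiet;
  infile datalines truncover;
  input (p est se grad) (??);
datalines;
25 0.00 --
26 1.00 --
;
```

Both datasets should hold the same values; only the log differs.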

You could also just read ALL of the files in one data step.

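data all;
  length fileno 8 filename $256 ;
  do fileno=1 to 31 ;
    filename = cats("&newfolder\",symget(cats('file',fileno)),'.txt');
    /* TRUNCOVER keeps rows with no Grad value from reading into the next line */
    infile text filevar=filename firstobs=2 truncover end=eof;
    p=0;
    do while(not missing(p) and not eof);
      input (P Est SE Grad) (??);
      if not missing(p) then output;
    end;
  end;
  stop;  /* all files are read in one pass; do not start another iteration */
run;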

A_Kh · Posted 01-31-2023 06:15 PM

Thank you @Tom and @Kurt_Bremser !

I appreciate your support, this way of reading data in SAS is something fundamental that I should learn.

@Tom , regarding the second part of the code, each of the 31 files is located in a separate folder with a different name. The only common part is the .txt extension (the file names differ as well). Therefore, I've hardcoded the list of files in earlier steps into &file1-&file31. This part is something I could handle by myself, but thank you so much for your input!

Tom · Posted 01-31-2023 09:51 PM

Note that it is easier to just put the list of names into a DATASET instead of bothering to try to figure out how to use the macro language to generate code.

data files;
  infile cards truncover ;
  input filename $256. ;
cards;
filename1.txt
filename2.txt
;

data want;
  set files;
  filevar=filename;
  infile dummy firstobs=2 truncover filevar=filevar end=eof;
  do while (not eof);
      input ....
      output;
  end;
run;

You could even combine the two data steps into one if you wanted.
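
A rough sketch of what the combined step might look like. To keep it runnable on its own, a setup step first writes two tiny files named filename1.txt and filename2.txt into the WORK directory; that setup, the use of PATHNAME('work'), and the INPUT statement borrowed from earlier in the thread are assumptions for illustration, not part of the original answer. With real data you would list full paths after CARDS and assign them to filevar directly.

```sas
/* Setup for the demo only: write two tiny files into the WORK directory */
data _null_;
  length path $512;
  do i = 1 to 2;
    path = cats(pathname('work'), '/filename', i, '.txt');
    file out filevar=path;
    put 'P(#) Est SE Grad';
    put i '0.30 0.05 0.02';
    put '25 0.00 --';
    put '-2*log(LL) = 33408.05';
  end;
run;

/* One step: read a file name from CARDS, then read that file */
data want;
  length filevar $512;
  infile cards truncover;
  input filename $256.;
  filevar = cats(pathname('work'), '/', filename);   /* demo path; use your own */
  infile dummy filevar=filevar firstobs=2 truncover end=eof;
  do while (not eof);
    input (p est se grad) (??);
    if missing(p) then leave;      /* first non-numeric line ends this file */
    output;
  end;
cards;
filename1.txt
filename2.txt
;
```

Each INPUT statement reads from whichever INFILE statement ran most recently, so the step alternates between the list of names and the files themselves. The FILEVAR= variable is not written to the output, which is why the name is kept in the separate filename variable.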

A_Kh · Posted 01-31-2023 06:20 PM

Awesome! This really works.

Thank you so much, @Tom and @Kurt_Bremser !
