Reading complex flat file - multiple lines with varying lines numbers

Reply
Occasional Contributor
Posts: 15

Reading complex flat file - multiple lines with varying lines numbers

[ Edited ]

Hello, stumped once again on how to read a complex flat file. I want to create single observations from a section of text that spans multiple lines, each line can be defined as it's own variable but not every variable will be available with each section, and a subsection (the job section) should be read as new observations within. 

 

(variable)   Example data_line:

(name)    SCHEDULE SERVER#name1

          DESCRIPTION *can be on multiple lines, should be ignored;
(runDay)  ON RUNCYCLE date1

          EXCEPTION rule1 *can be on multiple lines, should be ignored;
(runTime) AT time11 UNTIL time12
(rep)     REPEAT yes

(nameDep) FOLLOWS namex
          : *every section has this break line where jobs are listed;
(job )    SERVER#job1

(jobTime)  AT timej11 UNTIL timej12

(fileDep)  OPENS filename

(jobDep)   FOLLOWS jobx

 

(job)     SERVER#job2

(jobDep)  FOLLOWS joby

          END *every section ends with this line but I don't need it;

 

From the sample data above, what's in CAPS is constantly there if the variable is available so they can be used as pointers/delimiters. The sample text would also create only two observations in the output, and the block would repeat but some lines would be missing, which lines will be missing is unknown. The blocks are also separated by blank lines and comments that lead with '#'. Variables that are present always occur in the same order. The output for the above text plus a second block of text would look like this:

(name)  (runDay)  (runTime)  (rep)     (nameDep)  (job)  (jobTime)  (fileDep)  (jobDep) 

 name1   day1      time11     yes       namex      job1   timej1     filename   jobx

 name1   day1      time11     yes       namex      job2                         joby

 name2   day2                 yes                  job1   timej1     filename   

 name2   day2                 yes                  job2   timej2                job1

 name2   day2                 yes                  job3                         job2

 

 

After all that, this is my code so far. Am I over complicating it? How can I finish it? My [biggest] current problem is that when variables are missing, the next variable does not have what was previously read.

 

Current code:

data myOutput;
infile myFile truncover;
input;
retain name runDay runTime rep nameDep job jobTime fileDep jobDep;
array myVars{*} $250 name runDay runTime rep nameDep job jobTime fileDep jobDep;

if _infile_ =: "SCHEDULE" then do; 
	myVars(1) = _infile_;

	do i = 2 to dim(myVars);
		input @1 myVars(i) $250.;
			if myVars(i) =: 'DESCRIPTION' then do; i=i-1; end;
			if myVars(i) =: ':' then do; myVars(i+1)= myVars(i); myVars(i)= ' '; end;
/*need more code here to read the rest of file*/
	end;
end; 
run;

Despite it being a long description of the problem, let me know if I can be more clear! Thanks so much for reading!

Regular Contributor
Posts: 161

Re: Reading complex flat file - multiple lines with varying lines numbers

Do these CAPS words appear in the same position in the record every time?

Kannan Deivasigamani
Occasional Contributor
Posts: 15

Re: Reading complex flat file - multiple lines with varying lines numbers

Yes, they do.
Grand Advisor
Posts: 10,211

Re: Reading complex flat file - multiple lines with varying lines numbers

It may help to provide a few more examples of input data to show more of the actual behavior but a key part that do not output anything until you get to the END.

One important thing to know is there something that always indicates the rows with the variables and only appears at the beginning of those lines? For example does "ON RUNCYCLE" always start the Runday value and can it appear at the start of any other line? If that is the case then you can parse each line. Your start is good, just incomplete.

 

So add additional lines like:

 

if _infile_ =: "ON RUNCYCLE" then runDay= _infile_; /* or parse as needed*/

if _infile_ =: "AT" then runTime=_infile_;  /* this may require if _infile_ =: "AT" and index(_infile_,"UNTIL")>0 .... if UNTIL is needed to differentiate from text that may appear in the other lines you don't want*/

continu in this pattern until:

if _infile_ = "END" then output;

 

There is unfortunately your kicker about the the block of variables related to the job variables. Which means you need a flag that resets to 0 when you hit end and is set to 1 at the first encounter of SERVER#. So if you find SERVER# AND the flag has been set then before setting the job variable value you need to output the accumulated variables and reset the the other variables. Something like:

if _infile_ =: "SERVER" then do;

   if flag=1 then do;

      output;

      job= _infile_; /* of the parsed bit*/

      call missing(jobtime,filedep,jobdep);

   end;

   else do;

       job=_infile_;

       flag=1;

   end;

end;

/* and the END part might look more like*/

if _infile_ = "END" then do;

   output;

   flag=0;

   /* and because not all of the variables will be set for each record reinitialize to missing*/

   call missing (name, runDay, runTime, rep, nameDep, job, jobTime, fileDep, jobDep);

end;

Occasional Contributor
Posts: 15

Re: Reading complex flat file - multiple lines with varying lines numbers

@ballardw
if I am to put more "if _infile_ =:" statements to create my variables, where would I place them? It'd have to be in some sort of loop because I want it to re-enter the variable for each new SCHEDULE and don't want the previous to be retained if that variable does not exist for that new schedule.
Grand Advisor
Posts: 10,211

Re: Reading complex flat file - multiple lines with varying lines numbers

Unless there are multiple items to be read within one schedule other that the job bit then it would be:

One infile for each of the items from (name) to (namedep) and then the bit with the flags for the (job) items.

 

That's why I mentioned the End flag after output does a call missing to clear everything up. Call Missing is a handy routine that sets the values for the listed variables to missing.

 

Esteemed Advisor
Esteemed Advisor
Posts: 7,217

Re: Reading complex flat file - multiple lines with varying lines numbers

Hi,

 

Well, first question I would ask is why is it in that format, and can it be changed.  XML would be far better as an output to deal with and most commercial products will output to that now.  

 

If your stuck with it though, my personal preference would be to read in the flat file as one big column of text.  Next have datastep which goest through the data and pulls the tags (text) out into a separate column.  You are then left with a dataset with a column of tags, and a column of data which you can further process to whatever end you like.

 

data have;
  infile datalines dlm="¬";
  length line $2000;
  input line;
datalines;
(variable)   Example data_line:
(name)    SCHEDULE SERVER#name1
          DESCRIPTION 
(runDay)  ON RUNCYCLE date1
          EXCEPTION rule1 
(runTime) AT time11 UNTIL time12
(rep)     REPEAT yes
(nameDep) FOLLOWS namex
          : 
(job )    SERVER#job1
(jobTime)  AT timej11 UNTIL timej12
(fileDep)  OPENS filename
(jobDep)   FOLLOWS jobx
 
(job)     SERVER#job2
(jobDep)  FOLLOWS joby
          END 
;
run;

data want;
  set have;
  length tags $200;
  retain tags;
  if index(line,"(") then do;
    tags=substr(line,index(line,"("),(index(line,")")-index(line,"(")+1));
    line=strip(tranwrd(line,strip(tags),""));
  end;
run;
Occasional Contributor
Posts: 15

Re: Reading complex flat file - multiple lines with varying lines numbers

I still haven't figured this out, here's a better picture of the data and the desired out put - I tried to create some different combinations of texts that one would expect. Sorry for the tiny output font, didn't know how to fit it all on the screen.

 

Sample txt data (I can only get it in a flat file, no XML):

##comment - block of text ONE, 3 jobs so 3 observations

 

SCHEDULE SERVER#name1
DESCRIPTION some comment
DESCRIPTION some comment
ON RUNCYCLE run day 1
EXCEPT some comment
AT run time 1
REPEAT
OPENS file name x
FOLLOWS some name
FOLLOWS some other name
:
SERVER#job1
 AT job1 time
OPENS file name y
 FOLLOWS some job

 

SERVER#job2
 FOLLOWS job1

SERVER#job3
END

 

##comment - block of text TWO, only 1 job so 1 observation

 

SCHEDULE SERVER#name2
ON RUNCYCLE run day 2
REPEAT
FOLLOWS some name
:
SERVER#job1
END

 

#comment1 - block of text THREE, no "REPEAT" or "FOLLOWS", 7 jobs so 7 observations
#comment2

 

SCHEDULE SERVER#name3
ON RUNCYCLE run day 3
AT run time 3
:
SERVER#job1

 

SERVER#job2
 FOLLOWS job1

 

SERVER#job3
 FOLLOWS job2

 

SERVER#job4
 FOLLOWS job3

 

SERVER#job5
 FOLLOWS job4
END

 

#comment1 - block of text FOUR, 10 jobs so 10 observations
#comment2

 

SCHEDULE SERVER#name4
ON RUNCYCLE run day 4
REPEAT
:
OIANP143-TB#job1

 

OIANP143-TB#job2
 FOLLOWS job1

 

OIANP143-TB#job3

 

OIANP143-TB#job4
 FOLLOWS job1
 FOLLOWS job2

 

OIANP143-TB#job5
 AT job 5 time

 FOLLOWS job4

 

OIANP143-TB#job6
END

 

Desired output:

NAME  RUN_DAY   RUN_TIME   REPT   NAME_OPENS  NAME_DEP                   JOB  JOB_RUNTIME JOB_OPENS   JOB_DEP
name1 run day 1 run time 1 repeat file name x some name, some other name job1 job1 time   file name y some job
name1 run day 1 run time 1 repeat file name x some name, some other name job2                         job1
name1 run day 1 run time 1 repeat file name x some name, some other name job3
name2 run day 2            repeat             some name                  job1
name3 run day 3 run time 3                                               job1
name3 run day 3 run time 3                                               job2                         job1
name3 run day 3 run time 3                                               job3                         job2
name3 run day 3 run time 3                                               job4                         job3
name4 run day 4            repeat                                        job1
name4 run day 4            repeat                                        job2                         job1
name4 run day 4            repeat                                        job3
name4 run day 4            repeat                                        job4                         job1, job2
name4 run day 4            repeat                                        job5 job 5 time              job4
name4 run day 4            repeat                                        job6

Ask a Question
Discussion stats
  • 7 replies
  • 341 views
  • 0 likes
  • 4 in conversation