rmacarthur
Pyrite | Level 9

Hi SAS Friends,

I need to divide up hundreds of structured SOP documents (.TXT) into sections that can be edited, updated, replaced, and modified in a controlled manner. A sample .TXT file is attached.

 

Each SOP document can be brought into SAS and a single .SAS file created.

 

Each SOP has standardized headings, such as 

NUMBER, TITLE:, POLICY, PURPOSE, APPLICABILITY, RESPONSIBILITY, MATERIALS, PROCEDURE,

REFERENCES:, CROSS-REFERENCES to other policies:, KEYWORDS: , ORIGINAL DATE ISSUED:, DATE(S) REVIEWED:, DATE(S) REVISED:, REVIEWED BY: 

 

I have looked through many papers on the SAS PRX functions, and discussions of the INDEXW() and INDEX() functions, but cannot find one that addresses this type of application.

 

I can see how INDEXW() could be used to find the start and end of each section, and so create a single variable such as "Title" that contains all text after TITLE and before POLICY. However, it is not clear how to manage the variability in section lengths (sentences and paragraphs), or how to create these separate variables.

 

Can you point me to a paper or book that covers this for Base SAS? Or should a different software tool be used?

Any suggestions greatly appreciated.

Thank you  

1 ACCEPTED SOLUTION
ballardw
Super User

This should get you started.

filename widget "<path to your file>\SOP_widget.txt";

data work.rawwidget;
   length number $ 25 title $ 500
            policy purpose applicability responsibility
            materials procedure references cross_references keywords
            original_date_issued date_reviewed date_revised reviewed_by
            $ 32767
            tempstr $ 32767
   ;
   retain number  title 
            policy purpose applicability responsibility
            materials procedure references cross_references keywords
            original_date_issued date_reviewed date_revised reviewed_by
            tempstr 
   ;
   infile widget lrecl=32767 dsd eof=Lastline;
start:   input @;
   if _infile_ =: 'NUMBER:' then do;
      number = strip(tranwrd(_infile_,'NUMBER:',''));
   end;
   if _infile_ =: 'TITLE:' then do;
      title  = strip(tranwrd(_infile_,'TITLE:',''));
   end;
   if _infile_ =: 'POLICY:' then do;
      policy = strip(tranwrd(_infile_,'POLICY:',''));
   end;
   if _infile_ =: 'PURPOSE:' then do;
      purpose = strip(tranwrd(_infile_,'PURPOSE:',''));
   end;
   if _infile_ =: 'APPLICABILITY:' then do;
      applicability = strip(tranwrd(_infile_,'APPLICABILITY:',''));
   end;
   if _infile_ =: 'RESPONSIBILITY:' then do;
      responsibility = strip(tranwrd(_infile_,'RESPONSIBILITY:',''));
   end;
   if _infile_ =: 'Materials' then do ;
      do until (scan(_infile_,1) = 'Procedure' );
         /* at this point we have a section that needs to be read from multiple lines
            so we release the held line, read the next line and hold it (so the
            INPUT after the loop does not skip a line), and place it into an
            accumulator variable
         */
         input ;
         input @;
         if scan(_infile_,1) = 'Procedure' then leave;
         materials = catx('; ',materials,_infile_);
      end;
   end;

   /* advance to next line in file*/
   input;
   /* continue reading next line into same record*/
   goto start; 
   /* write to output set when last line is read*/
   lastline: Output;
run;

    

This makes some assumptions, such as that the sections that occupied only one line in the example will continue to do so. I assumed that your title would not be very long and set it to 500 characters; increase as needed. If you are sure other variables will be shorter you can set them to a desired maximum length.

There are some moderately complicated timing issues to get only one record output with each variable filled. The RETAIN statement is likely not needed but I can foresee the possibility of needing some OUTPUT statements to debug some behavior.

I only did ONE of the multiple line reader sections leaving that as an example.

There are two labels, START and LASTLINE. These are reference points for program flow. LASTLINE is intended to write only when the end of the file is encountered, and the EOF= INFILE option tells SAS when to execute that statement. The GOTO goes back to the read to advance through the file. If you remove that line, or comment it out, you will get an output record for each line in the file.

 

_INFILE_ is an automatic SAS variable that holds the current input buffer line (up to 32K characters), so you can search it, do things conditionally, and even edit it.

The =: operator is a "starts with" comparison, so if the line starts with your keyword the program executes the desired bit. It is case sensitive, so if your files are consistent in spelling, capitalization and placement this should not be a problem. You may have to modify the code if not.
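If the headings turn out not to be consistent in case or indentation, one option is to normalize the line before the comparison. A minimal sketch, using made-up sample lines in place of _INFILE_ (the names line and is_title are only for illustration):

```sas
data _null_;
   length line $ 200;
   /* hypothetical sample lines standing in for _infile_ */
   do line = 'TITLE: Widget cleaning', '  Title: Widget cleaning', 'Title page';
      /* LEFT moves leading blanks to the end, UPCASE removes case differences */
      is_title = (upcase(left(line)) =: 'TITLE:');
      put line= is_title=;
   end;
run;
```

The first two lines should be flagged and the third should not. Note that TRANWRD is case sensitive as well, so the heading text would need to be removed some other way once the match is relaxed like this.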

 

There are some slightly tricky bits in the multiple-line reading for Materials, and if you have a Materials section NOT followed by Procedure there is going to be a problem. The solution would be to look for any of the keywords that can follow.

There is a not really common instruction LEAVE in the loop that tells the program to leave the loop before adding the word Procedure to the end of the Materials list. Likely there are a couple of other ways but I'm not getting paid to optimize code, this is volunteer labor.

I placed a '; ' between sections of the materials because data with line feeds as part of the values can cause numerous problems. The CATX strips blanks so the code doesn't need to do that. You may or may not need a line separator in the other multi-line bits like p

 

At the end of the  loop that reads materials you should be on the first line where Procedure occurs.
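For the case of a Materials section that is not followed by Procedure, here is a rough sketch of stopping at any of several headings. It uses a flag instead of the nested loop above, upper-cases the first word so the test is not case sensitive, and reads made-up in-stream lines; the heading list and the names in_materials and word are assumptions to adapt to the real files.

```sas
data work.materials_demo (keep=materials);
   length materials $ 32767 word $ 40;
   infile datalines truncover eof=lastline;
   in_materials = 0;
   do while (1);
      input;
      /* first word of the line, upper-cased, with any colon dropped */
      word = upcase(scan(_infile_, 1, ' :'));
      if word = 'MATERIALS' then in_materials = 1;
      else if word in ('PROCEDURE', 'REFERENCES', 'CROSS-REFERENCES', 'KEYWORDS')
         then in_materials = 0;
      else if in_materials then materials = catx('; ', materials, _infile_);
   end;
lastline:
   output;
   stop;
datalines;
Materials
Gloves
Lint-free cloth
REFERENCES: none
Not part of materials
;
run;
```

With these sample lines only the two lines between Materials and REFERENCES should end up in materials. A content line that happens to begin with one of the heading words would still end the section early, so the same caution about consistent files applies.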


7 REPLIES 7
ballardw
Super User

@rmacarthur wrote:

Each SOP document can be brought into SAS and a single .SAS file created.

 


Since .SAS files are programs I do not understand the creation of program files in this context.

 

Manipulating content would typically mean that the content has been placed into a SAS data set. At which point the "paragraph" may run into issues due to limits on the length of SAS variables unless you can state without fear of contradiction that each "paragraph" will be less than 32k characters.

 

Actually INDEXW wouldn't be that much help. The automatic variable _infile_ is likely to be more helpful, especially if keywords appear in the first column like the example. You are likely to have issues with creating single variables due to variable length and the presence of end-of-line characters.

 

How do you expect to edit, update and modify such text in a controlled manner?

 

I have a sneaking suspicion that program source control software might be more along the lines of what is needed but don't have any specific recommendation as to which.

rmacarthur
Pyrite | Level 9

Thank you, 

Yes, the content would be brought into SAS using an INFILE statement, and would be a single SAS dataset. 

A SAS program would then parse the dataset into sections, and make a new variable for each section.

No single section would be larger than 32k characters, so can work within that limitation with no difficulty.

 

rmacarthur
Pyrite | Level 9

This is a fantastic starting point. I will work with this code to get the project going. We're off to a good start! Many thanks!

ballardw
Super User

A caution and a potentially helpful note, though the program structure would change somewhat: you can read multiple files using a single INFILE statement. There are options that return the name of the file currently being read, which lets you add a variable with that information, and flags for when a new file is being read. So if you have a lot of these files it may make sense to have a single "master" data set with all of the information, though I suspect the Number should be sufficient to identify things unless you are building a change log with the different review / modify dates.
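As a rough illustration of those options, the sketch below reads every matching file through one fileref and records the source file name and a new-file flag on each line. The wildcard path is a placeholder, and current_file, source_file, new_file and first_line are names made up for this example.

```sas
/* placeholder path: point the wildcard at the folder holding the SOP files */
filename sops "<path to your files>\*.txt";

data work.sop_lines;
   length current_file source_file $ 260 text $ 32767;
   infile sops lrecl=32767 truncover filename=current_file eov=new_file;
   input;
   source_file = current_file;     /* the FILENAME= variable itself is not saved */
   first_line  = (_n_ = 1 or new_file = 1);
   new_file    = 0;                /* the EOV= variable is not reset automatically */
   text        = _infile_;
run;
```

This only stacks the raw lines with their source. The section parsing would still need to reset its accumulators whenever first_line is 1.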

 

The fun part is getting the start and stops to match.

rmacarthur
Pyrite | Level 9

Yes, exactly, and that's a great suggestion too, thank you.

Ideally, each final SAS data file name will be the SOP number. I am working with the code now to get accustomed to the starts and stops and see how to add those additional features. This is a tremendous help.

 

All the best, 

Robert 

ballardw
Super User

If the base text file names are actually the SOP number then search the forum for the many threads on "how to read multiple files". A pipe can be used to read the results of an operating system directory command to get the names of the files; a data set with those names can then drive the program with CALL EXECUTE in a DATA _NULL_ step. If your SOP numbers have characters other than letters, digits and the underscore character, or are longer than 32 characters, you will have to decide whether the "identity" of the name or a standard SAS data set name is more important.
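A minimal sketch of that approach follows. It assumes Windows (the DIR /B command), a session that allows operating system commands, and a placeholder folder. The macro read_sop is hypothetical and only stands in for whatever step parses one file; the names folder, sopdir, file_name and stem are also made up for the example.

```sas
/* hypothetical macro: stands in for the step that parses one SOP file */
%macro read_sop(file=, out=);
   data &out;
      length text $ 32767;
      infile "&file" lrecl=32767 truncover;
      input;
      text = _infile_;
   run;
%mend read_sop;

/* placeholder folder; DIR /B lists bare file names on Windows */
%let folder = <path to your files>;
filename sopdir pipe "dir /b ""&folder\*.txt""";

data _null_;
   length file_name stem $ 260;
   infile sopdir truncover;
   input file_name $260.;
   stem = scan(file_name, 1, '.');      /* SOP_001.txt -> SOP_001 */
   /* NVALID with 'v7' checks the standard naming rules mentioned above */
   if nvalid(strip(stem), 'v7') then
      call execute(cats('%nrstr(%read_sop)(file=&folder\', file_name,
                        ', out=work.', stem, ')'));
   else put 'Skipped (not a valid SAS name): ' file_name;
run;
```

Files whose names do not qualify as standard SAS names are only reported in the log here, which leaves the naming decision open.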
