Infile input - preserve linefeeds

New Contributor
Posts: 3

Hi all. This is my first post, so please go easy on me. Our team is using SAS Contextual Analysis to pull matching text (e.g. sick leave, wages, etc.) from a collection of collective agreements (samples: https://www.sdc.gov.on.ca/sites/mol/drs/ca/).

 

Our process has two steps. The first step is to create a set of concept rules in Contextual Analysis and process the text files to generate a number of CA datasets. The second step is to run SAS code in Enterprise Guide to extract a block of text surrounding the matched terms. I'm currently trying to extract wage tables from the collective agreements.

 

Here's what I would like to extract from the original text file (converted from PDF). The column alignment below is approximate; please see the attached 611-12921-14 (805-0145).pdf.txt file for the exact text.

SALARY GRID FOR FULL-TIME INSTRUCTORS

May 1, 2010        STEPS
                   Base     1        2        3        4        5        6        7        8        9        10
12-month contract  $44,908  $46,744  $48,580  $50,416  $52,252  $54,088  $55,924  $57,760  $59,596  $61,432  $63,268
10-month contract  $37,423  $38,953  $40,483  $42,013  $43,543  $45,073  $46,603  $48,133  $49,663  $51,193  $52,723

May 1, 2011
                   Base     1        2        3        4        5        6        7        8        9        10
12-month contract  $45,357  $47,211  $49,065  $50,919  $52,773  $54,627  $56,481  $58,335  $60,189  $62,043  $63,897
10-month contract  $37,798  $39,343  $40,888  $42,433  $43,978  $45,523  $47,068  $48,613  $50,158  $51,703  $53,248

May 1, 2012
                   Base     1        2        3        4        5        6        7        8        9        10
12-month contract  $46,264  $48,155  $50,046  $51,937  $53,828  $55,719  $57,610  $59,501  $61,392  $63,283  $65,174
10-month contract  $38,553  $40,129  $41,705  $43,281  $44,857  $46,433  $48,008  $49,584  $51,160  $52,736  $54,312

May 1, 2013
                   Base     1        2        3        4        5        6        7        8        9        10
12-month contract  $47,189  $49,118  $51,047  $52,976  $54,905  $56,834  $58,763  $60,692  $62,621  $64,550  $66,479
10-month contract  $39,324  $40,932  $42,539  $44,147  $45,754  $47,362  $48,969  $50,577  $52,184  $53,792  $55,399

Salary scale excludes 4% vacation pay.

 

Here's the relevant code: 

 

%macro do_snippet; /* header reconstructed from the closing statement; it was not shown in the post */
%do i = 1 %to &counter;

  /*%put Filename &&filename&i;*/
  %let original_length = &&originallength&i;

  /* Open the text file and read in, starting at the offset position */
  data snippet_&concept;
    infile "&&fr&i." lrecl=1000000 recfm=f truncover;
    length additional_provision_text $1000;
    input @&&offset&i additional_provision_text $&totchnk.. @;
    length provision_text $1000;
    input @&&originalstartoffset&i provision_text $&original_length..;
    length quantifiable_value $10;
    quantifiable_value = "&&quantifiable&i";
    length document_filename $256;
    document_filename = "&&filename&i";
    start_offset = &&originalstartoffset&i;
    end_offset = &&originalendoffset&i;
    length = &&original_length;
    document_id = &&docid&i;
    ROW_ID = &&ROWID&i;
  run;

  /* Append each record to the cumulative data set */
  proc append base = &concept
              data = snippet_&concept force;
  run;

%end;
%mend do_snippet;
%do_snippet;

 

I have attached the exported dataset to this post. As you can see, all the linefeeds have been removed from the text in the "additional_provision_text" column, which was read from the "611-12921-14 (805-0145).pdf.txt" file:

 

 

SALARY GRID FOR FULL-TIME INSTRUCTORS May 1, 2010 STEPS Base 1 2 3 4 5 6 7 8 9 10 12-month contract $44,908 $46,744 $48,580 $50,416 $52,252 $54,088 $55,924 $57,760 $59,596 $61,432 $63,268 10-month contract $37,423 $38,953 $40,483 $42,013 $43,543 $45,073 $46,603 $48,133 $49,663 $51,193 $52,723 May 1, 2011 Base 1 2 3 4 5 6 7 8 9 10 12-month contract $45,357 $47,211 $49,065 $50,919 $52,773 $54,627 $56,481 $58,335 $60,189 $62,043 $63,897 10-month contract $37,798 $39,343 $40,888 $42,433 $43,978 $45,523 $47,068 $48,613 $50,158 $51,703 $53,248

 

 

Can someone please tell me how to preserve the linefeeds when the code reads in the text from the source files? Thanks a lot.

Super User
Posts: 23,354

Re: Infile input - preserve linefeeds

Posted in reply to stray_tachyon

You could do this, but I would recommend you consider using Adobe Acrobat Pro or the R package tabulizer instead. This is likely a one-time initiative, and those tools are likely to be more accurate and faster.

Acrobat Pro allows you to convert the document to text or extract a table relatively easily, using either the GUI or JavaScript if you're coding.

New Contributor
Posts: 3

Re: Infile input - preserve linefeeds

We have to process thousands of files. Using Acrobat Pro is not feasible.

Super User
Posts: 23,354

Re: Infile input - preserve linefeeds

Posted in reply to stray_tachyon

Acrobat Pro works with JavaScript and has macro-type functionality as well, so I'm not suggesting point and click here.


Super User
Posts: 13,358

Re: Infile input - preserve linefeeds

Posted in reply to stray_tachyon

@stray_tachyon wrote:

 

Here's what I would like to extract from the original text file (converted from PDF): (Forum software messed up the format.  Please see the attached 611-12921-14 (805-0145).pdf.txt file)


There are two options in this forum to reduce the interference of the forum software with text formatting. The icon bar at the top of the message box has one icon, {I}, that opens a basic code box with no color highlighting; it is usually the best choice for data, though I also use it for code. The other is the "running man" icon to the right of the {I}. That box color-formats SAS code and looks "prettier".

Paste code or data into either one and it should appear at least somewhat cleaner. Characters such as tabs may still appear differently, though, as the forum seems to use a largish number of spaces to display them.

Super User
Posts: 7,944

Re: Infile input - preserve linefeeds

Posted in reply to stray_tachyon

I think you have hidden your basic question under a flurry of too much information.

Can someone please tell me how to preserve the linefeeds when the code reads in the text from the source files?

What do you mean by this statement? Are you saying you want to store multiple lines from the input text file in a single variable of one observation? If so, then you want some variation on code like this. You could use '0A'x to indicate a linefeed instead of '|' in the CATX() call.

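data test ;
  infile 'myfile.txt' truncover ;  /* TRUNCOVER: a short line does not pull data from the next line */
  input var1 $20. / var2 $20. ;    /* "/" moves to the next line, so each observation reads two lines */
  var3 = catx('|',var1,var2);      /* join the two lines with a delimiter */
run;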
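
A minimal sketch of the '0A'x variant mentioned above, assuming the same two-line layout. The file name and the variable lengths are placeholders, not values from the original files.

```sas
data test;
  infile 'myfile.txt' truncover;
  /* var3 must hold both pieces plus the one-byte linefeed */
  length var1 var2 $20 var3 $41;
  input var1 $20. / var2 $20.;
  /* '0A'x is a hex character constant for a linefeed (LF) */
  var3 = catx('0A'x, var1, var2);
run;
```

Note that CATX() strips leading and trailing blanks from each piece and skips pieces that are entirely blank, so indentation and empty lines are not kept with this approach.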

If instead you mean that you want SAS to treat bare linefeeds in the raw text as normal characters and not as end-of-line indicators, then you need two things.

1) The lines have to have something else to mark the true end of line: either CR+LF, as on Windows/DOS, or possibly a bare CR, as the classic Mac OS used.

2) You need to tell the INFILE statement which of those to use.

data test ;
  infile 'myfile.txt' termstr=crlf truncover ;  /* only CR+LF ends a line; a bare LF stays in the data */
  input var3 $50. ;
run;

 

New Contributor
Posts: 3

Re: Infile input - preserve linefeeds

[ Edited ]

Thanks a lot for your suggestions.

The code specifies a location (@&&offset&i) in the text file and the number of characters ($&totchnk) to read, and places that section of text into the variable "additional_provision_text". I would like SAS to keep all of the characters, including \r\n. Is that doable?

 

			infile "&&fr&i." lrecl=1000000 recfm=f truncover;
			length additional_provision_text $1000;
			input  @&&offset&i additional_provision_text $&totchnk.. @;

This is what I want in the "additional_provision_text" variable, in its entirety:

 

                                        SALARY GRID FOR FULL-TIME INSTRUCTORS

May 1, 2010                                                  STEPS

             Base     1                 2        3        4           5        6                      7        8        9        10

12-month

contract     $44,908  $46,744           $48,580  $50,416  $52,252     $54,088  $55,924                $57,760  $59,596  $61,432  $63,268

10-month

contract     $37,423  $38,953           $40,483  $42,013  $43,543     $45,073  $46,603                $48,133  $49,663  $51,193  $52,723

May 1, 2011

             Base     1                 2        3        4           5        6                      7        8        9        10

12-month

contract     $45,357  $47,211           $49,065  $50,919  $52,773     $54,627  $56,481                $58,335  $60,189  $62,043  $63,897

10-month

contract     $37,798  $39,343           $40,888  $42,433  $43,978     $45,523  $47,068                $48,613  $50,158  $51,703  $53,248

May 1, 2012

             Base     1                 2        3        4           5        6                      7        8        9        10

12-month

contract     $46,264  $48,155           $50,046  $51,937  $53,828     $55,719  $57,610                $59,501  $61,392  $63,283  $65,174

10-month

contract     $38,553  $40,129           $41,705  $43,281  $44,857     $46,433  $48,008                $49,584  $51,160  $52,736  $54,312

May 1, 2013

             Base     1                 2        3        4           5        6                      7        8        9        10

12-month

contract     $47,189  $49,118           $51,047  $52,976  $54,905     $56,834  $58,763                $60,692  $62,621  $64,550  $66,479

10-month

contract     $39,324  $40,932           $42,539  $44,147  $45,754     $47,362  $48,969                $50,577  $52,184  $53,792  $55,399

Salary scale excludes 4% vacation pay.

Thanks
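
One way to check whether the \r\n characters are already being read is sketched below. It rests on an assumption that is not confirmed anywhere in this thread: that with RECFM=F the file is read as a fixed-length block, so CR and LF arrive as ordinary data bytes and may only look missing when the value is displayed or exported. The file name, the 1000-byte window starting at column 1, and the names check, txt, n_cr and n_lf are all placeholders. The $CHAR informat is used instead of $ so that leading blanks are kept.

```sas
data check;
  /* same INFILE options as in the original code */
  infile 'myfile.txt' recfm=f lrecl=1000000 truncover;
  length txt $1000;
  /* $CHARw. keeps leading blanks; plain $w. trims them */
  input @1 txt $char1000.;
  /* count carriage returns and linefeeds inside the value */
  n_cr = countc(txt, '0D'x);
  n_lf = countc(txt, '0A'x);
  put n_cr= n_lf=;
run;
```

If n_cr and n_lf come back greater than zero, the line breaks are in the variable, and the next thing to look at would be how the dataset is viewed or exported. Also keep in mind that a $1000 variable truncates anything longer than 1000 characters.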

 
