stray_tachyon
Fluorite | Level 6

Hi all. This is my first post, so please go easy on me 🙂 Our team is using SAS Contextual Analysis to pull matching text (e.g. sick leave, wages, etc.) from a bunch of collective agreements (samples: https://www.sdc.gov.on.ca/sites/mol/drs/ca/).

 

Our process takes two steps. The first step is to create a set of concept rules in Contextual Analysis and process the text files to generate a number of CA datasets. The second step is to run SAS code in Enterprise Guide to extract a blob of text surrounding the matched terms. I'm currently trying to extract wage tables from the collective agreements.

 

Here's what I would like to extract from the original text file (converted from PDF): (Forum software messed up the format.  Please see the attached 611-12921-14 (805-0145).pdf.txt file)

 

SALARY GRID FOR FULL-TIME INSTRUCTORS May 1, 2010 STEPS Base 1 2 3 4 5 6 7 8 9 10 12-month contract $44,908 $46,744 $48,580 $50,416 $52,252 $54,088 $55,924 $57,760 $59,596 $61,432 $63,268 10-month contract $37,423 $38,953 $40,483 $42,013 $43,543 $45,073 $46,603 $48,133 $49,663 $51,193 $52,723 May 1, 2011 Base 1 2 3 4 5 6 7 8 9 10 12-month contract $45,357 $47,211 $49,065 $50,919 $52,773 $54,627 $56,481 $58,335 $60,189 $62,043 $63,897 10-month contract $37,798 $39,343 $40,888 $42,433 $43,978 $45,523 $47,068 $48,613 $50,158 $51,703 $53,248 May 1, 2012 Base 1 2 3 4 5 6 7 8 9 10 12-month contract $46,264 $48,155 $50,046 $51,937 $53,828 $55,719 $57,610 $59,501 $61,392 $63,283 $65,174 10-month contract $38,553 $40,129 $41,705 $43,281 $44,857 $46,433 $48,008 $49,584 $51,160 $52,736 $54,312 May 1, 2013 Base 1 2 3 4 5 6 7 8 9 10 12-month contract $47,189 $49,118 $51,047 $52,976 $54,905 $56,834 $58,763 $60,692 $62,621 $64,550 $66,479 10-month contract $39,324 $40,932 $42,539 $44,147 $45,754 $47,362 $48,969 $50,577 $52,184 $53,792 $55,399 Salary scale excludes 4% vacation pay.

 

Here's the relevant code: 

 

%do i = 1 %to &counter;

  /*%put Filename &&filename&i;*/
  %let original_length = &&originallength&i;

  data snippet_&concept; /* opens the txt file and reads in starting at the offset position */
    infile "&&fr&i." lrecl=1000000 recfm=f truncover;
    length additional_provision_text $1000;
    input @&&offset&i additional_provision_text $&totchnk.. @;
    length provision_text $1000;
    input @&&originalstartoffset&i provision_text $&original_length..;
    length quantifiable_value $10;
    quantifiable_value = "&&quantifiable&i";
    length document_filename $256;
    document_filename = "&&filename&i";
    start_offset = &&originalstartoffset&i;
    end_offset = &&originalendoffset&i;
    length = &&original_length;
    document_id = &&docid&i;
    ROW_ID = &&ROWID&i;
  run;

  proc append base = &concept /* appends each record to a data set */
              data = snippet_&concept force;
  run;

%end;
%mend do_snippet;
%do_snippet;

 

I have attached the exported dataset to this post. As you can see, all the linefeeds have been removed in the "additional_provision_text" column. From the "611-12921-14 (805-0145).pdf.txt" file:

 

 

SALARY GRID FOR FULL-TIME INSTRUCTORS May 1, 2010 STEPS Base 1 2 3 4 5 6 7 8 9 10 12-month contract $44,908 $46,744 $48,580 $50,416 $52,252 $54,088 $55,924 $57,760 $59,596 $61,432 $63,268 10-month contract $37,423 $38,953 $40,483 $42,013 $43,543 $45,073 $46,603 $48,133 $49,663 $51,193 $52,723 May 1, 2011 Base 1 2 3 4 5 6 7 8 9 10 12-month contract $45,357 $47,211 $49,065 $50,919 $52,773 $54,627 $56,481 $58,335 $60,189 $62,043 $63,897 10-month contract $37,798 $39,343 $40,888 $42,433 $43,978 $45,523 $47,068 $48,613 $50,158 $51,703 $53,248

 

 

Can someone please tell me how to preserve the linefeeds when the code reads in the text from the source files? Thanks a lot.

Reeza
Super User

You could do this, but I would recommend you consider using Adobe Pro or the R package tabulizer instead. This is likely a one-time initiative, and those tools are likely to be more accurate and faster.

 

Adobe Pro allows you to convert the document to text or extract a table relatively easily, using either the GUI or JavaScript if you're coding.

 

 

 

stray_tachyon
Fluorite | Level 6

We have to process thousands of files.  Using Acrobat Pro is not feasible

Reeza
Super User

Adobe Pro works with JavaScript and has macro type functionality as well. I'm not suggesting point and click here either.

 


@stray_tachyon wrote:

We have to process thousands of files.  Using Acrobat Pro is not feasible


 

ballardw
Super User

@stray_tachyon wrote:

 

Here's what I would like to extract from the original text file (converted from PDF): (Forum software messed up the format.  Please see the attached 611-12921-14 (805-0145).pdf.txt file)


There are two options in this forum to reduce the interference of the forum software with text formatting. The icon bar at the top of the message box has one icon, {I} (since changed to </>), that opens a basic code box with no color highlighting or the like. It is usually the best choice for data, though I also use it for code. The other is the "running man" to the right of the "{I}". This box color-formats SAS code and looks "prettier".

Paste code or data into either one and the data should appear at least somewhat cleaner. Things like TAB characters though may appear differently as it seems the forum uses a largish number of spaces to display tabs.

Tom
Super User

I think you have hidden your basic question under a flurry of too much information.

Can someone please tell me how to preserve the linefeed when the code read in the text from the source files? 

What do you mean by this statement?  Are you saying you want to store multiple lines from the input text file into a single observation of a variable?  If so then you want some variation on code like this.  You could use '0A'x to indicate a linefeed instead of '|' in the CATX() call.

data test ;
  infile 'myfile.txt' ;
  input var1 $20. / var2 $20. ;
  var3 = catx('|',var1,var2);
run;
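
As a rough sketch of the linefeed variant just mentioned, the same step with '0A'x as the CATX() delimiter could look like this. The file name is still a placeholder, and the TRUNCOVER option and the explicit LENGTH statement are assumptions added for illustration, not part of the code above.

```sas
data test ;
  /* 'myfile.txt' is a placeholder path.                          */
  /* TRUNCOVER (an added assumption) keeps a line shorter than    */
  /* 20 characters from making INPUT jump ahead to the next line. */
  infile 'myfile.txt' truncover ;
  input var1 $20. / var2 $20. ;
  /* Room for both values plus the one-byte delimiter. */
  length var3 $41 ;
  /* '0A'x is a linefeed, so var3 holds the two lines separated by LF. */
  var3 = catx('0A'x, var1, var2) ;
run;
```

Note that CATX() strips leading and trailing blanks from each value before joining, so any indentation in the source lines is not kept by this approach.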

If instead you mean that you want SAS to treat bare linefeeds in the raw text as normal characters and not end of line indicators then you need two things. 

1) The lines have to have something else to mark the true end of line. So either CR+LF like on Windows/DOS, or possibly a bare CR like the original MacOS used to use.

2) You need to tell the INFILE statement which of those to use.

data test ;
  infile 'myfile.txt' termstr=crlf ;
  input var3 $50. ;
run;

 

stray_tachyon
Fluorite | Level 6

Thanks a lot for your suggestions

 

The code specifies a location (@&&offset&i) in the text file and the number of characters ($&totchnk) to read in, and places that section of text into the variable "additional_provision_text".  I would like SAS to place all characters there, including \r\n.  Is that doable?

 

			infile "&&fr&i." lrecl=1000000 recfm=f truncover;
			length additional_provision_text $1000;
			input  @&&offset&i additional_provision_text $&totchnk.. @;

This is what I want in the "additional_provision_text" variable, in its entirety:

 

                                        SALARY GRID FOR FULL-TIME INSTRUCTORS

May 1, 2010                                                  STEPS

             Base     1                 2        3        4           5        6                      7        8        9        10

12-month

contract     $44,908  $46,744           $48,580  $50,416  $52,252     $54,088  $55,924                $57,760  $59,596  $61,432  $63,268

10-month

contract     $37,423  $38,953           $40,483  $42,013  $43,543     $45,073  $46,603                $48,133  $49,663  $51,193  $52,723

May 1, 2011

             Base     1                 2        3        4           5        6                      7        8        9        10

12-month

contract     $45,357  $47,211           $49,065  $50,919  $52,773     $54,627  $56,481                $58,335  $60,189  $62,043  $63,897

10-month

contract     $37,798  $39,343           $40,888  $42,433  $43,978     $45,523  $47,068                $48,613  $50,158  $51,703  $53,248

May 1, 2012

             Base     1                 2        3        4           5        6                      7        8        9        10

12-month

contract     $46,264  $48,155           $50,046  $51,937  $53,828     $55,719  $57,610                $59,501  $61,392  $63,283  $65,174

10-month

contract     $38,553  $40,129           $41,705  $43,281  $44,857     $46,433  $48,008                $49,584  $51,160  $52,736  $54,312

May 1, 2013

             Base     1                 2        3        4           5        6                      7        8        9        10

12-month

contract     $47,189  $49,118           $51,047  $52,976  $54,905     $56,834  $58,763                $60,692  $62,621  $64,550  $66,479

10-month

contract     $39,324  $40,932           $42,539  $44,147  $45,754     $47,362  $48,969                $50,577  $52,184  $53,792  $55,399

Salary scale excludes 4% vacation pay.

Thanks
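
One way to check whether the CR and LF bytes actually reach the variable is to count them after the read. The following is only a sketch under stated assumptions: the fileref demo, the variable chunk, and the counters n_cr and n_lf are hypothetical names, the sample file is written into the WORK directory purely for the test, and $CHARw. is used instead of $w. so that leading blanks are kept as well.

```sas
/* Hypothetical two-line sample file in the WORK directory. */
filename demo "%sysfunc(pathname(work))/demo.txt";

data _null_;
  file demo termstr=crlf;   /* write CR+LF line endings */
  put 'May 1, 2010';
  put 'Base 1 2 3';
run;

data check;
  /* RECFM=F reads fixed-size blocks rather than lines, as in the */
  /* INFILE statement above.                                      */
  infile demo recfm=f lrecl=100 truncover;
  length chunk $100;
  input @1 chunk $char100.;
  /* Count carriage returns and linefeeds held in the variable. */
  n_cr = countc(chunk, '0D'x);
  n_lf = countc(chunk, '0A'x);
  put n_cr= n_lf=;
run;
```

If both counts in the log are nonzero, the line endings are being stored in the variable, and the flattened appearance would then more likely come from how the value is displayed or exported than from the read itself.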

 
