Hi all. My first post so please go easy on me 🙂 Our team is using SAS Contextual Analysis to pull matching text (e.g. sick leave, wages, etc.) from a bunch of collective agreements (samples: https://www.sdc.gov.on.ca/sites/mol/drs/ca/).
Our process takes two steps. The first step is to create a bunch of concept rules in Contextual Analysis and process the text files to generate a number of CA datasets. The second step is to run SAS code in Enterprise Guide to extract a blob of text surrounding the matched terms. I'm currently trying to extract wage tables from the collective agreements.
Here's what I would like to extract from the original text file (converted from PDF): (Forum software messed up the format. Please see the attached 611-12921-14 (805-0145).pdf.txt file)
SALARY GRID FOR FULL-TIME INSTRUCTORS
May 1, 2010
STEPS Base 1 2 3 4 5 6 7 8 9 10
12-month contract $44,908 $46,744 $48,580 $50,416 $52,252 $54,088 $55,924 $57,760 $59,596 $61,432 $63,268
10-month contract $37,423 $38,953 $40,483 $42,013 $43,543 $45,073 $46,603 $48,133 $49,663 $51,193 $52,723
May 1, 2011
Base 1 2 3 4 5 6 7 8 9 10
12-month contract $45,357 $47,211 $49,065 $50,919 $52,773 $54,627 $56,481 $58,335 $60,189 $62,043 $63,897
10-month contract $37,798 $39,343 $40,888 $42,433 $43,978 $45,523 $47,068 $48,613 $50,158 $51,703 $53,248
May 1, 2012
Base 1 2 3 4 5 6 7 8 9 10
12-month contract $46,264 $48,155 $50,046 $51,937 $53,828 $55,719 $57,610 $59,501 $61,392 $63,283 $65,174
10-month contract $38,553 $40,129 $41,705 $43,281 $44,857 $46,433 $48,008 $49,584 $51,160 $52,736 $54,312
May 1, 2013
Base 1 2 3 4 5 6 7 8 9 10
12-month contract $47,189 $49,118 $51,047 $52,976 $54,905 $56,834 $58,763 $60,692 $62,621 $64,550 $66,479
10-month contract $39,324 $40,932 $42,539 $44,147 $45,754 $47,362 $48,969 $50,577 $52,184 $53,792 $55,399
Salary scale excludes 4% vacation pay.
Here's the relevant code:
%macro do_snippet; /* header implied by the %mend below; any earlier statements were not shown in the post */
  %do i = 1 %to &counter;
    /*%put Filename &&filename&i;*/
    %let original_length = &&originallength&i;
    data snippet_&concept; /* opens the txt file and reads in starting at the offset position */
      infile "&&fr&i." lrecl=1000000 recfm=f truncover;
      length additional_provision_text $1000;
      input @&&offset&i additional_provision_text $&totchnk.. @;
      length provision_text $1000;
      input @&&originalstartoffset&i provision_text $&original_length..;
      length quantifiable_value $10;
      quantifiable_value = "&&quantifiable&i";
      length document_filename $256;
      document_filename = "&&filename&i";
      start_offset = &&originalstartoffset&i;
      end_offset = &&originalendoffset&i;
      length = &&original_length;
      document_id = &&docid&i;
      ROW_ID = &&ROWID&i;
    run;
    proc append base = &concept /* appends each record to a data set */
                data = snippet_&concept force;
    run;
  %end;
%mend do_snippet;
%do_snippet;
I have attached the exported dataset to this post. As you can see, all the linefeeds are removed in the "additional_provision_text" column. From the "611-12921-14 (805-0145).pdf.txt" file:
SALARY GRID FOR FULL-TIME INSTRUCTORS May 1, 2010 STEPS Base 1 2 3 4 5 6 7 8 9 10 12-month contract $44,908 $46,744 $48,580 $50,416 $52,252 $54,088 $55,924 $57,760 $59,596 $61,432 $63,268 10-month contract $37,423 $38,953 $40,483 $42,013 $43,543 $45,073 $46,603 $48,133 $49,663 $51,193 $52,723 May 1, 2011 Base 1 2 3 4 5 6 7 8 9 10 12-month contract $45,357 $47,211 $49,065 $50,919 $52,773 $54,627 $56,481 $58,335 $60,189 $62,043 $63,897 10-month contract $37,798 $39,343 $40,888 $42,433 $43,978 $45,523 $47,068 $48,613 $50,158 $51,703 $53,248
Can someone please tell me how to preserve the linefeeds when the code reads in the text from the source files? Thanks a lot.
You could do this, but I would recommend you consider using Adobe Acrobat Pro or the R package tabulizer instead. This is likely a one-time initiative, and either of those is likely to be more accurate and faster.
Acrobat Pro allows you to convert the document to text or extract a table relatively easily, using either the GUI or JavaScript if you're coding.
We have to process thousands of files. Using Acrobat Pro is not feasible.
Acrobat Pro works with JavaScript and has macro-type functionality as well. I'm not suggesting point and click here either.
@stray_tachyon wrote:
Here's what I would like to extract from the original text file (converted from PDF): (Forum software messed up the format. Please see the attached 611-12921-14 (805-0145).pdf.txt file)
There are two options in this forum to reduce the interference of the forum software with text formatting. The icon bar at the top of the message box has an icon, formerly {I} and now </>, that opens a basic code box with no color highlighting or such. It is usually the best choice for data, though I also use it for code. The other is the "running man" to the right of that icon. That box color-formats SAS code and looks "prettier".
Paste code or data into either one and the data should appear at least somewhat cleaner. Things like TAB characters though may appear differently as it seems the forum uses a largish number of spaces to display tabs.
I think you have hidden your basic question under a flurry of too much information.
Can someone please tell me how to preserve the linefeeds when the code reads in the text from the source files?
What do you mean by this statement? Are you saying you want to store multiple lines from the input text file into a single observation of a variable? If so then you want some variation on code like this. You could use '0A'x to indicate a linefeed instead of '|' in the CATX() call.
data test ;
infile 'myfile.txt' ;
input var1 $20. / var2 $20. ;
var3 = catx('|',var1,var2);
run;
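As a minimal sketch of the '0A'x variation mentioned above: the file name is the same placeholder, and the TRUNCOVER option and the LENGTH statement are my additions, on the assumption that some lines are shorter than 20 characters.

```sas
data test;
  /* TRUNCOVER (an addition here) keeps a short line from pulling in the next one */
  infile 'myfile.txt' truncover;
  /* var3 needs room for both pieces plus the separator: 20 + 1 + 20 */
  length var1 var2 $20 var3 $41;
  /* the "/" moves the pointer to the next line of the file */
  input var1 $20. / var2 $20.;
  /* '0A'x is a linefeed, used here as the separator instead of '|' */
  var3 = catx('0A'x, var1, var2);
run;
```

Note that CATX strips leading and trailing blanks from each piece and skips pieces that are entirely blank, so empty lines in the source would not survive this approach.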
If instead you mean that you want SAS to treat bare linefeeds in the raw text as normal characters and not end of line indicators then you need two things.
1) The lines have to have something else to mark the true end of line: either CR+LF, as on Windows/DOS, or possibly a bare CR, as the original MacOS used.
2) You need to tell the INFILE statement which of those to use.
data test ;
infile 'myfile.txt' termstr=crlf ;
input var3 $50. ;
run;
Thanks a lot for your suggestions.
The code specifies a location (@&&offset&i) in the text file and the number of characters ($&totchnk) to read in, and places that section of text into the variable "additional_provision_text". I would like SAS to place all characters there, including \r\n. Is that doable?
infile "&&fr&i." lrecl=1000000 recfm=f truncover;
length additional_provision_text $1000;
input @&&offset&i additional_provision_text $&totchnk.. @;
This is what I want to be in the "additional_provision_text" variable in its entirety:
SALARY GRID FOR FULL-TIME INSTRUCTORS
May 1, 2010
STEPS Base 1 2 3 4 5 6 7 8 9 10
12-month contract $44,908 $46,744 $48,580 $50,416 $52,252 $54,088 $55,924 $57,760 $59,596 $61,432 $63,268
10-month contract $37,423 $38,953 $40,483 $42,013 $43,543 $45,073 $46,603 $48,133 $49,663 $51,193 $52,723
May 1, 2011
Base 1 2 3 4 5 6 7 8 9 10
12-month contract $45,357 $47,211 $49,065 $50,919 $52,773 $54,627 $56,481 $58,335 $60,189 $62,043 $63,897
10-month contract $37,798 $39,343 $40,888 $42,433 $43,978 $45,523 $47,068 $48,613 $50,158 $51,703 $53,248
May 1, 2012
Base 1 2 3 4 5 6 7 8 9 10
12-month contract $46,264 $48,155 $50,046 $51,937 $53,828 $55,719 $57,610 $59,501 $61,392 $63,283 $65,174
10-month contract $38,553 $40,129 $41,705 $43,281 $44,857 $46,433 $48,008 $49,584 $51,160 $52,736 $54,312
May 1, 2013
Base 1 2 3 4 5 6 7 8 9 10
12-month contract $47,189 $49,118 $51,047 $52,976 $54,905 $56,834 $58,763 $60,692 $62,621 $64,550 $66,479
10-month contract $39,324 $40,932 $42,539 $44,147 $45,754 $47,362 $48,969 $50,577 $52,184 $53,792 $55,399
Salary scale excludes 4% vacation pay.
Thanks
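One thing that may be worth checking first: with RECFM=F the INFILE statement does no end-of-line processing, so the CR and LF bytes should already be part of the value that is read, and the data grid or the export may simply not be showing them. Here is a rough sketch of such a check, assuming the appended data set is still named &concept. The check_linebreaks data set and the n_cr and n_lf variables are hypothetical names used only for illustration.

```sas
/* Hypothetical check: run after the macro loop has built &concept */
data check_linebreaks;
  set &concept;
  /* count carriage returns ('0D'x) and linefeeds ('0A'x) stored in the value */
  n_cr = countc(additional_provision_text, '0D'x);
  n_lf = countc(additional_provision_text, '0A'x);
  /* write the counts to the log for each row */
  put ROW_ID= n_cr= n_lf=;
  /* dump the first 40 bytes in hex; 0D0A pairs would show up here */
  put additional_provision_text $hex80.;
run;
```

If the counts come back non-zero, the line breaks are being read and the issue is on the display or export side. If leading blanks also matter, the $CHARw. informat (for example $char&totchnk.. in place of $&totchnk..) keeps them, whereas $w. trims them.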