Does anyone have experience using a DOS command within SAS to convert .PDF files to .TXT files so that they can be read back into SAS? I have heard that you have to put SAS to "sleep" during the DOS command, then use an X statement. Thank you for any help!
Accepted Solutions
I don't know of any SAS resource to do this, which doesn't mean it doesn't exist, but here's a blog post I recently came across that covers some tools that do:
Tools for Extracting Data From PDFs — Scott Murray — alignedleft
I was hoping for a SAS executable program that runs start to finish: a libname pointing to the PDFs in question, a conversion step, and then (part 2) pulling the text items into a SAS dataset. Part 2 is manageable. Because of the restricted environment, no outside software is allowed.
If you're in an enterprise environment, though, you're more likely to have access to Adobe Professional. What does your PDF look like?
Adobe has some scripting tools that let you batch process some things relatively painlessly. It helps if you know some JavaScript, though.
I have code to pull the PDF from the following website. It seems the PDF was created directly from Excel.
If it was me:
1. Batch download all files
2. Use Adobe Professional to save as Excel file or XML, which it does nicely
3. Use SAS to extract information from Excel files.
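For step 3, a minimal sketch of reading one converted workbook might look like the following. The file path and dataset name are made up for illustration, and DBMS=XLSX assumes SAS/ACCESS Interface to PC Files is licensed at your site.

```sas
/* Hypothetical path: point this at a workbook saved from the PDF */
proc import datafile="C:\temp\auction.xlsx"
            out=work.auction_raw
            dbms=xlsx
            replace;
  getnames=yes;   /* treat the first row as column names */
run;
```

With many files, the same step could be wrapped in a macro and called once per workbook.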
It depends on what program you are calling in DOS and how it has to interact with SAS. I've had extremely good success with the products from A-PDF.com (Batch extract PDF Form Data), and I've been able to put the calls in the process flow without having to force SAS to sleep.
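As a rough sketch of calling a converter from SAS on Windows without a sleep step: the XSYNC option makes SAS wait until the command finishes, and NOXWAIT closes the command window automatically. The program name `converter.exe` and both paths are placeholders, not a real tool, and this assumes your site has not disabled X commands (NOXCMD).

```sas
/* NOXWAIT: command window closes by itself.
   XSYNC:   SAS waits for the command to finish before continuing. */
options noxwait xsync;

%let pdf = C:\temp\auction.pdf;   /* hypothetical input file  */
%let txt = C:\temp\auction.txt;   /* hypothetical output file */

/* converter.exe is a placeholder for whatever tool is permitted */
x "converter.exe ""&pdf"" ""&txt""";

/* Confirm the text file exists before trying to read it */
data _null_;
  if fileexist("&txt") then put "NOTE: text file created.";
  else put "WARNING: conversion did not produce &txt";
run;
```

Because SAS waits for the command under XSYNC, the DATA step only runs after the converter has exited.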
Jake: I took a closer look at the files you are trying to download, and I doubt that any PDF converter would know how to correctly convert the second page of each of the PDFs on that site.
However, one could easily write a VBScript (for SAS to run) that:
1. Opens Adobe Reader
2. Does a select all (Ctrl-A)
3. Copies the text to the clipboard (Ctrl-C)
4. Opens Notepad
5. Pastes the clipboard into Notepad (Ctrl-V), then goes back to Adobe, selects the next page (down arrow), and repeats the copy/paste steps
6. Saves the Notepad file
7. Has SAS open the resulting .txt file and parse its contents
The first 75% of the file would be easy to parse as all of the desired text starts with the headers:
Auction Date: September 04, 2014
LOT NO. SAMPLE DESCRIPTION MOISTURE PROTEIN RFV CUTTING LOAD SIZE PRICE
and the data that follows the headers is rather straightforward:
869 Large Round 14.96 20.48 82.78 1 15.48 75.00
However, the last approximately 25% of the file didn't make sense to me given the header variables:
872 Medium Square STRAW 78 Bales $ 2 5.00
If those latter lines are all irrelevant, then the problem would be easy to solve.
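If so, the parsing could be sketched roughly as below, using only the sample lines quoted above as inline data. This is one possible approach, not tested against the real files: it assumes each good row is a lot number, a description of one or more words, and then exactly six numeric values, and it drops anything that doesn't fit (such as the STRAW line). The variable names are my own.

```sas
data auction (keep=auction_date lot_no description moisture protein rfv
                   cutting load_size price);
  length line $200 description $40 date_txt $40;
  retain auction_date;
  format auction_date date9.;
  array v[6] moisture protein rfv cutting load_size price;

  infile datalines truncover;
  input line $char200.;

  /* Remember the auction date, e.g. "September 04, 2014" -> 04Sep2014 */
  if line =: 'Auction Date:' then do;
    date_txt = scan(line, 2, ':');
    auction_date = input(cats(scan(date_txt, 2, ' ,'),
                              substr(scan(date_txt, 1, ' ,'), 1, 3),
                              scan(date_txt, 3, ' ,')), date9.);
    delete;
  end;

  /* Data rows start with a numeric lot number */
  n = countw(line, ' ');
  lot_no = input(scan(line, 1, ' '), ?? 12.);
  if missing(lot_no) or n < 8 then delete;

  /* The last six words must all be numeric */
  do i = 1 to 6;
    v[i] = input(scan(line, n - 6 + i, ' '), ?? 12.);
  end;
  if nmiss(of v[*]) > 0 then delete;

  /* Everything between the lot number and the numbers is the description */
  description = '';
  do i = 2 to n - 6;
    description = catx(' ', description, scan(line, i, ' '));
  end;
  datalines;
Auction Date: September 04, 2014
LOT NO. SAMPLE DESCRIPTION MOISTURE PROTEIN RFV CUTTING LOAD SIZE PRICE
869 Large Round 14.96 20.48 82.78 1 15.48 75.00
872 Medium Square STRAW 78 Bales $ 2 5.00
;
run;
```

With these four sample lines, only lot 869 should survive; the header and the STRAW line are discarded. For a real file, replace `datalines` with an INFILE pointing at the .txt file.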
I only have experience extracting text from PDFs, or converting PDF to Word to get at the text, but I have no idea how to convert PDF to TXT directly.
I'm also looking forward to learning a solution for it.
Any other ideas?