marcelo_higasi
Fluorite | Level 6

Hello,

I need to extract data from PDF documents. Is there a way to do it using a SAS procedure or SAS code?

I saw a case where R was required. Unfortunately, this is not an option for me: my company does not allow the use of this software.

 

I saw a module called SAS® Text Miner 14.2. It seems to handle PDF, but I am not sure whether it requires a separate license.

 

Does anyone know?

 

Thank you,

Marcelo

RW9
Diamond | Level 26

Yes, Text Miner is a separately licensed product; contact SAS for pricing.

Why is R not an option? It's free, and if it does the job, use it.

In normal SAS, no, there is no simple way of reading a PDF. Extracting data from PDFs is a very complex and tricky process, and I highly recommend not going down that route. Return to the source data, or if that is not possible, requisition some data entry.

marcelo_higasi
Fluorite | Level 6

Thank you for the quick response.

R is not available at my company. I will need an alternative solution.

Kind regards

RW9
Diamond | Level 26

Put it on a pen drive; R can be run as a portable installation:

https://sourceforge.net/projects/rportable/

 

Reeza
Super User

My work has all USB connections blocked. 😞 

RW9
Diamond | Level 26

IT are not your enemy. There will be a way of getting the required software; just ask them, much the same as you would to get Adobe, Text Miner, or something else.

marcelo_higasi
Fluorite | Level 6

Thank you for your help. In my case, getting something outside the "official list" is discouraged. In any case, I will see what can be done. Kind regards





 

RW9
Diamond | Level 26

Ahh, then that is easy.  Your response would be:

PDF is not a data source, and it is next to impossible to extract anything from it. Therefore there are three options:

1) Go back to source and get appropriate data

2) Assign/hire someone to manually enter all the data from the PDF

3) Acquire tools to do such a task

 

It's up to your company which one they choose, but if they say none of those is possible, your task becomes impossible.

 

marcelo_higasi
Fluorite | Level 6

My work has all USB connections blocked too! 😞



 


 

surajmetha55
Fluorite | Level 6

You do not need to install R on your PC. You can use it directly on a cloud platform. See the following link; it might be helpful.

link: https://rstudio.cloud/

Reeza
Super User

Adobe Professional can export the text/data, and that's the easiest and most accurate method I've found, apart from using NVivo or a text mining tool.

 




 

marcelo_higasi
Fluorite | Level 6

Using Adobe Professional seems like a possible solution. Thank you!

SundareshS
Obsidian | Level 7
Hi Marcelo,

As far as I am aware, the PDF conversion in Text Miner is based on Apache Tika (https://tika.apache.org/). Tika is essentially a set of Java-based programs that help extract data from a number of different document formats: PDFs, PPTs, DOC files, etc.
You do not need Text Miner specifically to access Tika. If you explore your licences and happen to notice "Document Conversion Server" among your registered products, you may still be able to call the Tika program from the location/port where the document conversion server is running.
In any case, you always have the option of installing Tika and calling it from the command line. (It is a pretty lightweight utility.)
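To make the command-line route a bit more concrete, here is a minimal sketch, assuming the Tika application jar has been downloaded and Java is available on the PATH. It is written in Python purely for illustration; the jar file name `tika-app.jar`, the input file `report.pdf`, and the helper function names are all hypothetical. The `--text` option asks the Tika app for plain-text output on standard output.

```python
import subprocess


def build_tika_command(jar_path, pdf_path):
    """Return the argument list that runs the Tika app in plain-text mode."""
    # "--text" tells the Tika app to write the extracted plain text to stdout.
    return ["java", "-jar", str(jar_path), "--text", str(pdf_path)]


def clean_lines(raw_text):
    """Split extracted text into stripped, non-empty lines."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


def extract_text(jar_path, pdf_path):
    """Run Tika on one PDF and return its text as a list of lines.

    Assumes Java is installed and jar_path points at a Tika app jar.
    Raises subprocess.CalledProcessError if the Tika process fails.
    """
    result = subprocess.run(
        build_tika_command(jar_path, pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return clean_lines(result.stdout)


# Hypothetical usage (file names are placeholders, not from this thread):
# lines = extract_text("tika-app.jar", "report.pdf")
# for line in lines[:10]:
#     print(line)
```

The same `java -jar ... --text` command could equally be run from a plain command prompt with its output redirected to a text file, which SAS can then read with an ordinary INFILE statement. Some parsing will generally still be needed afterwards, because the extracted text does not preserve table structure.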
CraigDeVault
SAS Employee

Besides PDF files, there are several types of files that SAS can read in.  You can see a full list at the following URL:

   http://go.documentation.sas.com/?docsetId=tmref&docsetTarget=n1f1hnf1pk8w3in1i2h4v94rty2m.htm&docset...

 

 
