marcelo_higasi
Fluorite | Level 6

Hello,

I need to extract data from PDF documents. Is there a way to do this with a SAS procedure or SAS code?

I saw a case where R was required. Unfortunately, that is not an option for me: my company does not allow the use of that software.

 

I saw a module called SAS® Text Miner 14.2. It seems to handle PDF files, but I am not sure whether it requires a separate license.

 

Does anyone know?

 

Thank you,

Marcelo

13 REPLIES
RW9
Diamond | Level 26

Yes, Text Miner is a licensed product; contact SAS for pricing.

Why is R not an option? It's free, and if it does the job, use it.

In normal SAS, no, there is no simple way of reading a PDF. Extracting data from PDFs is a very complex and tricky process, and I highly recommend not going down that route. Return to the source data or, if that is not possible, requisition some data entry.

marcelo_higasi
Fluorite | Level 6

Thank you for the quick response.

R is not available at my company. I will need an alternative solution.

Kind regards

RW9
Diamond | Level 26

Put it on a pen drive; R can be run as a portable install:

https://sourceforge.net/projects/rportable/

 

Reeza
Super User

My work has all USB connections blocked. 😞 

RW9
Diamond | Level 26

IT is not your enemy; there will be a way of getting the required software, just ask them. It is much the same as requesting Adobe, Text Miner, or anything else.

marcelo_higasi
Fluorite | Level 6

Thank you for your help. In my case, getting something outside the "official list" is discouraged. In any case, I will see what can be done. Kind regards





 

RW9
Diamond | Level 26

Ahh, then that is easy.  Your response would be:

PDF is not a data source; it is next to impossible to extract anything from it. Therefore there are three options:

1) Go back to the source and get appropriate data

2) Assign or hire someone to manually enter all the data from the PDF

3) Acquire tools to do such a task

It's up to your company which option they choose, but saying that none of them is possible makes your task impossible.

 

marcelo_higasi
Fluorite | Level 6

My work has all USB connections blocked too! 😞



 


 

surajmetha55
Fluorite | Level 6

You do not need to install R on your PC. You can use it directly on a cloud platform. The following link might be helpful.

link: https://rstudio.cloud/

Reeza
Super User

Adobe Acrobat Professional can export the text/data, and that's the easiest and most accurate method I've found, besides using NVivo or a text mining tool.

 




 

marcelo_higasi
Fluorite | Level 6

Using Adobe Acrobat Professional seems like a possible solution. Thank you!

SundareshS
Obsidian | Level 7
Hi Marcelo,

As far as I am aware, the PDF conversion in Text Miner is based on Apache Tika (https://tika.apache.org/). Tika is a set of Java-based programs that extract data from a number of different document formats: PDFs, PPTs, DOC files, etc.

You do not need Text Miner specifically to access Tika. If you explore your licenses and notice "Document Conversion Server" among your registered products, you may still be able to call the Tika program from the location/port where Document Conversion Server is running.

In any case, you always have the option of installing Tika and calling it from the command-line interface. (It is a pretty lightweight utility.)
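
For anyone who wants to try the command-line route, here is a minimal sketch of what calling Tika could look like. It is written in Python purely for illustration and only wraps the `java -jar tika-app.jar --text <file>` invocation of the standalone Tika app. The jar name, the file paths, and the helper names `build_tika_command` and `extract_text` are assumptions made for this example, not anything SAS ships, and Java must be available on the machine.

```python
import subprocess
from pathlib import Path


def build_tika_command(tika_jar, pdf_path):
    """Build the argument list for a Tika app call that prints plain text."""
    # --text asks the Tika app to write the extracted plain text to stdout.
    return ["java", "-jar", str(tika_jar), "--text", str(pdf_path)]


def extract_text(tika_jar, pdf_path):
    """Run Tika on one PDF and return the extracted text as a string."""
    result = subprocess.run(
        build_tika_command(tika_jar, pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Java or Tika fails
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical paths; point these at your own Tika jar and PDF.
    jar = Path("tika-app.jar")
    pdf = Path("report.pdf")
    if jar.exists() and pdf.exists():
        Path("report.txt").write_text(extract_text(jar, pdf), encoding="utf-8")
    else:
        print("Set jar and pdf to real files before running.")
```

The resulting text file could then be read into SAS with an ordinary DATA step and INFILE statement. Bear in mind that the output is unstructured text, so tables in the PDF will usually still need parsing.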
CraigDeVault
SAS Employee

Besides PDF files, there are several types of files that SAS can read in.  You can see a full list at the following URL:

   http://go.documentation.sas.com/?docsetId=tmref&docsetTarget=n1f1hnf1pk8w3in1i2h4v94rty2m.htm&docset...

 

 
