ekow_atta
Calcite | Level 5

I am a research student and new to SAS Text Analytics, and I would like some help with data preparation. I need recommendations on some resources I can use. Thanks.

JuliaM
Calcite | Level 5

Can you tell me what you mean by data preparation?

From my experience, SAS Text Analytics can only "read" text files. So any PDFs or Word documents, for example, will need to be converted to text before SAS TA can process them. Does this answer your question, or was there more to it than that?

ekow_atta
Calcite | Level 5

Do you have experience with SAS OnDemand Enterprise Miner?

art297
Opal | Level 21

Text Miner can read numerous types of files, including web pages. The analytics will work better if you can eliminate some of the "noise" but, of course, that isn't always possible.

FionaMcNeill
SAS Employee

Yes Arthur, absolutely. Just to add a bit: document conversion is included in all the SAS Text Analytics products, to translate from Word, PDF, etc. to a normalized .txt format. The Text Import node in Text Miner will crawl web pages and retrieve documents from any site defined. For Categorization and Sentiment Analysis, websites and file systems can be crawled and the information retrieved using SAS Information Retrieval Studio, whose crawling capability is included with Sentiment Analysis and is an add-on component of the Categorization bundle.

Not knowing what type of data you are dealing with, have you seen the white paper "Sifting Through the Noise of Social Media"? It provides some data cleansing and quality tips for social media data.

ekow_atta
Calcite | Level 5

I am working with some legal documents for my research, and I have not been able to bring them into Enterprise Miner OnDemand for Academics. Do you have any material you can suggest I read to solve this issue?

JuliaM
Calcite | Level 5

Not sure how much help this will be but try the support site for SAS OnDemand: support.sas.com/ondemand/

Does anyone else in the forum have experience with Enterprise Miner OnDemand?

jaredp
Quartz | Level 8

According to the Instruction Manual (http://support.sas.com/ondemand/manuals/AcadInstrucManual.pdf), only DATA files can be stored on the SAS Server.  You cannot store program files, course notes, slides...or in your case, PDF documents.

I am guessing that is the issue you are having: your PDFs are not in a SAS data format, which is what gets uploaded through your account portal.

The Instruction Manual has a section called "Requesting Assistance with Custom Course Data on the SAS Server". I think you need to contact these folks and ask them how to use PDFs as a data source (that is, how to get your PDFs into a SAS dataset). This information seems to be lacking in the manual. SAS Support is great, and I am sure they can address your needs.


Related to this discussion: when I took an EM SAS course, we would remote into a Windows server, where EM was installed, using RDP. I am guessing that this was their OnDemand stuff in action. While an RDP session *can* be set up to allow access to the local computer's filesystem, OnDemand may not be set up that way.

If it is set up that way, a person should be able to use the TM Import (or the %tmfilter macro) provided in EM to connect to a folder of PDF documents and generate the SAS dataset. This would certainly be preferable to getting SAS Support to do it for you.


I'd like to add to a previous commenter's point about PDFs needing to be in a text format. A PDF can contain real text, which you can select with your cursor, copy, and then paste into things like MS Word, Notepad, Vim, etc. But a PDF can also be an image (like a JPEG) in which the text is not selectable. This is often the result of 1) a paper document being scanned to PDF, or 2) an electronic document being converted to PDF using a substandard conversion process.

You can get software that performs OCR (optical character recognition) on PDF documents that are an 'image' (the paid version of Adobe Acrobat Pro can do this). OCR works best on machine-printed fonts (Times New Roman, Arial, etc.) as opposed to handwriting, although the latter is possible.

ekow_atta
Calcite | Level 5

Your post has been really helpful. I have tried to create SAS datasets with the TM Import and with code, but neither has worked for me. I contacted OnDemand support but have not been successful yet. Can you please help with code, or any other way I can create a SAS dataset from a PDF or a text file? I see you have some experience with OnDemand Academics. Thanks.

jaredp
Quartz | Level 8

I actually have no experience with OnDemand analytics.  But I do have experience with Enterprise Miner / Text Miner and the Text Import Node.

I'm looking at one of my projects which imports Word Doc (.doc, .docx) and PDF files (.pdf).  The PDF documents are 'text format'.

You can take two approaches, both of which require being able to point to your directory of PDF documents:

1) Text Import Node

2) %tmfilter macro

In the Text Import Node, all I did was change the following properties:

Import File Directory:

Destination Directory:

Text Size: 10,000

The Import File Directory is where your PDF documents are located. The Destination Directory is where the resulting output is written (this is not a SAS dataset). The node creates a text file there, one for each PDF or other document. You can then proceed to attach your Text Parsing node and other nodes accordingly.

Text Size refers to the number of characters you want to keep from each document. If you chose 100 but had a PDF document with 500 characters, then your observation would be truncated: only the first 100 characters would be stored in the SAS dataset.
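To make the truncation concrete, here is a minimal sketch that mimics a Text Size of 100 using an ordinary character variable. It does not call the Text Import node; the dataset and variable names (text_size_demo, full_text, stored_text) are made up for illustration, and the assumption is simply that the stored text behaves like a fixed-length character variable.

```sas
/* Illustration only: mimic a Text Size of 100 with a 100-character variable. */
data work.text_size_demo;
  length full_text $500 stored_text $100;
  full_text   = repeat('a', 499);    /* REPEAT returns n+1 copies, so 500 characters */
  stored_text = full_text;           /* silently truncated to the first 100 characters */
  full_len    = length(full_text);   /* length of the original text */
  stored_len  = length(stored_text); /* length of what was actually kept */
run;

proc print data=work.text_size_demo(keep=full_len stored_len);
run;
```

SAS raises no error when the longer value is assigned to the shorter variable, which is why a Text Size that is too small is easy to miss.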

If you want to create an actual SAS dataset, use the %tmfilter macro. Once you have your SAS dataset, you can proceed to use it in your projects. Code for the macro looks like this (you can run it from a SAS Code node in EM):

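/* Library where the resulting SAS dataset will be saved */
libname a 'C:\EM_Projects\my_project_folder';

/* dir= is the folder that contains the documents to import; numbytes= limits how much text is kept per document */
%tmfilter(dataset=a.data_want, dir=C:\EM_Projects\my_project_folder\output, numbytes=32000);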

The SAS Help files in EM are a fantastic resource to learn about Nodes or the %tmfilter macro.

Once you have your dataset, you will then have to import or attach to this data from your EM project.
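Before attaching the dataset to a project, it may be worth checking that the macro actually produced rows. The sketch below reuses the libname and dataset name from the example above and uses only standard procedures, so it assumes nothing about the variable names that %tmfilter creates.

```sas
libname a 'C:\EM_Projects\my_project_folder';

/* List the variables and the number of observations in the dataset. */
proc contents data=a.data_want;
run;

/* Look at the first few documents to confirm that text was extracted. */
proc print data=a.data_want(obs=5);
run;
```

If the dataset has zero observations, or the text column is empty for some files, one thing to check is whether those PDFs are image-only scans rather than text PDFs.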

javidiqbal
Calcite | Level 5

Data can be brought in through SAS Studio.
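As a rough sketch of what that could look like for a document already converted to plain text: the DATA step below reads a text file into a SAS dataset, one observation per line. The file path, the dataset name, and the variable names are all hypothetical placeholders, and it assumes the file has already been uploaded somewhere your SAS session can read.

```sas
/* Hypothetical path: replace with the location of your uploaded file. */
filename doc '/home/your_user_id/legal_docs/contract1.txt';

data work.contract_lines;
  infile doc truncover lrecl=32767;  /* TRUNCOVER keeps short lines instead of skipping ahead */
  length source $40 line $2000;
  input line $char2000.;             /* keep leading blanks; lines longer than 2000 characters are cut */
  source  = 'contract1.txt';         /* record which document each line came from */
  line_no = _n_;                     /* line position within the file */
run;
```

This gives one row per line rather than one row per document, so the lines may need to be combined before text mining if a per-document observation is required.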

