Text mining and content categorization

Data Preparation

New Contributor
Posts: 4

Data Preparation

I am a research student and new to SAS Text Analytics, and I would like some help with data preparation. I need recommendations on some resources I can use. Thanks.

Contributor
Posts: 36

Re: Data Preparation

Can you tell me what you mean by data preparation?

In my experience, SAS Text Analytics can only "read" text files. So any PDFs or Word documents, for example, will need to be converted to text before SAS TA will be able to produce output. Does this answer your question, or was there more to it than that?

New Contributor
Posts: 4

Re: Data Preparation

Do you have experience with SAS OnDemand Enterprise Miner?

Esteemed Advisor
Posts: 7,060

Re: Data Preparation

Text Miner can read numerous types of files, including web pages. The analytics will work better if you can eliminate some of the "noise" but, of course, that isn't always possible.

SAS Employee
Posts: 17

Re: Data Preparation

Yes Aurthur - absolutely. Just to add a bit, document conversion is included in all the SAS Text Analytics products, to translate from Word, .pdf, etc. to a normalized .txt format. The Text Import node in Text Miner will crawl web pages and retrieve documents from any site defined. For Categorization and Sentiment Analysis, web sites and file systems can be crawled and the information retrieved using SAS Information Retrieval Studio, which for crawling is included with Sentiment Analysis and is an add-on component of the Categorization bundle.

Not knowing what type of data you are dealing with, have you seen this SAS white paper: "Sifting Through the Noise of Social Media"? It provides some data cleansing/quality tips for social media data.

New Contributor
Posts: 4

Re: Data Preparation

I am working with some legal documents for my research and I have not been able to bring them into Enterprise Miner OnDemand Academics. Do you have any material you can suggest I read to solve this issue?

Contributor
Posts: 36

Re: Data Preparation

Not sure how much help this will be but try the support site for SAS OnDemand: support.sas.com/ondemand/

Does anyone else in the forum have experience with Enterprise Miner OnDemand?

Contributor
Posts: 71

Re: Data Preparation

According to the Instruction Manual (http://support.sas.com/ondemand/manuals/AcadInstrucManual.pdf), only DATA files can be stored on the SAS Server.  You cannot store program files, course notes, slides...or in your case, PDF documents.

I am guessing that is likely the issue you are having, since your PDFs are not in a SAS data format that could then be uploaded through your account portal.

The Instruction Manual has a section "Requesting Assistance with Custom Course Data on the SAS Server". I think you need to contact these folks and ask them how to use PDFs as a data source (that is, how to get your PDFs into a SAS dataset). This information seems to be lacking in the manual. SAS Support is great; I am sure they can address your needs.


Related to this discussion: when I took an EM SAS course, we would remote into a Windows server using RDP where EM was installed. I am guessing that this was their OnDemand stuff in action. While an RDP session *can* be set up to allow access to the local computer's filesystem, OnDemand may not be set up that way.

If it is set up that way, a person should be able to use the TM Import (or %tmfilter macro) provided in EM to connect to a folder of PDF documents and generate the SAS dataset. This certainly would be better than getting SAS support to do it for you.


I'd like to add to a previous commenter's point about PDFs needing to be in a text format. PDF documents can contain real text, where you can use your cursor to select, copy and then paste into things like MS Word, Notepad, VIM, etc. But PDFs can also be like an image (like a JPEG) where text is not selectable. This is often the result of 1) a paper document being scanned to PDF or 2) an electronic document being converted to PDF using a sub-standard conversion process.

You can get software that can perform OCR (optical character recognition) on PDF documents that are an 'image' (the paid version of Adobe Acrobat Pro can do this). OCR works best on machine-print fonts (Times New Roman, Arial, etc.) as opposed to handwriting, although the latter is possible.
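One way to spot image-only PDFs after an import is to look for documents whose extracted text came back empty. This is only a minimal sketch: WORK.DOCS, NAME and TEXT are made-up names for illustration, and the two sample rows are invented, so check the variable names in your own dataset before adapting it.

```sas
/* Toy stand-in for an imported document table (hypothetical names and rows). */
data work.docs;
    infile datalines dsd truncover;
    length name $40 text $200;
    input name $ text $;
    datalines;
contract_a.pdf,"This agreement is made between the parties"
scanned_b.pdf,
;
run;

/* Keep only the documents for which no text was extracted;
   these are the candidates for OCR. */
data work.no_text;
    set work.docs;
    if missing(text);
run;

proc print data=work.no_text;
run;
```

With the toy rows above, only scanned_b.pdf ends up in WORK.NO_TEXT.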

New Contributor
Posts: 4

Re: Data Preparation

Your post has been really helpful. I have tried to create SAS datasets with the TM Import and with code, but neither has worked for me. I contacted OnDemand support and I have not been successful yet. Can you please help with code or any other way I can create a SAS dataset from a PDF or a text file? I see you have some experience with OnDemand Academics. Thanks

Contributor
Posts: 71

Re: Data Preparation

I actually have no experience with OnDemand analytics.  But I do have experience with Enterprise Miner / Text Miner and the Text Import Node.

I'm looking at one of my projects which imports Word Doc (.doc, .docx) and PDF files (.pdf).  The PDF documents are 'text format'.

You can take two approaches, both of which require being able to point to your directory of PDF documents:

1) Text Import Node

2) %tmfilter macro

In the Text Import Node, all I did was change the following properties:

Import File Directory:

Destination Directory:

Text Size: 10,000

The Import File Directory is where your PDF documents are located. The Destination Directory is where the resulting output is written (text files, not a SAS dataset): the node creates one text file for each PDF or other document. You can then proceed to attach your Text Parsing node and other nodes accordingly.

Text Size refers to the number of characters you want to keep from each document. If you choose 100 but have a PDF document with 500 characters, your observation will be cut off: only the first 100 characters will be stored in the SAS dataset.
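The cut-off behaves like storing a long string in a shorter character variable. The sketch below is not the Text Import node itself, just a plain DATA step with made-up variable names (FULL, KEPT) that mimics the 500-versus-100 example:

```sas
data _null_;
    length full $500 kept $100;
    full = repeat('x', 499);   /* REPEAT returns n+1 copies, so 500 characters */
    kept = full;               /* assignment keeps only the first 100 characters */
    full_len = lengthn(full);
    kept_len = lengthn(kept);
    put full_len= kept_len=;   /* log shows full_len=500 kept_len=100 */
run;
```

If your documents are long, a Text Size that is too small silently drops everything after the limit, so it is worth checking typical document lengths first.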

If you want to create an actual SAS dataset, use the %tmfilter macro. Once you have your SAS dataset, you can proceed to use that in your projects. Code for the macro looks like this (you can run it from a SAS Code node in EM):

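/* Library where the output SAS dataset will be stored */
libname a 'C:\EM_Projects\my_project_folder';

/* dataset= names the output dataset; dir= is the folder of documents to read */
%tmfilter(dataset=a.data_want, dir=C:\EM_Projects\my_project_folder\output, numbytes=32000);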

The SAS Help files in EM are a fantastic resource to learn about Nodes or the %tmfilter macro.

Once you have your dataset, you will then have to import or attach to this data from your EM project.
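Before attaching the dataset in EM, it can help to confirm what the macro actually produced. A minimal sketch, assuming the %tmfilter call above ran without errors and created a.data_want in the same folder; the choice of five rows is arbitrary:

```sas
/* Same library as in the %tmfilter example above */
libname a 'C:\EM_Projects\my_project_folder';

/* List the variables, their lengths and the number of observations */
proc contents data=a.data_want;
run;

/* Look at the first few documents */
proc print data=a.data_want(obs=5);
run;
```

If PROC CONTENTS reports zero observations, the directory passed to the macro probably did not contain any readable documents.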

Regular Learner
Posts: 1

Re: Data Preparation

Data can be brought in through SAS Studio.
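As a rough illustration of that route for a plain-text file, the sketch below reads a whole text file into a single observation with an ordinary DATA step. It writes its own two-line sample file to a temporary fileref so it runs as is; the fileref DOC, the dataset WORK.ONE_DOC and the sample sentences are all made up, and for a real document you would point the FILENAME statement at your uploaded .txt file instead.

```sas
/* Temporary file for the demo; replace TEMP with the path to your own .txt file */
filename doc temp;

/* Write a small sample document (invented text) */
data _null_;
    file doc;
    put 'This agreement is made on the first day of January.';
    put 'The parties agree to the terms set out below.';
run;

/* Read every line and collect them into one TEXT value per document */
data work.one_doc;
    infile doc lrecl=32767 truncover end=last;
    length text $32000 line $32767;
    retain text;
    input line $char32767.;
    /* Append only while the result still fits in TEXT;
       anything beyond 32,000 characters is dropped */
    if lengthn(text) + lengthn(line) + 1 <= 32000 then
        text = catx(' ', text, line);
    if last then output;
    keep text;
run;

proc print data=work.one_doc;
run;
```

This only works for files that are already plain text; PDFs still need to be converted first, as discussed earlier in the thread.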
