
Tip: How to Create Models with Text Data Using SAS® Enterprise Miner™

Started ‎05-30-2014 by
Modified ‎10-06-2015 by

Do you have data with a textual component that you want to model? In this tip, we look at how to build a predictive model from records that include text, using nodes from the Text Mining tab in Enterprise Miner.

 

Example

 

In this example, we use the SAMPSIO.NEWS data set (stored in the sample library). This data set contains 600 observations, each of which is a news article, and the target we are trying to predict is the category of the news article. We should reject the binary variables graphics, hockey, and medical, because they are just indicators of our target and would leak the answer into the model if left as inputs.

 

raw_data.png

 

The full diagram flow should look similar to the one below, though you are free to choose whichever modeling nodes you are most comfortable using. In this tip, we will discuss the first few nodes of the flow, and we will not address any general predictive modeling practices.

 

flow.png

 

To process your text data, you must first use the Text Parsing and Text Filter nodes, as seen in the diagram flow. The Text Parsing Node takes the raw text from the data source and can handle different languages and identify different parts of speech. The Text Filter Node must immediately follow the Text Parsing Node.

 

The Text Filter Node applies filters to your text data. You can define your own dictionary, term weighting, frequency weighting, and term filters, or you can use the defaults provided in Enterprise Miner. The Text Filter Node creates a new Transaction Data Set that details which observations contain which words. This is an example of sparse data (see Tip: Working with Sparse Data in SAS). A portion of the example data set is below.

 

transaction_data.png
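
To make the idea of a sparse transaction data set concrete, here is a minimal Python sketch. It is not what the Text Filter Node runs internally; the toy documents, the whitespace tokenizer, and the function name `to_transactions` are all assumptions made for illustration. The point is only that each row records one term that actually occurs in one document, so absent terms take up no space.

```python
from collections import Counter

# Hypothetical toy documents; these are not rows from SAMPSIO.NEWS.
documents = {
    1: "the program reads the file",
    2: "the team won the hockey game",
}

def to_transactions(docs):
    """Return (document_id, term, count) rows, one per term that occurs."""
    rows = []
    for doc_id, text in sorted(docs.items()):
        # Naive whitespace tokenizer (an assumption; real parsing is richer).
        counts = Counter(text.lower().split())
        for term, count in sorted(counts.items()):
            rows.append((doc_id, term, count))
    return rows

transactions = to_transactions(documents)
for row in transactions:
    print(row)
```

A dense layout would need one column per distinct term for every document. The transaction layout stores only the term and document pairs that exist.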

 

The Text Topic Node follows the Text Filter Node in the flow. The Text Topic Node uses the transaction data created by the Text Filter Node and creates “topics,” which are groups of words that are automatically determined to be related. The results of the Text Topic Node include a Topics table that summarizes each topic. Each topic is labeled with only a few keywords, but the Topics table also reports the total number of terms the topic contains. In the picture below you can see that Topic 3 is characterized by the keywords “program, file, version, keyboard, software,” but contains a total of 210 different words.

 

text_topic.png
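
One way to picture a topic is as a set of terms with weights, where a document's score for the topic grows with how often those terms appear. The Python sketch below illustrates that idea only. The term weights, the sample counts, and the function `topic_score` are hypothetical, and the node's actual weighting scheme is not reproduced here.

```python
# Hypothetical term weights for a single topic. The real weights are
# computed by the Text Topic Node and are not shown in this tip.
topic_weights = {"program": 0.6, "file": 0.5, "software": 0.4}

def topic_score(term_counts, weights):
    """Sum count * weight over the topic terms that occur in the document."""
    return sum(
        count * weights[term]
        for term, count in term_counts.items()
        if term in weights
    )

# Term counts for two hypothetical documents.
software_doc = {"program": 2, "file": 1, "the": 4}
hockey_doc = {"hockey": 3, "team": 1, "the": 2}

print(topic_score(software_doc, topic_weights))
print(topic_score(hockey_doc, topic_weights))
```

A document that shares no terms with the topic scores zero, which is why most topic weights for any single document are small.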

 

The Text Topic Node also adds new variables to the original training data. For each topic, every observation is assigned a weight that measures how strongly the topic relates to that observation. You can see in the picture of the exported training table below that Document 1 has been assigned a value of 0.0070 for the topic “ca, laurentian, cs, maynard, ramsey.” In addition to these interval variables, the Text Topic Node creates binary variables that indicate only whether a topic is represented in the observation. However, for many standard predictive modeling tasks you will want to use the interval variables.

 

topic_data.png
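
The relationship between the interval and binary variables can be sketched in Python as a simple threshold. This is an illustration under stated assumptions, not the node's own rule: the value 0.0070 comes from the example above, while the second weight, the cutoff of 0.05, and the function `to_binary_flags` are hypothetical.

```python
# Topic weights for one document. 0.0070 is the value quoted for
# Document 1 in this tip; the second weight is hypothetical.
doc_weights = {
    "ca, laurentian, cs, maynard, ramsey": 0.0070,
    "program, file, version, keyboard, software": 0.31,
}

# Hypothetical cutoff; the node determines its own cutoff per topic.
CUTOFF = 0.05

def to_binary_flags(weights, cutoff):
    """Map each interval topic weight to 1 if it meets the cutoff, else 0."""
    return {topic: int(weight >= cutoff) for topic, weight in weights.items()}

flags = to_binary_flags(doc_weights, CUTOFF)
for topic, flag in flags.items():
    print(topic, flag)
```

The binary flag discards how strongly a topic applies, which is one reason the interval variables are usually the more informative inputs.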

 

You can also look at the Level and Role of these new variables by looking at the properties of this exported table.

 

variables.png

 

The new variables have a Level of Interval and a Role of Input. This allows us to use standard predictive modeling nodes to create a model for predicting newsgroup, the target. By creating the text topics, you extract the useful textual information and represent it numerically so that the standard predictive modeling nodes can use it.

 

Now you’re ready to apply your predictive modeling knowledge to the exported training table from the Text Topic Node.

 

Conclusion

 

You are not limited to observations that contain only text. The steps in this tip also work for data that combines textual data with other types of inputs. You can download the XML attached to this tip, and import it through the File menu in Enterprise Miner (File->Import diagram from XML).

 

Happy Modeling!

Version history
Last update:
‎10-06-2015 02:26 PM