Solved: Text mining on description data

mganesh10 · Posted 08-02-2016 01:58 PM

I have a set of terms (keywords, to be more precise) that belong to a category called Restaurants in my Restaurant data set, as shown below:

RESTAURANTS
 starbucks
 mcdonald
 chipotle
 taco bell
 burger king
 panera
 aramark
 seamless xxx-xxx-xxxx
 papa john
 pizza hut
 reload
 mexican
 subway
 grill
 jimmy john
 sonic drive
 panda express
 snack soda
 sushi
 burger
 domino
 canteen
 diner
 wendys
 bbq
 pizza
 restaurant
 kfc
 cuisine
 café
 bagel
 wendy
 pizzeria
 bistro
 tavern
 bakery
 buffalo
 snack
 deli
 dunkin
 seamless
 express payment service
 john
 bell
 bar
 chili
 bar
 rest
 pub
 taco
 sonic
 papa
 jimmy
 q07
 panda
 soda

I have another data set called Transaction, which has text data describing the transaction details. I need to categorize every row in the Transaction data set as "Restaurant" or "Other" based on the relationship between the terms in the description and the terms that I already have in the Restaurant data set. The final output should be in the format given below:

DESCRIPTION                                 Predicted_Category  Word Matched    Word Linked    Weightage
ADJUSTMENT-RETURNS                          Other                N/A               N/A             0%
Citibank Online Ref. #XXXXXX- CASH ADV      Other                N/A               N/A             0%
USAA P&C PREMIUM                            Other                N/A               N/A             0%
USAA CO SPG CAXXXXXXXX   COLORADO SPGSCO    Other                N/A               N/A             0%
BREAD PARTNERS #4 MIAMI FL                  Restaurant           Restaurant        BREAD           95%
TOPSY'S KITCHEN PETALUMA CA                 Restaurant           Restaurant        KITCHEN         98%

In the sample output above, Word Matched is picked up from the Restaurants data set and Word Linked is picked up from the Description column of the Transaction data set. The values in the Weightage column are based on how closely the terms are related to each other. As you can see in the sample output, the description that has the keyword "Kitchen" is categorized as Restaurant based on the linkage between the words restaurant and kitchen. Can anyone guide me on how to draw the diagram in SAS Enterprise Miner to obtain this result? Which nodes can I use, and how do I do the supervised learning based on the data set that I already have?

rayIII · Posted 08-02-2016 05:31 PM

I'm not sure I follow the match you describe, since 'Restaurants' does not occur in any of your documents ('descriptions').

But take a look at the Text Topic node. It allows you to create custom topics (defined by a set of terms and weights that you assign to each term) and score your documents based on their association with each topic.

So if you extended your terms list for Restaurants to include 'bread', 'kitchen', etc. you can define a user topic and get each document's score on that topic.
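To make the idea of scoring descriptions against a weighted term list concrete, here is a minimal sketch in plain SAS code. It is not what the Text Topic node computes internally; it swaps in simple case-insensitive substring matching and sums the weights of the terms found. The data set names (terms, transactions, scored), the column names, and the weights of 1 are all assumptions made for illustration.

```sas
/* Hypothetical user topic: terms with weights (all set to 1 here). */
data terms;
   length term $20;
   infile datalines dsd;
   input term $ weight;
   datalines;
restaurant,1
kitchen,1
bread,1
pizza,1
;
run;

/* A few descriptions taken from the sample output in the question. */
data transactions;
   length description $60;
   infile datalines truncover;
   input description $60.;
   doc_id = _n_;
   datalines;
ADJUSTMENT-RETURNS
BREAD PARTNERS #4 MIAMI FL
TOPSY'S KITCHEN PETALUMA CA
;
run;

/* Sum the weights of all terms found in each description.        */
/* Documents with no matching term get a weight of 0 and 'Other'. */
proc sql;
   create table scored as
   select t.doc_id,
          t.description,
          coalesce(sum(k.weight), 0) as topic_weight,
          case when sum(k.weight) > 0 then 'Restaurant'
               else 'Other'
          end as predicted_category
   from transactions as t
        left join terms as k
        on find(t.description, strip(k.term), 'i') > 0
   group by t.doc_id, t.description;
quit;
```

With these made-up inputs, the two descriptions containing BREAD and KITCHEN are labeled Restaurant and the first one is labeled Other. Note that plain substring matching also fires inside longer words, so short terms such as 'bar' or 'rest' would need word-boundary handling in practice.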

A typical flow would be

1. text import

2. text parsing

3. text topic

Hope this helps,

Ray

mganesh10 · Posted 08-02-2016 07:24 PM

Ok. Now I have my source data with a set of text documents as TEXT input. I've used the Text Parsing node to cleanse the data. Then I connected the Text Parsing node to the Text Topic node. The Text Topic node has a user-defined topic called Restaurant with a set of terms in it. Now I want to compare each document from the source data to the user-defined topics in the Text Topic node. I am not able to do that, because when I run the Text Topic node, it creates a new topic on its own, which is basically the words/phrases picked out from the content of the documents. I've attached snippets of the transformation, the user definition in the Text Topic node, and the result. How can I compare each document from the source to the user-defined terms in the topic?

rayIII · Posted 08-02-2016 08:03 PM

I'm away from EM / Text Miner at the moment, but please try these things:

1. Assign a role of 'Noun' to each term and a numeric weight as well. The weights (0 to 1) indicate the representativeness of each term to the topic. But in a pinch you could probably set all weights to 1, which means they are all equally representative.

2. The TT node derives topics (25, I think) by default, but you can turn off that feature by setting it to zero in Node Properties. (You should be able to combine user-defined and derived topics. This is just to hide topics you aren't interested in.)

3. When the TT node finishes, look in Exported Data (node properties) to see the scored documents.

Let me know how that goes, OK? I'll be able to follow up in the morning (EST).

mganesh10 · Posted 08-03-2016 02:00 PM

Thanks for the response. I can see the document weight, score, Document ID and Description when I browse the train set from the exported data. Now the concern is with the user topic that EM picks up. I have more than 60 user-defined topics in total, but it still picks up only the last topic that I enter. I've attached snippets of my results. The attached image shows the results for all the documents that belong to the topic RESTAURANTS_YOGURT. But I need to fetch the documents that are related to all the topics. Of course, the documents do have keywords in them that are included as terms in the user-defined topics.

rayIII · Posted 08-03-2016 02:42 PM

I think you are very close.

Topics can drop out if they don't apply to any documents. But I'm guessing the issue is with term weights. Please confirm that a nonzero weight (e.g. 1) is assigned to each term.

I can replicate the behavior you are seeing by assigning term weights of zero.

Ray

mganesh10 · Posted 08-03-2016 02:51 PM

Thank you so very much. I removed the node from the diagram and added it again, and I got it now. Some weird behavior by the tool. Is there any way that I can use the negation operator (~) to define the terms in the user-defined topics? For example, I need to fetch the documents that have the keyword SODA but not SCOTCH. Something like this: SODA & ~SCOTCH. I got this specific format from the Rule Builder node, and I would like to use the same format in defining the terms in each topic in the TT node as well. And thank you so very much for the lead. I appreciate it. 🙂

rayIII · Posted 08-03-2016 04:05 PM

Great. Glad it worked!

I'm not sure about the negation operator. I just tried now for the first time and it didn't work out.

But you could use the topic indicator variables from the TT and the Transformation node (or a SAS Code node) to derive new topics with expressions like:

if textTopic_1 and not textTopic_2 then newTopic = 1;

Then select documents using the new topic indicator.
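As a minimal sketch of the expression above, the step could look like this in a SAS Code node or a plain DATA step. The data set names (scored, flagged), the doc_id column, and the mapping of textTopic_1 to SODA and textTopic_2 to SCOTCH are assumptions for illustration; the actual indicator variable names should be taken from the Exported Data of the Text Topic node.

```sas
/* Hypothetical scored output: one row per document with 1/0 topic flags. */
/* textTopic_1 = SODA topic, textTopic_2 = SCOTCH topic (assumed names).  */
data scored;
   input doc_id textTopic_1 textTopic_2;
   datalines;
1 1 0
2 1 1
3 0 1
4 0 0
;
run;

/* Derive the new indicator: topic 1 present and topic 2 absent.        */
/* The comparison yields 1 or 0, so newTopic is never left missing.     */
data flagged;
   set scored;
   newTopic = (textTopic_1 = 1 and textTopic_2 ne 1);
run;

/* Select documents using the derived indicator. */
proc print data=flagged noobs;
   where newTopic = 1;
run;
```

With this made-up input, only document 1 is selected, since it is the only row that has the SODA flag without the SCOTCH flag. Unlike the one-line IF statement, the comparison form also sets newTopic to 0 for the other rows, which makes later filtering a little simpler.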

Ray

rayIII · Posted 08-02-2016 05:37 PM

Just adding a snippet of output (exported data) from the Text Topic node for illustration.

The columns are topic weight, topic score (1/0), and document ID.

Ray
