I have a set of terms (keywords, to be more precise) that belong to a category called Restaurants in my Restaurant data set, as shown below:
RESTAURANTS
starbucks
mcdonald
chipotle
taco bell
burger king
panera
aramark
seamless xxx-xxx-xxxx
papa john
pizza hut
reload
mexican
subway
grill
jimmy john
sonic drive
panda express
snack soda
sushi
burger
domino
canteen
diner
wendys
bbq
pizza
restaurant
kfc
cuisine
café
bagel
wendy
pizzeria
bistro
tavern
bakery
buffalo
snack
deli
dunkin
seamless
express payment service
john
bell
bar
chili
bar
rest
pub
taco
sonic
papa
jimmy
q07
panda
soda
I have another data set called Transaction, which contains text describing each transaction. I need to categorize every row in the Transaction data set as "Restaurant" or "Other" based on the relationship between the terms in the description and the terms I already have in the Restaurant data set. The final output should be in the format given below:
DESCRIPTION | Predicted_Category | Word Matched | Word Linked | Weightage
ADJUSTMENT-RETURNS | Other | N/A | N/A | 0%
Citibank Online Ref. #XXXXXX- CASH ADV | Other | N/A | N/A | 0%
USAA P&C PREMIUM | Other | N/A | N/A | 0%
USAA CO SPG CAXXXXXXXX COLORADO SPGSCO | Other | N/A | N/A | 0%
BREAD PARTNERS #4 MIAMI FL | Restaurant | Restaurant | BREAD | 95%
TOPSY'S KITCHEN PETALUMA CA | Restaurant | Restaurant | KITCHEN | 98%
In the sample output above, Word Matched is taken from the Restaurants data set and Word Linked is taken from the Description column of the Transaction data set. The values in the Weightage column are based on how closely the two terms are related. For example, the description containing the keyword "Kitchen" is categorized as Restaurant based on the link between the words "restaurant" and "kitchen". Can anyone guide me on how to build the diagram in SAS Enterprise Miner to obtain this result? Which nodes can I use, and how do I do the supervised learning based on the data set that I already have?
Accepted Solutions
Great. Glad it worked!
I'm not sure about the negation operator. I just tried it for the first time and it didn't work.
But you could use the topic indicator variables from the Text Topic (TT) node, together with the Transformation node (or a SAS Code node), to derive new topics with expressions like:
if textTopic_1 and not textTopic_2 then newTopic = 1;
Then select documents using the new topic indicator.
Ray
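To make the logic of that expression concrete outside Enterprise Miner, here is a minimal Python sketch. It is only an illustration of the boolean combination, not EM code: the rows, the column names `topic_soda` and `topic_scotch`, and the function `add_new_topic` are all hypothetical stand-ins for the 1/0 topic indicators that the Text Topic node exports.

```python
# Hypothetical scored rows: topic_soda and topic_scotch stand in for the
# 1/0 topic indicator columns exported by the Text Topic node.
docs = [
    {"doc_id": 1, "topic_soda": 1, "topic_scotch": 0},
    {"doc_id": 2, "topic_soda": 1, "topic_scotch": 1},
    {"doc_id": 3, "topic_soda": 0, "topic_scotch": 0},
]

def add_new_topic(rows):
    """Add new_topic = 1 where the SODA topic applies and SCOTCH does not, else 0."""
    return [
        {**r, "new_topic": int(r["topic_soda"] == 1 and r["topic_scotch"] != 1)}
        for r in rows
    ]

# Select documents using the derived indicator.
selected = [r["doc_id"] for r in add_new_topic(docs) if r["new_topic"] == 1]
print(selected)
```

Unlike the one-line SAS expression, this sketch sets the new indicator to 0 explicitly when the condition fails, so no row is left with a missing value.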
I'm not sure I follow the match you describe, since 'Restaurants' does not occur in any of your documents ('descriptions').
But take a look at the Text Topic node. It allows you to create custom topics (defined by a set of terms and weights that you assign to each term) and score your documents based on their association with each topic.
So if you extended your terms list for Restaurants to include 'bread', 'kitchen', etc., you could define a user topic and get each document's score on that topic.
A typical flow would be
1. text import
2. text parsing
3. text topic
Hope this helps,
Ray
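As a rough illustration of what matching documents against an extended term list means, here is a small Python sketch. It is not how the Text Topic node works internally; it only does exact whole-word matching. `RESTAURANT_TERMS` is a hypothetical, shortened list and `categorize` is a made-up helper name.

```python
import re

# Hypothetical, shortened term list; "bread" and "kitchen" are the kind of
# extra terms suggested above for the Restaurants list.
RESTAURANT_TERMS = ["starbucks", "taco bell", "pizza", "bread", "kitchen"]

def categorize(description, terms=RESTAURANT_TERMS):
    """Return (category, matched_term) for one transaction description."""
    text = description.lower()
    for term in terms:
        # \b keeps a short term like "bar" from matching inside "barber".
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            return "Restaurant", term
    return "Other", None

for d in ["TOPSY'S KITCHEN PETALUMA CA", "USAA P&C PREMIUM"]:
    print(d, "->", categorize(d))
```

The sketch returns the first matching term only. It has no notion of how closely two different words are related, so a description is labeled Restaurant only if one of the listed terms literally appears in it.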
OK. My source data is now a set of text documents used as the TEXT input. I used the Text Parsing node to cleanse the data, then connected the Text Parsing node to the Text Topic node. The Text Topic node has a user-defined topic called Restaurant with a set of terms in it. Now I want to compare each document from the source data against the user-defined topics in the Text Topic node. I am not able to do this: when I run the Text Topic node, it creates new topics of its own, made up of words and phrases picked out of the documents. I've attached snippets of the transformation, the user topic definition in the Text Topic node, and the result. How can I compare each document from the source to the user-defined terms in the topic?
I'm away from EM / Text Miner at the moment, but please try these things:
1. Assign a role of 'Noun' to each term, and a numeric weight as well. The weights (0 to 1) indicate how representative each term is of the topic. In a pinch you could probably set all weights to 1, which means they are all equally representative.
2. The TT node derives topics (25, I think) by default, but you can turn off that feature by setting the number to zero in Node Properties. (You should be able to combine user-defined and derived topics; this just hides the topics you aren't interested in.)
3. When the TT node finishes, look in Exported Data (under Node Properties) to see the scored documents.
Let me know how that goes, OK? I'll be able to follow up in the morning (EST).
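To show how term weights could turn into a per-document topic weight and a 1/0 topic score, here is a simplified Python sketch. It is an assumption-laden stand-in, not the TT node's actual scoring: it sums the weights of matched single-word terms and compares the sum with a cutoff. `TOPIC_TERMS`, `topic_weight`, `topic_score`, and the cutoff of 0.5 are all hypothetical.

```python
import re

# Hypothetical term weights (0 to 1) for one user-defined topic.
TOPIC_TERMS = {"pizza": 1.0, "kitchen": 0.8, "bar": 0.3}

def topic_weight(description, term_weights):
    """Sum the weights of the (single-word) topic terms found in the description."""
    words = set(re.findall(r"[a-z0-9']+", description.lower()))
    return sum(w for term, w in term_weights.items() if term in words)

def topic_score(description, term_weights, cutoff=0.5):
    """Return 1 if the document's topic weight reaches the cutoff, else 0."""
    return int(topic_weight(description, term_weights) >= cutoff)

for d in ["TOPSY'S KITCHEN PETALUMA CA", "SPORTS BAR DENVER CO"]:
    print(d, topic_weight(d, TOPIC_TERMS), topic_score(d, TOPIC_TERMS))
```

In this sketch, a topic whose terms all have weight zero can never reach the cutoff, so no document would be assigned to it.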
Thanks for the response. I can see the document weight, score, document ID and description when I browse the train set from the exported data. My concern now is with the user topics that EM picks up. I have more than 60 user-defined topics in total, but it still picks up only the last topic that I entered. I've attached snippets of my results. The attached image shows the results for all the documents that belong to the topic RESTAURANTS_YOGURT, but I need to fetch the documents related to all of the topics. The documents do contain keywords that are included as terms in the user-defined topics.
I think you are very close.
Topics can drop out if they don't apply to any documents. But I'm guessing the issue is with term weights. Please confirm that a nonzero weight (e.g. 1) is assigned to each term.
I can replicate the behavior you are seeing by assigning term weights of zero.
Ray
Thank you so very much. I removed the node from the diagram and added it again, and it works now. Some weird behavior by the tool. Is there any way to use the negation operator (~) when defining terms in the user-defined topics? For example, I need to fetch the documents that have the keyword SODA but not SCOTCH, something like SODA & ~SCOTCH. I got this format from the Rule Builder node and would like to use the same format when defining the terms for each topic in the TT node. Thank you again for the lead. I appreciate it. 🙂
Just adding a snippet of output (exported data) from the Text Topic node for illustration.
The columns are topic weight, topic score (1/0), and document ID.
Ray