- How to interpret the scoring data result of Text R...

10-10-2013 06:36 PM

Hello,

I am new to SAS Text Miner. I built a categorization model (8 category values for the target variable) in the Text Rule Builder node with training data, and now I am trying to score new data. I exported the scored data and tried to understand each variable that the model produced. Here is my scoring output data.

| Document | Predicted: Target=TV | Predicted: Target=Food | Predicted: Target=Biology | Predicted: Target=Politics | Predicted: Target=Sport | Predicted: Target=Computer | Predicted: Target=Game | Predicted: Target=Finance | Why Into: L2 | Into: Target | Probability of Classification |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Text data here 1… | 1.02E-05 | 1.20E-05 | 3.68E-04 | 0.004349291 | 1.06E-05 | 0.025836599 | 0.969411505 | 1.41E-06 | 32 | Biology | 0.969411505 |
| Text data here 2… | 0.00160021 | 3.44E-04 | 0.053080483 | 0.019703281 | 5.09E-04 | 0.697857485 | 0.225630023 | 0.00127548 | 53 | Politics | 0.697857485 |
| Text data here 3… | 0.035478096 | 0.034664386 | 0.023956528 | 0.07893037 | 0.066393264 | 0.174361862 | 0.332943365 | 0.253272129 | 175 | Game | 0.332943365 |
| Text data here 4… | 5.98E-06 | 7.80E-06 | 0.011925956 | 8.08E-05 | 2.14E-05 | 3.44E-04 | 0.987613772 | 2.86E-07 | 31 | Biology | 0.987613772 |
| Text data here 5… | 0.0103223 | 0.010505784 | 0.015682763 | 0.046567258 | 0.032731622 | 0.129644172 | 0.101578312 | 0.652967789 | . | Finance | 0.652967789 |
| Text data here 6… | 0.012279932 | 0.019225879 | 0.00598867 | 0.104194986 | 0.033799801 | 0.065556881 | 0.621594848 | 0.137359003 | 149 | Game | 0.621594848 |

At first I thought that, since “Probability of Classification” is the largest of the 8 “Predicted: Target=” values (which, to my understanding, are the posterior probabilities of each category being assigned), the document should be assigned to the target category with the largest value, but obviously I am wrong. For example, the first observation has a “**Predicted: Target=Game**” value of 0.9694, which is the largest number, yet this document was assigned to Biology. So how should I interpret those “Predicted: Target=” numbers? How can I get a probability or membership-like number for each document that shows how strongly it belongs to each of these 8 categories?

Thanks,

Eric

Accepted Solutions

Solution

Posted in reply to EricWoo

10-16-2013 03:02 PM

Hi Eric,

Sometimes translating numerical output into words is a challenge, so let me try my best to offer as clear an explanation as I can at the moment. You have posted a good example - thank you.

First, let me note some facts about the Text Rule Builder node that are helpful to keep in mind.

(1) This is a predictive model that creates Boolean rules and assesses the predictive power of each rule as well as the overall classification rates.

(2) Every rule has a posterior probability associated with each Target Level. So Rule #1 (e.g. "sports" and not "channel") has a [potentially different] posterior probability associated with Target=TV, Target=Food, Target=Sport, etc., from your example. These posterior probabilities are based on the training data that was used to train the predictive model.

(3) The classification is based on the rule-assessments, which are evaluated with a binary response. That is, each rule resulted in a 'True' or 'False' outcome. Or pass/fail or 0/1, if you prefer.

Your example shows the following columns:

(a) your input variables. In this case, that is only one column: "Document".

(b) *n* "Predicted: Target=*target level*" columns, each of which provides a posterior probability that the record belongs to that target level. In this case, there are 8 target levels: TV, Food, ..., Game, Finance. The probabilities in these columns are the result of a naive Bayes algorithm which uses the values from (2) above.

(c) a "Why" column which indicates the rule number that determined the 'assigned' Target value. "32" stands for the 32nd rule, which can be found in the Text Rule Builder Results (without number label) or in the *textrule_conj_rule* table under the field *conj_id.*

(d) an "Into: Target" column, which specifies the assigned Target Value for each record.

(e) a column with the maximum value of the *n* "Predicted: Target=" values, or the maximum of the values in the columns from (b) above.

Now I've finally set the stage for a response to your questions.

As your intuition tells you, most of the time, the target level with the highest probability in the "Predicted: Target=" columns corresponds with the Target level that the Text Rule Builder node assigns. In other words, most of the time, the Text Rule Builder will assign the value that corresponds to the category with the "highest probability in which it belongs". *Usually, this is close to a 1:1 pattern - in my experience, with good-sized training and 'score' sets, the agreement has always been 90+%.*

So why would the assigned Target level ever *not* correspond to the target value with the highest posterior probability?

Answer: It is all about the rules! The predictive model will evaluate each of these Boolean rules and assign the category/target level for the triggering rule. For example: *If the text contains "cell" and not "phone" and "reproduction", then assign the Target level = Biology*.
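To make the rule-based assignment concrete, here is a minimal Python sketch. It is not the Text Rule Builder's actual implementation: the rule list, the whitespace tokenization, and the "first rule that fires wins" ordering are all assumptions for illustration, and `assign_by_rules` is a hypothetical helper name.

```python
# Hypothetical rules: (rule_id, required terms, forbidden terms, target level).
# Rule 1 mirrors the example above: "cell" and "reproduction" and not "phone".
RULES = [
    (1, {"cell", "reproduction"}, {"phone"}, "Biology"),
    (2, {"election"}, set(), "Politics"),
]


def assign_by_rules(text, rules, default=None):
    """Return (rule_id, target) for the first rule that fires.

    Each rule is a pure True/False test: every required term must be
    present and no forbidden term may be present. If no rule fires,
    return (None, default).
    """
    tokens = set(text.lower().split())  # assumption: naive whitespace tokens
    for rule_id, required, forbidden, target in rules:
        if required <= tokens and not (forbidden & tokens):
            return rule_id, target
    return None, default


fired = assign_by_rules("Cell reproduction in yeast", RULES)
not_fired = assign_by_rules("Cell phone reproduction quality", RULES)
print(fired)
print(not_fired)
```

The returned rule id plays the role of the "Why" column, and the returned target plays the role of "Into: Target". No probability is involved anywhere in this step.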

The probabilities are not probabilities of the category that the document would be *assigned*. Since we are talking about Boolean rules, those values would be a matrix of 0s and 1s. Instead, the probabilities are a prediction of where the document *may belong*, based on the posterior probabilities of the rules. This is still a good predictor of the *assigned* category, so, to answer your last question, it can be used as such.

In practice, I think it is fair to say that those records which are assigned to a category other than the one with the highest probability are good candidates for review. They may indicate reasons to augment the training data set and rerun the node, or they may represent documents that cross two categories. For example, an article about game theory applied to ecosystems might explain your first example. Or something about cyber espionage may cross "Computer" and "Politics", as in your second.
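One way to pull out those review candidates from an exported score table is sketched below in plain Python. The three rows are copied from the scoring output posted in the question; the helper name `review_candidates` and the tuple layout are assumptions for illustration, not anything SAS produces.

```python
LEVELS = ["TV", "Food", "Biology", "Politics", "Sport", "Computer", "Game", "Finance"]

# (document, "Predicted: Target=" values in LEVELS order, "Into: Target"),
# copied from the first three rows of the posted scoring output.
ROWS = [
    ("Text data here 1",
     [1.02e-05, 1.20e-05, 3.68e-04, 0.004349291, 1.06e-05, 0.025836599, 0.969411505, 1.41e-06],
     "Biology"),
    ("Text data here 2",
     [0.00160021, 3.44e-04, 0.053080483, 0.019703281, 5.09e-04, 0.697857485, 0.225630023, 0.00127548],
     "Politics"),
    ("Text data here 3",
     [0.035478096, 0.034664386, 0.023956528, 0.07893037, 0.066393264, 0.174361862, 0.332943365, 0.253272129],
     "Game"),
]


def review_candidates(rows, levels):
    """Return rows whose rule-assigned level differs from the most probable level."""
    flagged = []
    for doc, probs, assigned in rows:
        best_index = max(range(len(levels)), key=probs.__getitem__)
        most_probable = levels[best_index]
        if most_probable != assigned:
            flagged.append((doc, assigned, most_probable, probs[best_index]))
    return flagged


flagged = review_candidates(ROWS, LEVELS)
for doc, assigned, most_probable, prob in flagged:
    print(f"{doc}: assigned {assigned}, most probable {most_probable} ({prob:.4f})")
```

On these rows, the first two documents are flagged and the third is not, because its assigned level (Game) is also its most probable one.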

Hopefully the clarification - that the probabilities represent a prediction of category based upon posterior probabilities where the assignment is based upon True/False rules - is helpful.

Thanks,

Justin

All Replies


Posted in reply to JustinPlumley

10-18-2013 04:47 PM

Hi Justin,

Thank you so much for this excellent explanation, which I couldn't even find in the SAS EM help documentation! Your explanation is logically arranged and easy to understand! I got most of your points, but I still need to dig into it a little bit.

If I understand you correctly, this output contains predictions from two algorithms: one is based on the posterior probabilities given by Naive Bayes, and the other is based on the rules created by the Text Rule Builder node. The "Into: Target" column represents the result of the rule-based prediction. Please correct me if I am wrong.

I have experience creating a Naive Bayes classifier in Python NLTK for machine learning, so I have no problem with Naive Bayes or with prior and posterior probabilities. My follow-up questions are:

- How does the SAS Text Rule Builder node come up with those rules? Aren't these Boolean rules derived from the Naive Bayes likelihood P(word|category)?
- In the Results > Output window of the Text Rule Builder, it gives the Target Percentage (Precision) and Outcome Percentage (Recall) for the training and validation datasets, and I mainly use these numbers to evaluate the classification result. So which algorithm (Naive Bayes probability prediction or Boolean text rules) are these numbers based on?

Your replies are appreciated! Thanks,

Eric

Posted in reply to EricWoo

10-31-2013 05:10 PM

Hi Eric,

You have it.

In response to your first question, the Text Rule Builder node uses a sequential approach of examining terms/combinations for those with the highest estimated precision, and iteratively looks at smaller and smaller subsets of the data after removing matches for earlier rules.
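For readers who think in code, the general idea of such a sequential approach can be sketched in Python as below. This is only an illustration of greedy "pick the most precise term, remove what it matches, repeat" rule building, not the node's actual algorithm: it uses single-term rules only, and the smoothed precision estimate `(tp + 1) / (n + 2)`, the `min_precision` cutoff, and the toy documents are all assumptions.

```python
# Toy training data: (set of terms in the document, target level). Hypothetical.
DOCS = [
    ({"cell", "reproduction", "dna"}, "Biology"),
    ({"cell", "membrane"}, "Biology"),
    ({"cell", "phone", "battery"}, "Computer"),
    ({"dna", "gene"}, "Biology"),
    ({"phone", "screen"}, "Computer"),
]


def learn_rules(docs, target, max_rules=5, min_precision=0.6):
    """Greedily learn single-term rules for one target level."""
    remaining = list(docs)
    rules = []
    while len(rules) < max_rules and any(label == target for _, label in remaining):
        # Count, for every term, how many remaining documents it matches
        # and how many of those belong to the target level.
        stats = {}
        for tokens, label in remaining:
            for term in tokens:
                counts = stats.setdefault(term, [0, 0])
                counts[0] += 1
                counts[1] += int(label == target)
        if not stats:
            break

        def estimated_precision(term):
            n, tp = stats[term]
            return (tp + 1) / (n + 2)  # assumption: simple smoothing

        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(stats), key=estimated_precision)
        if estimated_precision(best) < min_precision:
            break
        rules.append(best)
        # Later rules only see documents that no earlier rule matched.
        remaining = [(tokens, label) for tokens, label in remaining if best not in tokens]
    return rules


biology_rules = learn_rules(DOCS, "Biology")
print(biology_rules)
```

Note that "cell" never becomes a rule here even though it appears in two Biology documents: once the "dna" rule has removed its matches, "cell" is no longer precise enough on what remains.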

The Precision and Recall are calculated from the Boolean rules (all rules up to the line that you are looking at). This is easiest to verify with the first rules, since the arithmetic is quicker to check.
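A small Python sketch of that cumulative arithmetic follows, under the assumption that a document counts as predicted for the target when any of the rules up to the current line matches it. The toy documents, the single-term rules, and the function name `cumulative_precision_recall` are hypothetical.

```python
# Toy labelled data: (set of terms in the document, target level). Hypothetical.
DOCS = [
    ({"cell", "reproduction", "dna"}, "Biology"),
    ({"cell", "membrane"}, "Biology"),
    ({"cell", "phone", "battery"}, "Computer"),
    ({"dna", "gene"}, "Biology"),
    ({"phone", "screen"}, "Computer"),
]


def cumulative_precision_recall(docs, rules, target):
    """Return (k, precision, recall) using only the first k rules, for each k."""
    total_target = sum(label == target for _, label in docs)
    results = []
    for k in range(1, len(rules) + 1):
        active = rules[:k]
        # Labels of the documents matched by at least one active rule.
        matched = [label for tokens, label in docs
                   if any(term in tokens for term in active)]
        true_positives = sum(label == target for label in matched)
        precision = true_positives / len(matched) if matched else 0.0
        recall = true_positives / total_target if total_target else 0.0
        results.append((k, precision, recall))
    return results


results = cumulative_precision_recall(DOCS, ["dna", "cell"], "Biology")
for k, precision, recall in results:
    print(f"rules 1..{k}: precision={precision:.2f} recall={recall:.2f}")
```

With the first rule alone, 2 documents match and both are Biology, so precision is 2/2 and recall is 2/3. Adding the second rule brings in 2 more documents, one of them wrong, so precision drops to 3/4 while recall rises to 3/3.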

Thanks,

Justin

Posted in reply to JustinPlumley

11-01-2013 12:47 PM

Thank you Justin. You are the master of it!