How to interpret the scoring data result of Text Rule Builder

Occasional Contributor

Hello,

I am new to SAS Text Miner. I built a categorization model (8 category values for the target variable) in the Text Rule Builder node with training data, and now I am trying to score new data. I exported the scored data and tried to understand each variable produced by the model. Here is my scoring output data.

Document | Predicted: Target=TV | Predicted: Target=Food | Predicted: Target=Biology | Predicted: Target=Politics | Predicted: Target=Sport | Predicted: Target=Computer | Predicted: Target=Game | Predicted: Target=Finance | Why Into: L2 | Into: Target | Probability of Classification
Text data here 1… | 1.02E-05 | 1.20E-05 | 3.68E-04 | 0.004349291 | 1.06E-05 | 0.025836599 | 0.969411505 | 1.41E-06 | 32 | Biology | 0.969411505
Text data here 2… | 0.00160021 | 3.44E-04 | 0.053080483 | 0.019703281 | 5.09E-04 | 0.697857485 | 0.225630023 | 0.001275485 | 3 | Politics | 0.697857485
Text data here 3… | 0.035478096 | 0.034664386 | 0.023956528 | 0.07893037 | 0.066393264 | 0.174361862 | 0.332943365 | 0.253272129 | 175 | Game | 0.332943365
Text data here 4… | 5.98E-06 | 7.80E-06 | 0.011925956 | 8.08E-05 | 2.14E-05 | 3.44E-04 | 0.987613772 | 2.86E-07 | 31 | Biology | 0.987613772
Text data here 5… | 0.0103223 | 0.010505784 | 0.015682763 | 0.046567258 | 0.032731622 | 0.129644172 | 0.101578312 | 0.652967789 | . | Finance | 0.652967789
Text data here 6… | 0.012279932 | 0.019225879 | 0.00598867 | 0.104194986 | 0.033799801 | 0.065556881 | 0.621594848 | 0.137359003 | 149 | Game | 0.621594848

At first I thought that, since “Probability of Classification” is the largest of the 8 “Predicted: Target=” variables (which, as I understand it, are the posterior probabilities of each category being assigned), the document should be assigned to the target category with the largest value, but obviously I am wrong. For example, the first observation has a “Predicted: Target=Game” value of 0.9694, which is the largest number, but this document was assigned to Biology. So how should I interpret the “Predicted: Target=” values? How can I get a probability or membership-like number for each document that shows how strongly it belongs to each of these 8 categories?

Thanks,

Eric


Accepted Solution
10-16-2013 03:02 PM
SAS Employee

Re: How to interpret the scoring data result of Text Rule Builder

Hi Eric,

Sometimes translating numerical output into words is a challenge, so let me try my best to offer as clear an explanation as I can at the moment.  You have posted a good example - thank you.

First, let me note some facts about the Text Rule Builder node that are helpful to keep in mind.

(1) This is a predictive model that creates Boolean rules and assesses the predictive power of each rule as well as the overall classification rates.

(2) Every rule has a posterior probability associated with each Target Level.  So Rule #1 (e.g. "sports" and not "channel") has a [potentially different] posterior probability associated with Target=TV, Target=Food, Target=Sport, etc., from your example.  These posterior probabilities are based on the training data used to train the predictive model.

(3) The classification is based on the rule assessments, which are evaluated with a binary response.  That is, each rule results in a 'True' or 'False' outcome. Or pass/fail or 0/1, if you prefer.

Your example shows columns with:

(a)  your input variables.  In this case, that is only one column:  "Document".

(b)  n "Predicted: Target=target level" columns, each of which provides a posterior probability that the record belongs to that target level.  In this case, there are 8 target levels: TV, Food, ..., Game, Finance.   The probabilities in these columns are the result of a naive Bayes algorithm which uses the values from (2) above.

(c)  a "Why" column which indicates the rule number that determined the 'assigned' Target value.  "32" stands for the 32nd rule, which can be found in the Text Rule Builder Results (without number label) or in the textrule_conj_rule table under the field conj_id.

(d)  an "Into: Target" column, which specifies the assigned Target Value for each record.

(e)  a "Probability of Classification" column with the maximum of the n "Predicted: Target=" values, that is, the maximum of the values in the columns from (b) above.
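To make columns (b) and (e) concrete, here is a minimal Python sketch using two rows copied from the table in the question. The names `LEVELS`, `rows`, and `max_posterior` are illustrative only and are not SAS variable names; the sketch simply recomputes column (e) as the largest of the eight values in (b).

```python
# Target levels, in the order the "Predicted: Target=" columns appear.
LEVELS = ["TV", "Food", "Biology", "Politics", "Sport", "Computer", "Game", "Finance"]

# Each entry: (eight posterior probabilities, Why, Into: Target,
#              Probability of Classification), copied from the posted output.
rows = {
    "Text data here 1": (
        [1.02e-05, 1.20e-05, 3.68e-04, 0.004349291,
         1.06e-05, 0.025836599, 0.969411505, 1.41e-06],
        32, "Biology", 0.969411505,
    ),
    "Text data here 3": (
        [0.035478096, 0.034664386, 0.023956528, 0.07893037,
         0.066393264, 0.174361862, 0.332943365, 0.253272129],
        175, "Game", 0.332943365,
    ),
}


def max_posterior(probs):
    """Return (level, probability) for the largest "Predicted: Target=" value."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return LEVELS[best], probs[best]


for doc, (probs, why, into, p_class) in rows.items():
    level, p = max_posterior(probs)
    print(f"{doc}: max posterior {p} ({level}); rule {why} assigned {into}")
```

For the first row the largest posterior belongs to Game while the assigned level is Biology; for the third row both agree on Game. In both rows the maximum equals the "Probability of Classification" value.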

Now I've finally set the stage for a response to your questions.

As your intuition tells you, most of the time the target level with the highest probability in the "Predicted: Target=" columns corresponds to the Target level that the Text Rule Builder node assigns.  In other words, most of the time the Text Rule Builder will assign the category with the highest probability.  Usually this is close to a 1:1 pattern - in my experience with good-sized training and 'score' sets, the agreement is 90+%.

So why would the assigned Target level ever not correspond to the target value with the highest posterior probability?

Answer: It is all about the rules!  The predictive model will evaluate each of these Boolean rules and assign the category/target level of the triggering rule.  For example: if the text contains "cell" and not "phone" and "reproduction", then assign Target level = Biology.
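The rule logic above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the node's implementation: the `Rule` class, the rule number 1, the whitespace tokenization, and the "first rule that fires wins" ordering are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: int          # hypothetical rule number (compare the "Why" column)
    required: frozenset   # terms that must appear in the document
    forbidden: frozenset  # terms that must not appear in the document
    target: str           # target level assigned when the rule fires

    def fires(self, tokens):
        """True when every required term is present and no forbidden term is."""
        return self.required <= tokens and not (self.forbidden & tokens)


def classify(text, rules):
    """Return (target, rule_id) of the first rule that fires, else (None, None).

    Assumption: rules are tried in order and the first match decides.
    """
    tokens = set(text.lower().split())  # naive tokenization, for illustration
    for rule in rules:
        if rule.fires(tokens):
            return rule.target, rule.rule_id
    return None, None


# "cell" and not "phone" and "reproduction" -> Biology
rules = [Rule(1, frozenset({"cell", "reproduction"}), frozenset({"phone"}), "Biology")]

print(classify("Cell reproduction in yeast", rules))
print(classify("cell phone reproduction of sound", rules))
```

The outcome of each rule is strictly True or False, so the assignment never looks at the posterior probabilities; that is why the two can disagree.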

The probabilities are not probabilities of the category that the document would be assigned.  Since we are talking about Boolean rules, those values would be a matrix of 0s and 1s.  Instead, the probabilities are a prediction of where the document may belong, based on the posterior probabilities of the rules.  They are still a good predictor of the assigned category, so, to answer your last question, they can be used as such.

In practice, I think it is fair to say that those records which are assigned to a category other than the one with the highest probability are good candidates for review.  They may indicate reasons to augment the training data set and rerun the node, or they may represent documents that cross two categories.  For example, an article about game theory applied to ecosystems might explain your first example.  Or something about cyber espionage may cross "Computer" and "Politics", as in your second.
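One way to pull out those review candidates from exported scoring data is sketched below in Python. The rows are copied from the table in the question; the names `scored` and `needs_review` are made up for the example, and it assumes the eight probabilities are listed in the same order as the "Predicted: Target=" columns.

```python
LEVELS = ["TV", "Food", "Biology", "Politics", "Sport", "Computer", "Game", "Finance"]

# (document, eight "Predicted: Target=" values, "Into: Target")
scored = [
    ("Text data here 1",
     [1.02e-05, 1.20e-05, 3.68e-04, 0.004349291,
      1.06e-05, 0.025836599, 0.969411505, 1.41e-06],
     "Biology"),
    ("Text data here 4",
     [5.98e-06, 7.80e-06, 0.011925956, 8.08e-05,
      2.14e-05, 3.44e-04, 0.987613772, 2.86e-07],
     "Biology"),
    ("Text data here 6",
     [0.012279932, 0.019225879, 0.00598867, 0.104194986,
      0.033799801, 0.065556881, 0.621594848, 0.137359003],
     "Game"),
]


def needs_review(probs, into):
    """True when the rule-assigned level differs from the highest-probability level."""
    top_level = LEVELS[probs.index(max(probs))]
    return top_level != into


flagged = [doc for doc, probs, into in scored if needs_review(probs, into)]
print(flagged)
```

Here documents 1 and 4 are flagged (highest probability Game, assigned Biology), while document 6 is not.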

Hopefully the clarification - that the probabilities represent a prediction of category based upon posterior probabilities where the assignment is based upon True/False rules - is helpful.

Thanks,

Justin

Occasional Contributor

Re: How to interpret the scoring data result of Text Rule Builder

Hi Justin,

Thank you so much for this excellent explanation, which I couldn't even find in the SAS EM help documentation! Your explanation is logically arranged and easy to understand! I got most of your points, but I still need to dig into it a little bit.

If I understand you correctly, this output contains two ways of predicting the target categories: one is based on the posterior probabilities given by Naive Bayes, the other is based on the rules created by the Text Rule Builder node. And the "Into: Target" column holds the result of the rule-based prediction. Please correct me if I am wrong.

I have created Naive Bayes classifiers in Python NLTK for machine learning before, so I have no problem with Naive Bayes or prior and posterior probabilities. My follow-up questions are:

  1. How does the SAS Text Rule Builder node come up with those rules? Aren't these Boolean rules derived from the Naive Bayes likelihood P(word|category)?
  2. In the Results > Output window of the Text Rule Builder, it gives the Target Percentage (Precision) and Outcome Percentage (Recall) for the training and validation datasets, and I mainly use these numbers to evaluate the classification result. On which algorithm (Naive Bayes probability prediction or Boolean text rules) are these numbers based?

Your replies are appreciated! Thanks,

Eric

SAS Employee

Re: How to interpret the scoring data result of Text Rule Builder

Hi Eric,

You have it.

In response to your first question, the Text Rule Builder node uses a sequential approach of examining terms/combinations for those with the highest estimated precision, and iteratively looks at smaller and smaller subsets of the data after removing matches for earlier rules.
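The sequential idea can be illustrated with a deliberately simplified Python sketch. It is not the node's actual algorithm: it only considers single terms (no combinations or negations), the toy documents are invented, and the `min_precision` threshold is a hypothetical stopping criterion added for the example.

```python
def learn_rules(docs, target, min_precision=0.8):
    """Greedy sequential covering over single terms.

    docs: list of (set_of_terms, label). Returns an ordered list of
    (term, estimated_precision) pairs for the given target level.
    """
    remaining = list(docs)
    rules = []
    while remaining:
        # For each term, count documents containing it and how many are on target.
        stats = {}
        for terms, label in remaining:
            for term in terms:
                hits, total = stats.get(term, (0, 0))
                stats[term] = (hits + (label == target), total + 1)
        # Pick the term with the highest estimated precision; break ties by
        # number of hits, then alphabetically so the result is deterministic.
        best = max(stats, key=lambda t: (stats[t][0] / stats[t][1], stats[t][0], t))
        hits, total = stats[best]
        if hits == 0 or hits / total < min_precision:
            break
        rules.append((best, hits / total))
        # Later rules only see documents not matched by earlier rules.
        remaining = [(terms, label) for terms, label in remaining if best not in terms]
    return rules


# Invented toy corpus, for illustration only.
docs = [
    ({"cell", "reproduction", "dna"}, "Biology"),
    ({"cell", "membrane"}, "Biology"),
    ({"cell", "phone", "battery"}, "Computer"),
    ({"dna", "genome"}, "Biology"),
    ({"phone", "app"}, "Computer"),
    ({"election", "vote"}, "Politics"),
]

print(learn_rules(docs, "Biology"))
```

On this toy corpus the first rule found is "dna" (it matches two Biology documents and nothing else); after those documents are removed, "membrane" is found on the smaller remaining subset, and then no term reaches the threshold.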

The Precision and Recall are based on the Boolean rules (all rules up to the line that you are looking at in particular) for their calculation.  It is easiest to tell with the first rules since the arithmetic is easier to check quickly.
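As a rough check of that arithmetic, the following Python sketch computes precision and recall for "all rules up to line k". The single-term rules and the toy documents are invented for illustration, and `cumulative_precision_recall` is a hypothetical helper, not a SAS function.

```python
def cumulative_precision_recall(rule_terms, docs, target):
    """For each k, score the first k rules together.

    rule_terms: ordered list of single-term rules for `target`.
    docs: list of (set_of_terms, label).
    Returns a list of (rule_term, precision, recall), one entry per line.
    """
    n_target = sum(label == target for _, label in docs)
    results = []
    for k in range(1, len(rule_terms) + 1):
        active = set(rule_terms[:k])
        # A document is matched if any of the first k rules fires on it.
        matched = [label for terms, label in docs if terms & active]
        true_pos = sum(label == target for label in matched)
        precision = true_pos / len(matched) if matched else 0.0
        recall = true_pos / n_target if n_target else 0.0
        results.append((rule_terms[k - 1], precision, recall))
    return results


# Invented toy corpus, for illustration only.
docs = [
    ({"cell", "reproduction", "dna"}, "Biology"),
    ({"cell", "membrane"}, "Biology"),
    ({"cell", "phone", "battery"}, "Computer"),
    ({"dna", "genome"}, "Biology"),
    ({"phone", "app"}, "Computer"),
    ({"election", "vote"}, "Politics"),
]

for term, precision, recall in cumulative_precision_recall(["dna", "cell"], docs, "Biology"):
    print(f"up to rule '{term}': precision={precision:.2f} recall={recall:.2f}")
```

With only the first rule, 2 of 2 matched documents are Biology (precision 1.00) out of 3 Biology documents in total (recall 0.67). Adding the second rule matches 4 documents, 3 of them Biology, so precision drops to 0.75 while recall rises to 1.00.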

Thanks,

Justin

Occasional Contributor

Re: How to interpret the scoring data result of Text Rule Builder

Thank you Justin. You are the master of it!
