Solved: Re: How to interpret the scoring data result of Text Rule Builder

EricWoo · Posted 10-10-2013 06:36 PM

Hello,

I am new to SAS Text Miner. I built a categorization model (8 category values for the target variable) in the Text Rule Builder node with training data, and now I am trying to score new data. I exported the scoring data and tried to understand each variable produced by the model. Here is my scoring output data.

Document	Predicted: Target=TV	Predicted: Target=Food	Predicted: Target=Biology	Predicted: Target=Politics	Predicted: Target=Sport	Predicted: Target=Computer	Predicted: Target=Game	Predicted: Target=Finance	Why Into: L2	Into: Target	Probability of Classification
Text data here 1…	1.02E-05	1.20E-05	3.68E-04	0.004349291	1.06E-05	0.025836599	0.969411505	1.41E-06	32	Biology	0.969411505
Text data here 2…	0.00160021	3.44E-04	0.053080483	0.019703281	5.09E-04	0.697857485	0.225630023	0.00127548	53	Politics	0.697857485
Text data here 3…	0.035478096	0.034664386	0.023956528	0.07893037	0.066393264	0.174361862	0.332943365	0.253272129	175	Game	0.332943365
Text data here 4…	5.98E-06	7.80E-06	0.011925956	8.08E-05	2.14E-05	3.44E-04	0.987613772	2.86E-07	31	Biology	0.987613772
Text data here 5…	0.0103223	0.010505784	0.015682763	0.046567258	0.032731622	0.129644172	0.101578312	0.652967789	.	Finance	0.652967789
Text data here 6…	0.012279932	0.019225879	0.00598867	0.104194986	0.033799801	0.065556881	0.621594848	0.137359003	149	Game	0.621594848

At first I thought that, since “Probability of Classification” is the largest of the 8 “Predicted: Target=” variables (which, to my understanding, are the posterior probabilities of each category being assigned), the document should be assigned to the target category with the largest value, but obviously I am wrong. For example, the first observation has a “Predicted: Target=Game” value of 0.9694, which is the largest number, but this document was assigned to Biology. So how should I interpret those “Predicted: Target=” numbers? How can I get a probability or membership-like number for each document to see how much it belongs to each of these 8 categories?

Thanks,

Eric

JustinPlumley · Posted 10-16-2013 03:02 PM

Hi Eric,

Sometimes translating numerical output into words is a challenge, so let me try my best to offer as clear an explanation as I can at the moment. You have posted a good example - thank you.

First, let me note some facts about the Text Rule Builder node that are helpful to keep in mind.

(1) This is a predictive model that creates Boolean rules and assesses the predictive power of each rule as well as the overall classification rates.

(2) Every rule has a posterior probability associated with each Target Level. So Rule #1 (e.g. "sports" and not "channel") has a [potentially different] posterior probability associated with Target=TV, Target=Food, Target=Sport, etc., from your example. These posterior probabilities are based on the training data used to train the predictive model.

(3) The classification is based on the rule assessments, which are evaluated with a binary response. That is, each rule results in a 'True' or 'False' outcome (or pass/fail, or 0/1, if you prefer).

Your example shows columns with:

(a) your input variables. In this case, that is only one column: "Document".

(b) n "Predicted: Target=target level" columns, each of which provides a posterior probability that the record belongs to that target level. In this case, there are 8 target levels: TV, Food, ..., Game, Finance. The probabilities in these columns are the result of a naive Bayes algorithm which uses the values from (2) above.

(c) a "Why" column which indicates the rule number that determined the 'assigned' Target value. "32" stands for the 32nd rule, which can be found in the Text Rule Builder Results (without number label) or in the textrule_conj_rule table under the field conj_id.

(d) an "Into: Target" column, which specifies the assigned Target Value for each record.

(e) a column with the maximum value of the n "Predicted: Target=" values, or the maximum of the values in the columns from (b) above.
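
As a rough illustration of (b), (d) and (e), here is a minimal Python sketch that uses the first three rows of the posted output. The names CATEGORIES, ROWS and top_category are made up for this example and are not part of SAS Text Miner; the numbers are copied from the table in the question.

```python
# Category order follows the "Predicted: Target=" columns in the posted output.
CATEGORIES = ["TV", "Food", "Biology", "Politics", "Sport", "Computer", "Game", "Finance"]

# First three rows of the posted scoring output (hypothetical layout):
# (posterior probabilities, "Into: Target", "Probability of Classification")
ROWS = [
    ([1.02e-05, 1.20e-05, 3.68e-04, 0.004349291, 1.06e-05,
      0.025836599, 0.969411505, 1.41e-06], "Biology", 0.969411505),
    ([0.00160021, 3.44e-04, 0.053080483, 0.019703281, 5.09e-04,
      0.697857485, 0.225630023, 0.00127548], "Politics", 0.697857485),
    ([0.035478096, 0.034664386, 0.023956528, 0.07893037, 0.066393264,
      0.174361862, 0.332943365, 0.253272129], "Game", 0.332943365),
]


def top_category(probs):
    """Return the category with the largest posterior probability and that probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return CATEGORIES[best], probs[best]


for probs, assigned, prob_of_classification in ROWS:
    category, p = top_category(probs)
    # Column (e) is the row maximum of the (b) columns, whatever (d) says.
    print(f"assigned={assigned}, highest={category}, "
          f"max equals column (e): {p == prob_of_classification}")
```

In these rows, column (e) always equals the largest of the (b) columns, even when the assigned value in (d) is a different category.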

Now I've finally set the stage for a response to your questions.

As your intuition tells you, most of the time the target level with the highest probability among the "Predicted: Target=" columns corresponds to the target level that the Text Rule Builder node assigns. In other words, most of the time the Text Rule Builder will assign the category with the highest probability of membership. Usually this is close to a 1:1 pattern: in my experience with good-sized training and 'score' sets, the match rates are all 90+%.

So why would the assigned Target level ever not correspond to the target value with the highest posterior probability?

Answer: It is all about the rules! The predictive model will evaluate each of these Boolean rules and assign the category/target level of the triggering rule. For example: if the text contains "cell" and not "phone" and "reproduction", then assign the Target level = Biology.
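
To make the idea of a triggering rule concrete, here is a minimal Python sketch. It is not how the node is implemented: the rule format, the rule numbers, the Sport target for the second rule, and the choice to stop at the first rule that fires are all assumptions made for illustration.

```python
import re

# Hypothetical rule format: (rule id, required terms, excluded terms, target level).
# Rule 1 mirrors the example above; rule 2 and its target are assumed.
RULES = [
    (1, {"cell", "reproduction"}, {"phone"}, "Biology"),
    (2, {"sports"}, {"channel"}, "Sport"),
]


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def rule_fires(tokens, required, excluded):
    """True when every required term is present and no excluded term is."""
    return required <= tokens and not (excluded & tokens)


def assign(text, rules=RULES):
    """Return (rule id, target) for the first rule that fires (assumed order)."""
    tokens = tokenize(text)
    for rule_id, required, excluded, target in rules:
        if rule_fires(tokens, required, excluded):
            return rule_id, target
    # No rule fired; what the node does in that case is not covered here.
    return None, None


print(assign("Cell reproduction in yeast"))
print(assign("Cell phone reproduction rates"))
```

Each rule evaluates to True or False for a document, so the assignment depends on which rule fires rather than on which posterior probability is largest.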

The probabilities are not probabilities of the category that the document would be assigned. Since we are talking about Boolean rules, those values would be a matrix of 0s and 1s. Instead, the probabilities are a prediction of where the document may belong based on the posterior probabilities of the rules. This is also still a good predictor of the assigned category, so it can be used as such, to answer your last question.

In practice, I think it is fair to say that records assigned to a category other than the one with the highest probability are good candidates for review. They may indicate reasons to augment the training data set and rerun the node, or they may represent documents that cross two categories. For example, an article about game theory applied to ecosystems might explain your first example. Or something about cyber espionage may cross "Computer" and "Politics", as in your second.
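
One way to pull out those review candidates is sketched below in Python, using the six rows from the question. The names SCORED and review_candidates are hypothetical, and the row numbers simply stand in for the document text.

```python
CATEGORIES = ["TV", "Food", "Biology", "Politics", "Sport", "Computer", "Game", "Finance"]

# (row number, "Predicted: Target=" values, "Into: Target") from the posted table.
SCORED = [
    (1, [1.02e-05, 1.20e-05, 3.68e-04, 0.004349291, 1.06e-05,
         0.025836599, 0.969411505, 1.41e-06], "Biology"),
    (2, [0.00160021, 3.44e-04, 0.053080483, 0.019703281, 5.09e-04,
         0.697857485, 0.225630023, 0.00127548], "Politics"),
    (3, [0.035478096, 0.034664386, 0.023956528, 0.07893037, 0.066393264,
         0.174361862, 0.332943365, 0.253272129], "Game"),
    (4, [5.98e-06, 7.80e-06, 0.011925956, 8.08e-05, 2.14e-05,
         3.44e-04, 0.987613772, 2.86e-07], "Biology"),
    (5, [0.0103223, 0.010505784, 0.015682763, 0.046567258, 0.032731622,
         0.129644172, 0.101578312, 0.652967789], "Finance"),
    (6, [0.012279932, 0.019225879, 0.00598867, 0.104194986, 0.033799801,
         0.065556881, 0.621594848, 0.137359003], "Game"),
]


def review_candidates(scored):
    """Return rows whose assigned target is not the highest-probability category."""
    flagged = []
    for row, probs, assigned in scored:
        by_category = dict(zip(CATEGORIES, probs))
        top = max(by_category, key=by_category.get)
        if top != assigned:
            flagged.append((row, assigned, by_category[assigned], top, by_category[top]))
    return flagged


for row, assigned, p_assigned, top, p_top in review_candidates(SCORED):
    print(f"row {row}: assigned {assigned} (p={p_assigned:.4f}), "
          f"highest is {top} (p={p_top:.4f})")
```

On the posted data this flags rows 1, 2 and 4, which are the rows where "Into: Target" disagrees with the largest "Predicted: Target=" value.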

Hopefully the clarification - that the probabilities represent a prediction of category based upon posterior probabilities where the assignment is based upon True/False rules - is helpful.

Thanks,

Justin
