About JustinPlumley

JustinPlumley · ‎10-31-2013

Hi Eric, You have it. In response to your first question, the Text Rule Builder node works sequentially: it examines terms and term combinations for those with the highest estimated precision, then iteratively looks at smaller and smaller subsets of the data after removing the documents matched by earlier rules. The Precision and Recall shown on each line are cumulative: they are calculated from all of the Boolean rules up to and including the line you are looking at. This is easiest to verify with the first few rules, since the arithmetic is quick to check. Thanks, Justin
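To make the cumulative arithmetic concrete, here is a minimal Python sketch. It is not SAS code and not the node's actual implementation; the documents, the two rules, and the function name are made up for illustration. It only shows how precision and recall on a given line can be computed from every rule up to that line.

```python
def cumulative_precision_recall(docs, rules, target):
    """Return one (precision, recall) pair per rule, computed over all
    rules up to and including that rule.

    docs  : list of (set_of_terms, label) pairs (hypothetical data layout)
    rules : ordered list of functions that take a set of terms, return bool
    """
    total_positive = sum(1 for _, label in docs if label == target)
    matched = set()  # indices of documents matched by any rule so far
    rows = []
    for rule in rules:
        # A later rule only contributes documents not matched by earlier rules.
        matched |= {
            i for i, (terms, _) in enumerate(docs)
            if i not in matched and rule(terms)
        }
        true_pos = sum(1 for i in matched if docs[i][1] == target)
        precision = true_pos / len(matched) if matched else 0.0
        recall = true_pos / total_positive if total_positive else 0.0
        rows.append((precision, recall))
    return rows


# Made-up training documents and rules, in rule-table order.
docs = [
    ({"facebook", "friend"}, "Social"),
    ({"fb", "post"}, "Social"),
    ({"facebook", "stock"}, "Finance"),
    ({"tweet"}, "Social"),
]
rules = [
    lambda terms: "fb" in terms,
    lambda terms: "facebook" in terms,
]
rows = cumulative_precision_recall(docs, rules, "Social")
for line_no, (precision, recall) in enumerate(rows, start=1):
    print(f"rule {line_no}: precision={precision:.2f} recall={recall:.2f}")
```

With this toy data the first line reflects only the first rule, and the second line reflects both rules together, which is why the first rows of the real table are the easiest to check by hand.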

JustinPlumley · ‎10-31-2013

Hi Eric,

I'll try to respond to each of your questions separately and will reference your example above where I can.

First, you asked:

    Why are there two rules of “Call” in the Ads categories? They are almost the same rules so why SAS did not combine they together as one rules. I guess it has something to do with the scoring order?

I agree with Jared's initial thought here (thanks, Jared!). When SAS parses the data into separate tokens, each term-role-attribute combination is treated as a separate item. So "call" as a Noun and Alpha is treated separately from "call" as a Verb and Alpha (or even "Call" as a Proper Noun or "Call" as a Custom Entity Type). My first guess is that if you go to the preceding Text Filter node, sort the Terms table by the "Term" column, and scroll down, you'll see more than one row for "call". In fact, I'd expect one of these terms to have two child terms and the other not to. The generated SAS Content Categorization code (shown first in your example) lists all the children of the parent term that the rule is based upon, so the fact that there are two forms of "call" in the first instance (singular and plural) and only one form of "call" in the latter is one sign that these are considered separate tokens in Text Miner.

If this is correct, then the separate rules you see above come from separate, distinct tokens in the Terms table. That is why there are two rules. They are not combined simply because Text Miner considers them separate rules (though your idea of combining them is a valid potential enhancement). The order is implicit in having separate rules, but the split is really about the rules originating separately rather than about order.

Second, you asked:

    Say, now I have to use these rules to score new dataset by writing in different language such as Java or Python. I guess the logic of these rules are like: If document.text contains (“facebook” or “fb”) and not contains (“book”or “booking”) then document.target= “Social” else if… Does the original orders of the rules matter in this case? Can SAS CC rules be used outsides of SAS?

Yes, that is the idea of how the rules translate. A small caution: depending on the 'contains' operator used in the translated rule, "book" may always be found as a substring of "facebook", so the translated rule would need to exclude substring matches. Sorry, Eric - I am sure this looks like a picky comment, but I include it for the benefit of other readers. In your defense (even though your example was simply for illustration), some CONTAINS operators do exclude substrings, so this might be perfectly fine.

And yes, you want to treat the rules as if the order does matter. In your particular example, I don't see any reason that order would matter, but there are situations where it might. For the sake of caution, "might" is enough for me to suggest the general policy of 'treat the rules as if the order does matter'. The TM Help has this to say:

    The order of the rules in the table is important. The rule in the first row of the table is discovered by considering all the documents and is the first rule that is added into the rule set. The rule in the second row of the table is learned by analyzing all documents that were not covered by the first rule, and so on. When the rules are applied to new data for scoring, it is assumed that they will be applied in this same order.

So your If, Else-if approach should work fine if the rule order is maintained. And yes, the rules discovered by the TM Rule Builder node are displayed in Content Categorization syntax, but they could be edited for use outside of SAS if desired.

Hope this helps,
Justin
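For readers translating rules by hand, here is a rough Python sketch of the If / Else-if idea with whole-word matching, so that "book" is not found inside "facebook". The rule list, the "Travel" category, and the function names are hypothetical and are not generated by SAS; the only points being illustrated are the word-boundary match and keeping the rules in their original order.

```python
import re


def contains_word(text, word):
    """True if `word` occurs as a whole word (case-insensitive).

    The \\b word boundaries prevent "book" from matching inside "facebook".
    """
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Hypothetical rules, listed in the same order as the rule table:
# (target, terms of which at least one must appear, terms that must not appear)
RULES = [
    ("Social", ["facebook", "fb"], ["book", "booking"]),
    ("Travel", ["booking", "hotel"], []),
]


def score(text, rules=RULES, default=None):
    """Apply the rules in order and return the target of the first match."""
    for target, any_of, none_of in rules:
        has_required = any(contains_word(text, w) for w in any_of)
        has_excluded = any(contains_word(text, w) for w in none_of)
        if has_required and not has_excluded:
            return target
    return default


print(score("I posted it on Facebook"))   # first rule fires
print(score("Booking a hotel via fb"))    # first rule excluded, second fires
print(score("Nothing relevant here"))     # no rule fires
```

A plain substring test such as `"book" in text.lower()` would wrongly exclude the first example, which is the caution raised above.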

JustinPlumley · ‎10-16-2013

Hi Eric,

Sometimes translating numerical output into words is a challenge, so let me try my best to offer as clear an explanation as I can at the moment. You have posted a good example - thank you.

First, let me note some facts about the Text Rule Builder node that are helpful to keep in mind.

(1) This is a predictive model that creates Boolean rules and assesses the predictive power of each rule as well as the overall classification rates.
(2) Every rule has a posterior probability associated with each Target level. So Rule #1 (e.g. "sports" and not "channel") has a [potentially different] posterior probability associated with Target=TV, Target=Food, Target=Sports, etc., from your example. These posterior probabilities are based on the training data that trained the predictive model.
(3) The classification is based on the rule assessments, which are evaluated with a binary response. That is, each rule results in a 'True' or 'False' outcome. Or pass/fail or 0/1, if you prefer.

Your example shows these columns:

(a) your input variables. In this case, that is only one column: "Document".
(b) n "Predicted: Target=target level" columns, which provide a posterior probability that the record belongs to this target level. In this case, there are 8 target levels: TV, Food, ..., Game, Finance. The probabilities in these columns are the result of a naive Bayes algorithm which uses the values from (2) above.
(c) a "Why" column, which indicates the rule number that determined the 'assigned' Target value. "32" stands for the 32nd rule, which can be found in the Text Rule Builder Results (without a number label) or in the textrule_conj_rule table under the field conj_id.
(d) an "Into: Target" column, which specifies the assigned Target value for each record.
(e) a column with the maximum of the n "Predicted: Target=" values, that is, the maximum of the values in the columns from (b) above.

Now I've finally set the stage for a response to your questions.

As your intuition tells you, most of the time the target level with the highest probability in the "Predicted: Target=" columns corresponds to the Target level that the Text Rule Builder node assigns. In other words, most of the time, the Text Rule Builder will assign the value that corresponds to the category with the "highest probability in which it belongs". Usually, this is close to a 1:1 pattern - in my experience with good-sized training and 'score' sets, the agreement has always been 90% or higher.

So why would the assigned Target level ever not correspond to the target value with the highest posterior probability? Answer: It is all about the rules! The predictive model evaluates each of these Boolean rules and assigns the category/target level of the triggering rule. For example: if the text contains "cell" and not "phone" and "reproduction", then assign Target level = Biology.

The probabilities are not probabilities of the category that the document would be assigned. Since we are talking about Boolean rules, those values would be a matrix of 0s and 1s. Instead, the probabilities are a prediction of where the document may belong, based on the posterior probabilities of the rules. This is still a good predictor of the assigned category, so it can be used as such, to answer your last question.

In practice, I think it is fair to say that records assigned to a category other than the one with the highest probability are good candidates for review. They may indicate reasons to augment the training data set and rerun the node, or they may represent a document that crosses two categories. For example, an article about game theory applied to ecosystems might explain your first example. Or something about cyber espionage may cross "Computer" and "Politics", as in your second.

Hopefully the clarification - that the probabilities represent a prediction of category based upon posterior probabilities, whereas the assignment is based upon True/False rules - is helpful.
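As a small illustration of the review idea above, here is a Python sketch that flags scored records whose assigned target differs from the target level with the highest posterior probability. The record layout (keys "document", "into", "why", "predicted") and the sample values are hypothetical simplifications of the "Into: Target", "Why", and "Predicted: Target=" columns; this is not SAS output.

```python
def flag_for_review(records, target_levels):
    """Return (document, assigned_target, most_probable_target) for records
    where the rule-based assignment disagrees with the highest probability."""
    flagged = []
    for rec in records:
        probs = {level: rec["predicted"][level] for level in target_levels}
        most_probable = max(probs, key=probs.get)
        if most_probable != rec["into"]:
            flagged.append((rec["document"], rec["into"], most_probable))
    return flagged


# Made-up scored records with three target levels.
levels = ["Game", "Biology", "TV"]
records = [
    {"document": "doc1", "into": "Game", "why": 32,
     "predicted": {"Game": 0.30, "Biology": 0.55, "TV": 0.15}},
    {"document": "doc2", "into": "TV", "why": 4,
     "predicted": {"Game": 0.10, "Biology": 0.10, "TV": 0.80}},
]
flagged = flag_for_review(records, levels)
print(flagged)
```

Here doc1 is flagged because a rule assigned it to "Game" while the probabilities point to "Biology", which is the kind of record worth reading by hand.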
Thanks, Justin

JustinPlumley · ‎09-26-2013

Hi,

The dataset exported from the Text Topic Node includes the original (raw) data plus new variables associated with the topics. Simplifying slightly, the exported dataset will contain additional binary indicators (whether the record belongs to the topic or not) as well as raw scores (like a projection onto that topic - mocked values in the example below):

id  text                      product  _1_0_amber  _1_0_memory  _1_0_video  amber  memory  video
1   Amber blink continuously  x142     1           0            0           .9     .05     .08
2   Memory space problem      x189     0           1            0           .1     2.3     .07
3   video blurr               x902     0           0            1           .1     .1      1.1

As you can see, this includes the original (raw) dataset of 3 records, then 3 additional binary columns (corresponding to the 3 topics) and 3 additional raw topic score columns (also corresponding to the 3 topics).

At the moment, I am not sure why you would use both a START list (which says 'only use these terms') and a STOP list (which says 'do not use these terms') at the same time. I would simply remove the words from the START list that also occur in the STOP list, and then use this new START list. Would that help with what you are trying to accomplish?

As far as importing synonyms, that is available at the Text Filter node.

Hope this helps!
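If the two lists live outside SAS as plain term lists, the pruning suggested above could be sketched like this in Python. The function name and sample terms are made up, and matching terms case-insensitively is an assumption on my part.

```python
def prune_start_list(start_terms, stop_terms):
    """Return the START terms that do not also appear in the STOP list.

    Comparison is case-insensitive (an assumption); the original order of
    the START list is preserved.
    """
    stop = {term.lower() for term in stop_terms}
    return [term for term in start_terms if term.lower() not in stop]


new_start = prune_start_list(["amber", "memory", "video", "the"], ["The", "a"])
print(new_start)
```

The resulting list can then be used as the only term list, with no STOP list needed.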

JustinPlumley · ‎09-25-2013

Hi Heather,

Yes, Enterprise Content Categorization now supports XPath functionality. So if you have, say, an XML document that looks like:

    ...
    <article>
      <body> Here is where we talk about Bill Nye. </body>
      <body> Here is another part of the body where we talk about Bill Cosby. </body>
      <title>Title</title>
    </article>

then there are multiple body elements. Let's say you want to match the name 'Bill' - but only in the second body element. Then we can use a rule of the form:

    (OR, _/article/body[2]:"_c{Bill}")

which will not match Bill Nye but will match Bill Cosby.

Hope this helps.
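To see what the body[2] position selects, independent of SAS, here is a short Python sketch using the standard library's xml.etree.ElementTree, which supports this kind of positional predicate. It only demonstrates the XPath part; it does not reproduce the Content Categorization rule syntax or the _c{} operator.

```python
import xml.etree.ElementTree as ET

xml_text = """<article>
  <body>Here is where we talk about Bill Nye.</body>
  <body>Here is another part of the body where we talk about Bill Cosby.</body>
  <title>Title</title>
</article>"""

root = ET.fromstring(xml_text)        # root is the <article> element
second_body = root.find("body[2]")    # XPath positions are 1-based
print(second_body.text)
print("Bill" in second_body.text)
```

Only the second body element is searched, so the mention of Bill in the first body element is never considered.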

Date Last Visited	‎09-01-2015 07:11 AM

Re: How to interpret the scoring data result of Text Rule Builder

Re: Text Categorization Rule

Re: Joining the text topic node output with the input data

Re: xpath in SAS Content Categorizer
