Text Categorization Rule

Occasional Contributor
Posts: 7

Hi,

I am doing document categorization in SAS Text Miner 12.1. After training with the Text Rule Builder node, I got the following rules (excerpt) in the Content Categorization Code window and in the Rules Obtained results window:

F_L2 =Social ::
(OR
, (AND, (OR, "facebook" , "fb" ), (NOT, ("book" , "booking") ))
, (AND, (OR, "friend" , "friends" ), (OR, "" , "noticed" , "notice" ))
F_L2 =Ads ::
(OR
, (AND, (OR, "cheapest" , "cheap" , "cheaper" ))
, (AND, (OR, "call" , "calls" ))
, (AND, (OR, "price" , "pricey" , "prices" ), (OR, "higher" , "high" , "high" ))
, "ripoff"
, (AND, (OR, "well price" , "best price" ))
"Call"

-------------------------------------

Target Value   True Positive/Total   Remaining Positive/Total   Rule
Social         5/11                  60/3,244                   facebook & ~post
Social         2/4                   55/3,233                   Twitter
Social         4/12                  53/3,229                   friend & notice
Ads            11/12                 226/3,217                  cheap
Ads            151/229               215/3,205                  call
Ads            7/9                   64/2,976                   price & high
Ads            3/4                   57/2,967                   ripoff
Ads            3/5                   54/2,963                   price
Ads            6/9                   51/2,958                   call


The SAS CC rules in the upper block correspond to the rules in the table below it; they are the same rules. My questions are as follows:

  • Why are there two “call” rules in the Ads category? They are almost the same rule, so why did SAS not combine them into one? I guess it has something to do with the scoring order?
  • Say I now have to use these rules to score a new dataset by rewriting them in a different language such as Java or Python. I guess the logic of these rules is something like:

If document.text contains (“facebook” or “fb”) and not contains (“book” or “booking”)

               then document.target = “Social”

else if…

Does the original order of the rules matter in this case? Can SAS CC rules be used outside of SAS?

Thanks,

Eric


Accepted Solutions
Solution
‎10-31-2013 04:25 PM
SAS Employee
Posts: 5

Re: Text Categorization Rule

Hi Eric,

I'll try to respond to each of your questions separately and will reference your example above where I can.

First, you've asked:

  • Why are there two “call” rules in the Ads category? They are almost the same rule, so why did SAS not combine them into one? I guess it has something to do with the scoring order?

I agree with Jared's initial thought here (thanks, Jared!). When SAS parses the data into separate tokens, each term-role-attribute combination is treated as a separate item. So "call" as a Noun and Alpha would be treated separately from "call" as a Verb and Alpha (or even "Call" as a Proper Noun or "Call" as a Custom Entity Type). My first guess is that if you go to the preceding Text Filter node, sort the Terms table by the "Term" column, and scroll down, you'll see more than one row for "call". In fact, I'd expect one of these terms to have two child terms and the other not.

The SAS Content Categorization code that is generated (shown first in your example) lists all the children of the parent term that the rule is based on. So the fact that there are two forms of "call" in the first instance (singular and plural) and only one form in the latter is one sign that these are considered separate tokens in Text Miner. If this is correct, then the separate rules that you see above come from separate, distinct tokens in the Terms table.

So that is why there are two rules.  They are not combined simply because Text Miner considers them separate rules (though your idea of combining them is a valid potential enhancement).  The order is implicit in having separate rules, but it's really about originating in separate rules rather than order.

Second, you asked:

  • Say I now have to use these rules to score a new dataset by rewriting them in a different language such as Java or Python. I guess the logic of these rules is something like:

          If document.text contains (“facebook” or “fb”) and not contains (“book” or “booking”)

               then document.target = “Social”

          else if…

     Does the original order of the rules matter in this case? Can SAS CC rules be used outside of SAS?

Yes, that is the idea of how the rules translate.

A small caution: depending on the 'contains' operator used in the translated rule, "book" may always be found as a substring of "facebook", so the translated rule would need to exclude substring matches.  Sorry, Eric - I know this might look like a picky comment, but I include it as a clarification for other readers.  In your defense (even though your example was only meant to be illustrative), some CONTAINS operators do exclude substrings, so this might be perfectly fine.
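
For readers translating the rule, the substring issue can be sketched in Python as follows. This is only an illustration, not SAS's matching logic: the helper names `tokens` and `contains_any` are hypothetical, and the lowercasing and the simple regex tokenizer are assumptions made for the example.

```python
import re

def tokens(text):
    """Lowercase the text and return its set of word tokens (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def contains_any(text, terms):
    """True if at least one term occurs in the text as a whole word."""
    return not tokens(text).isdisjoint(terms)

doc = "Saw it on Facebook today"

# Naive substring test: "book" is found inside "facebook".
substring_hit = "book" in doc.lower()

# Whole-word test: "book" is not one of the document's tokens.
word_hit = contains_any(doc, {"book", "booking"})

# The Social rule from the question, translated with whole-word matching.
is_social = (contains_any(doc, {"facebook", "fb"})
             and not contains_any(doc, {"book", "booking"}))
```

With the substring test, the NOT clause would reject every document that mentions "facebook"; with whole-word matching the rule behaves as intended. Multi-word terms such as "best price" would need phrase matching rather than a token set, which is left out here.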

And yes, you want to treat the rules as if the order does matter.  In your particular example, I don't see any reason that order would matter, but there are situations where it might.  And for the sake of caution, "might" is enough for me to suggest the general policy of 'treat the rules as if the order does matter'.  The TM Help has this to say:

       
          The order of the rules in the table is important. The rule in the first row of the table is discovered by considering all the documents and is the first rule that is added into the rule set. The rule in the second row of the table is learned by analyzing all documents that were not covered by the first rule, and so on. When the rules are applied to new data for scoring, it is assumed that they will be applied in this same order.

So your if / else-if approach should work fine as long as the rule order is maintained.  Sure, the rules discovered by the TM Rule Builder node are conveniently displayed in Content Categorization syntax, but they could be edited for use outside of SAS if desired.
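
A minimal Python sketch of that ordered if / else-if scoring, using some of the rules from Eric's excerpt. The `Rule` class and the `score` function are hypothetical names, whole-word case-insensitive matching is an assumption, and part-of-speech roles are ignored entirely.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Rule:
    category: str
    all_of: list                                  # list of sets; each set is one OR group
    none_of: set = field(default_factory=set)     # terms under NOT

    def matches(self, words):
        # Every OR group needs at least one hit, and no NOT term may appear.
        return (all(not group.isdisjoint(words) for group in self.all_of)
                and self.none_of.isdisjoint(words))

# Rules in the same order as the generated code (empty and repeated terms dropped).
RULES = [
    Rule("Social", [{"facebook", "fb"}], {"book", "booking"}),
    Rule("Social", [{"friend", "friends"}, {"noticed", "notice"}]),
    Rule("Ads", [{"cheapest", "cheap", "cheaper"}]),
    Rule("Ads", [{"call", "calls"}]),
    Rule("Ads", [{"price", "pricey", "prices"}, {"higher", "high"}]),
    Rule("Ads", [{"ripoff"}]),
]

def score(text, rules=RULES, default=None):
    """Return the category of the first rule that matches, else the default."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    for rule in rules:          # list order is the scoring order
        if rule.matches(words):
            return rule.category
    return default

print(score("My friend noticed the cheap price"))
print(score("Call now for cheap rates"))
```

Keeping the rules in a list rather than in hard-coded if statements makes it easy to preserve the order from the Rules Obtained table when the rule set is regenerated.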

Hope this helps,

Justin


All Replies
Contributor
Posts: 71

Re: Text Categorization Rule

I'm barely familiar with what you are doing above.  I haven't gotten into the rule-building stuff yet.  But I did notice, regarding your first question, that perhaps "call" is different from "Call" in the Rule column?  Another guess I have is that one is a verb and the other is not?

I can't answer your second question, but I am interested in the answer.  My gut feeling is that order shouldn't matter.  If it did, then I'd think that the rules are not fine-tuned enough?  But perhaps the order of rules is a rule in and of itself....

Good questions...thanks for asking them.

Occasional Contributor
Posts: 7

Re: Text Categorization Rule

Hi Jared,

Thanks for your attention to this topic. The original two "call" rules can be found in the code in the upper part as

  • , (AND, (OR, "call" , "calls" ))
  • "Call"

From what I see, the only difference is that the first "call" rule has one more OR term, the child word "calls". This situation happens quite often in the rest of the rule code I got. Since the scoring dataset will not be parsed (or given part-of-speech tags) during the scoring process, the guess that the two "call" rules represent different parts of speech (verb, noun) seems less likely.

Unfortunately, I can hardly find a more detailed explanation of these rules in the SAS help documentation.

Occasional Contributor
Posts: 7

Re: Text Categorization Rule

Hi Justin,

Thanks again. Your answers are always so helpful.

Now I understand that the order of these rules does matter, and I can see how it works in the True Positive/Total and Remaining Positive/Total columns. And you are right that my sample code was not well thought out; it only shows my idea of translating the rules into another programming language. Thanks for your clarification on the code.
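
The pattern in those two columns can be checked with a few lines of Python. The figures are copied from the results table in the question; the reading that the "Remaining" counts are taken before each rule is applied is inferred from the numbers, and the function name `is_sequential` is hypothetical.

```python
# (target, true_positive, total_matched, remaining_positive, remaining_total, rule)
ROWS = [
    ("Social", 5, 11, 60, 3244, "facebook & ~post"),
    ("Social", 2, 4, 55, 3233, "Twitter"),
    ("Social", 4, 12, 53, 3229, "friend & notice"),
    ("Ads", 11, 12, 226, 3217, "cheap"),
    ("Ads", 151, 229, 215, 3205, "call"),
    ("Ads", 7, 9, 64, 2976, "price & high"),
    ("Ads", 3, 4, 57, 2967, "ripoff"),
    ("Ads", 3, 5, 54, 2963, "price"),
    ("Ads", 6, 9, 51, 2958, "call"),
]

def is_sequential(rows):
    """Check that each row's 'remaining' counts equal the previous row's
    counts minus the documents the previous rule covered."""
    for prev, cur in zip(rows, rows[1:]):
        p_target, p_tp, p_total, p_rem_pos, p_rem_total, _ = prev
        c_target, _, _, c_rem_pos, c_rem_total, _ = cur
        # Documents covered by a rule leave the pool for all later rules.
        if c_rem_total != p_rem_total - p_total:
            return False
        # Remaining positives are only comparable within one target value.
        if c_target == p_target and c_rem_pos != p_rem_pos - p_tp:
            return False
    return True

print(is_sequential(ROWS))
```

The check passes for all nine rows: for example, 3,244 - 11 = 3,233 documents remain after the first rule, and 60 - 5 = 55 of them are Social. This matches the Help text quoted above, where each rule is learned from the documents not covered by the earlier rules.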

As for the two similar rules, I looked into the filter viewer in the Text Filter node, and the two “call” terms are tagged with different parts of speech. This is kind of a pain for me, because it means that when I translate these rules into another language, I also have to include the part-of-speech/term-role tagging. Besides that, the documents to be scored will also have to be parsed and tagged with part of speech/term role (noun, verb, etc.), which could be quite a challenge and extra work. I know Python’s NLTK package can do this job, but if the engineering side says they have to deploy the rules in another language such as Java, then I really don't know what to do.

But then I realized that these “same-word, different-role” rules only occur within a single category and never across different categories. So, like you said, it should be okay to combine these rules into one rule and do without the term-role parsing. In the end, they will just be assigned to the same category anyway.
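
One way to sketch that simplification in Python: once term roles are dropped and matching is case-insensitive (both assumptions), a later rule whose terms are all contained in an earlier rule can never fire under first-match-wins scoring, so it can be removed. The function name `drop_shadowed` is hypothetical, and the sketch only covers rules that consist of a single OR group.

```python
def drop_shadowed(rules):
    """rules: list of (category, terms) pairs in scoring order, where each
    rule is a single OR group. Returns the rules that can still fire."""
    kept = []
    for category, terms in rules:
        terms = frozenset(t.lower() for t in terms)   # ignore case and role
        # If an earlier rule already matches every term of this rule, any
        # document this rule would match is claimed earlier.
        if any(terms <= earlier for _, earlier in kept):
            continue
        kept.append((category, terms))
    return kept

ads_rules = [
    ("Ads", {"cheapest", "cheap", "cheaper"}),
    ("Ads", {"call", "calls"}),
    ("Ads", {"ripoff"}),
    ("Ads", {"Call"}),        # same word, different role in Text Miner
]

for category, terms in drop_shadowed(ads_rules):
    print(category, sorted(terms))
```

Here the final "Call" rule disappears because "call" is already covered by the earlier "call" / "calls" rule. If the same word ever appeared under two different categories, dropping the later rule would change which category wins, so that case would need a closer look.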
