
SAS Contextual Analysis - terms to keep

Hi, I noticed something strange in SAS CA that I hope someone can clarify for me. In SAS CA you can optionally drag words over to the "Dropped terms" list. I have not used this option in my project. However, the automatically generated output dataset "all_terms_ds", which contains all the terms in the text, has a field called "keep" with values "Y" or "N". Many of the terms have the value "N", and I suspect these terms will be dropped from the analysis. I can't find any setting in CA that controls this, and I can't see the logic behind it: the "N" terms occur in only a few documents, but terms that occur the same number of times can have either Y or N. Does anybody know whether these terms are actually dropped, what the rules behind this are, and whether these rules can be changed?


Accepted Solutions
Solution
01-06-2016 10:31 AM
New Contributor
Posts: 2

Re: SAS Contextual Analysis - terms to keep

Update to my own question: I later found this in the CA user guide (chap. 1 page 2) that probably explains this behavior:

"By default, words that provide little or no value are excluded from analysis. Examples of these words include the articles a, an, and the and conjunctions such as and, or, and but. Other terms that are specific to your document collection but provide little or no value are also identified and excluded."

One should be aware of this: because we often work from a training set, terms that may be important can end up excluded simply because they are poorly represented in the training data.
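
For anyone who wants to check this on their own project, here is a rough sketch of how one might tabulate the "keep" flag against document counts after exporting all_terms_ds. It is Python rather than SAS code, and it is not how CA decides anything; it only summarizes the flags. Only the "keep" field is confirmed above. The column names "term" and "numdocs" and the sample rows are made up for illustration.

```python
from collections import defaultdict

# Made-up rows standing in for an export of all_terms_ds.
# Only "keep" is a known field; "term" and "numdocs" are assumed names.
rows = [
    {"term": "the", "numdocs": 950, "keep": "N"},
    {"term": "invoice", "numdocs": 40, "keep": "Y"},
    {"term": "refund", "numdocs": 3, "keep": "Y"},
    {"term": "misc", "numdocs": 3, "keep": "N"},
    {"term": "warranty", "numdocs": 2, "keep": "N"},
]

def keep_by_doc_count(rows):
    """Count keep=Y and keep=N terms for each document count."""
    table = defaultdict(lambda: {"Y": 0, "N": 0})
    for row in rows:
        table[row["numdocs"]][row["keep"]] += 1
    return dict(table)

def mixed_counts(table):
    """Document counts at which both kept and dropped terms occur."""
    return sorted(n for n, c in table.items() if c["Y"] and c["N"])

table = keep_by_doc_count(rows)
for n in sorted(table):
    print(n, table[n])
print("mixed:", mixed_counts(table))
```

If the list of mixed counts is empty, a plain document-count cutoff could explain the flags. If it is not, as in the sample rows, something beyond raw frequency is involved, which matches what was described in the question.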
