Erik_Zencos
Obsidian | Level 7

Hi, I noticed something strange in SAS CA that I hope someone can clarify for me. In SAS CA you can optionally drag words over to the "Dropped terms" list. In my project, I have not used this option yet. However, the automatically generated output dataset "all_terms_ds" (which contains all the terms in the text) has a field called "keep" with values "Y" or "N". Many of the terms have the value "N", and I suspect these terms will be dropped from the analysis. I can't find any setting in CA that controls this, and I can't see the logic behind it (the terms with "N" mostly occur in only a few documents, but terms that occur the same number of times can have either Y or N). Does anybody know whether these terms are actually dropped, what the rules behind this are, and whether these rules can be changed?

1 ACCEPTED SOLUTION

Erik_Zencos
Obsidian | Level 7

Update to my own question: I later found this in the CA user guide (chap. 1 page 2) that probably explains this behavior:

"By default, words that provide little or no value are excluded from analysis. Examples of these words include the articles a, an, and the and conjunctions such as and, or, and but. Other terms that are specific to your document collection but provide little or no value are also identified and excluded."

One should be aware of this: since we often use a training set, terms that could be important might end up being excluded because they are not well represented in the training data.
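For anyone who wants to check which terms were flagged, here is a minimal sketch in Python of how one might tabulate the "keep" flag against document frequency after exporting "all_terms_ds". Only the "keep" field is named in this thread; the column names "term" and "numdocs" and the sample rows are assumptions made up for illustration, so adjust them to whatever your dataset actually contains.

```python
from collections import Counter

# Hypothetical rows exported from all_terms_ds.
# Only "keep" comes from the thread; "term" and "numdocs" are assumed names,
# and the values below are invented sample data.
rows = [
    {"term": "the", "numdocs": 950, "keep": "N"},
    {"term": "invoice", "numdocs": 4, "keep": "Y"},
    {"term": "regards", "numdocs": 4, "keep": "N"},
    {"term": "refund", "numdocs": 120, "keep": "Y"},
]


def keep_by_doc_count(rows):
    """Count keep=Y and keep=N terms for each document frequency."""
    table = {}
    for row in rows:
        counts = table.setdefault(row["numdocs"], Counter())
        counts[row["keep"]] += 1
    return table


def dropped_terms(rows):
    """Return the terms flagged keep=N, sorted alphabetically."""
    return sorted(row["term"] for row in rows if row["keep"] == "N")


summary = keep_by_doc_count(rows)
for numdocs in sorted(summary):
    # A frequency showing both Y and N means frequency alone does not decide the flag.
    print(numdocs, dict(summary[numdocs]))

print(dropped_terms(rows))
```

Reviewing the list of keep=N terms against your domain vocabulary is a quick way to spot important terms that were excluded only because they are rare in the training data.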


