Erik_Zencos
Obsidian | Level 7

Hi, I noticed something strange in SAS CA that I hope someone can clarify for me. In SAS CA you can optionally drag words over to the "Dropped terms" list. In my project, I have not used this option yet. However, the automatically generated output dataset "all_terms_ds", which contains all the terms in the text, has a field called "keep" with values "Y" or "N". Many of the terms have the value "N", and I suspect these terms will be dropped from the analysis. I can't find any setting in CA that controls this, and I can't find any logic behind it (except that these terms occur in only a few documents, but terms that occur the same number of times can have either Y or N). Does anybody know whether these terms are actually dropped, what the rules behind this are, and whether these rules can be changed?

1 ACCEPTED SOLUTION

Erik_Zencos
Obsidian | Level 7

Update to my own question: I later found this passage in the CA user guide (chap. 1, page 2), which probably explains this behavior:

"By default, words that provide little or no value are excluded from analysis. Examples of these words include the articles a, an, and the and conjunctions such as and, or, and but. Other terms that are specific to your document collection but provide little or no value are also identified and excluded."

One should be aware of this: since we often use a training set, terms that could be important might end up being excluded because they are not well represented in the training data.
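For anyone who wants to check which terms were flagged, one option is to export "all_terms_ds" and tabulate the "keep" flag against document frequency. Below is a minimal Python sketch, not something from the CA documentation. Only the "keep" field with values "Y"/"N" is confirmed above; the column names `term` and `numdocs`, and the sample rows, are assumptions made up for illustration.

```python
from collections import Counter, defaultdict


def summarize_keep(rows):
    """Count kept ("Y") and dropped ("N") terms for each document frequency.

    `rows` is an iterable of dicts. The keys "term" and "numdocs" are
    hypothetical column names; only "keep" is known from the thread.
    """
    counts = defaultdict(Counter)
    for row in rows:
        flag = row["keep"].strip().upper()
        counts[int(row["numdocs"])][flag] += 1
    return {n: dict(c) for n, c in sorted(counts.items())}


def dropped_terms(rows):
    """Return the terms flagged keep="N", sorted alphabetically."""
    return sorted(r["term"] for r in rows if r["keep"].strip().upper() == "N")


# Made-up rows for illustration only; real data would come from the export.
sample_rows = [
    {"term": "the", "keep": "N", "numdocs": "120"},
    {"term": "and", "keep": "N", "numdocs": "120"},
    {"term": "invoice", "keep": "Y", "numdocs": "45"},
    {"term": "refund", "keep": "Y", "numdocs": "3"},
    {"term": "zx17", "keep": "N", "numdocs": "3"},
]

if __name__ == "__main__":
    print(summarize_keep(sample_rows))
    print(dropped_terms(sample_rows))
```

With a real export, the rows could be read with `csv.DictReader` after adjusting the column names. Reviewing the dropped list against terms you know matter for your domain is one way to spot important terms that were excluded because the training data underrepresents them.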
