William29
Obsidian | Level 7

I have used textmine to find the frequencies of various words in each of a set of documents. Each document is a free-text field consisting of one or more words; this free-text field corresponds to one of the columns. There is also an Index field that holds the document number, along with various other columns.

 

I used the following code to output the term-document matrix:

 

William29_0-1725423083486.png

When I looked at the term-document matrix (the file specified in outchild), I noticed that if I search the 'Document' column for the 'Index' value of a document containing only one (and sometimes two) words, it often cannot be found at all. It appears that textmine is not even processing these documents.

 

Is this a known feature of textmine (is it supposed to be doing this)?

 

Is there a simple option within textmine to stop it from doing this (as opposed to editing the documents and adding a lot of filler words to increase the document length)?

 

I am wondering whether this happens because, when a document has very few words, the algorithm cannot identify whether the words are verbs or nouns (which is something it tries to do), and so it ignores the document altogether?

1 REPLY 1
sbxkoenk
SAS Super FREQ

On the PARSE statement, please specify:

  • NONOUNGROUPS | NONG  :  Suppresses noun group extraction in parsing
  • NOSTEMMING  :  Suppresses stemming in parsing
  • NOTAGGING  :  Suppresses part-of-speech tagging in parsing
  • SHOWDROPPEDTERMS  :  Includes dropped terms in the OUTTERMS= data table

Still the same problem? 
Are you sure the 1 or 2 words in the small documents (the ones you are missing in the output tables) are not in the stop list?
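
As a rough illustration, the PARSE options listed above could be combined as in the sketch below. The original code was posted only as a screenshot, so the table names (`mycas.docs`, `mycas.terms`, `mycas.child`) and the text variable `free_text` are hypothetical placeholders; `Index` is assumed to be the document ID column described in the question.

```sas
/* Sketch only: mycas.docs, free_text and the output table names are
   hypothetical. Index is assumed to be the document ID column. */
proc textmine data=mycas.docs;
   doc_id Index;              /* unique document identifier */
   variables free_text;       /* free-text column to parse (assumed name) */
   parse
      nonoungroups            /* suppress noun group extraction */
      nostemming              /* suppress stemming */
      notagging               /* suppress part-of-speech tagging */
      showdroppedterms        /* include dropped terms in OUTTERMS= */
      outterms=mycas.terms    /* terms table, including dropped terms */
      outchild=mycas.child;   /* term-document matrix */
run;
```

With SHOWDROPPEDTERMS in effect, the words from the short documents should show up in the OUTTERMS= table even if they were dropped, which would help tell apart "dropped by a stop list" from "never parsed".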


BR, Koen
