I have used textmine to find the frequencies of various words for each of a set of documents. Each document is a free text field that consists of one or more words; this free text field corresponds to one of the columns. There is also an Index field that indicates the document number, along with various other columns.
I used the following code to output the term-document matrix:
I noticed that when I look at the term-document matrix (the file named in outchild) and search its 'Document' column for the 'Index' value of a document that contains only one (and sometimes two) words, it often cannot be found at all. It appears that textmine is not processing these documents at all.
Is this a known feature of textmine (is it supposed to be doing this)?
Is there a simple option within textmine to stop it doing this (as opposed to editing the documents and adding a lot of place-filler words to increase the document length)?
I am wondering whether this has to do with the algorithm being unable to identify whether the words are verbs or nouns (which is something it normally does) when there are very few words, so that it ignores the document altogether?
On the PARSE statement, please specify :
Still the same problem?
Are you sure the 1 or 2 words in the small documents (the ones you are missing in the output tables) are not in the stop list?
BR, Koen
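The stop-list question above can also be checked outside SAS. The following is a minimal sketch in Python, not part of textmine: the function name `docs_fully_stopped`, the sample documents, and the stop words are all hypothetical, and the simple regex tokenizer is an assumption that will not match how textmine parses text. It only flags documents in which every word is on the stop list, which is one possible reason a short document would leave no rows in the output.

```python
import re


def docs_fully_stopped(docs, stop_words):
    """Return the ids of documents whose every token is a stop word.

    docs: dict mapping a document id (the 'Index' value) to its free text.
    stop_words: iterable of stop-list terms (compared case-insensitively).
    """
    stop = {w.lower() for w in stop_words}
    flagged = []
    for doc_id, text in docs.items():
        # Naive tokenizer (an assumption): runs of letters and apostrophes.
        tokens = re.findall(r"[A-Za-z']+", text.lower())
        # Flag only non-empty documents made up entirely of stop words.
        if tokens and all(t in stop for t in tokens):
            flagged.append(doc_id)
    return flagged


# Hypothetical sample data, for illustration only.
docs = {
    1: "not applicable",
    2: "engine failure",
    3: "the engine failed after restart",
    4: "none",
}
stop_words = ["the", "not", "after", "none", "applicable"]

print(docs_fully_stopped(docs, stop_words))
```

If the ids flagged this way match the Index values missing from the outchild table, the stop list is a likely cause; if they do not, the cause lies elsewhere in the parsing options.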