term-document frequency matrix from textmine is ignoring small documents with one or two words

William29 · Posted 09-04-2024 12:18 AM

I have used textmine to find the frequencies of various words in each of a set of documents. Each document is a free-text field consisting of one or more words, and this free-text field is one of the columns in the input table. There is also an Index field that holds the document number, along with various other columns.

I used the following code to output the term-document matrix:

I noticed that when I look at the term-document frequency matrix (the file mentioned in outchild) and search its 'Document' column for the 'Index' value of a document that contains only one word (and sometimes two words), that value often cannot be found at all. It appears that textmine is not even processing these documents.

Is this a known feature of textmine (is it supposed to be doing this)?

Is there a simple option within textmine to stop it doing this (as opposed to editing the documents and adding a lot of place-filler words to increase the document length)?

I am wondering whether this happens because the algorithm cannot identify whether the words are verbs or nouns (which is something it tries to do) when a document contains very few words, and so it ignores the document altogether?

sbxkoenk · Posted 09-09-2024 09:43 AM

On the PARSE statement, please specify:

NONOUNGROUPS | NONG : Suppresses noun group extraction in parsing
NOSTEMMING : Suppresses stemming in parsing
NOTAGGING : Suppresses part-of-speech tagging in parsing
SHOWDROPPEDTERMS : Includes dropped terms in the OUTTERMS= data table

Still the same problem?
Are you sure the 1 or 2 words in the small documents (the ones you are missing in the output tables) are not in the stop list?
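
For reference, the options above could be placed on the PARSE statement roughly as in the sketch below. This is only an illustration, not the original poster's code: the caslib `mycas`, the input table `docs`, the text variable `FreeText`, and the output table names are all assumed for the example. Only the `Index` field, the `outchild` table, and the four PARSE options come from this thread.

```sas
/* Sketch only: mycas.docs, FreeText and the output table names are
   hypothetical placeholders, not taken from the original post.      */
proc textmine data=mycas.docs;
   doc_id Index;                 /* document identifier column           */
   variables FreeText;           /* free-text column to parse            */
   parse
      nonoungroups               /* suppress noun group extraction       */
      nostemming                 /* suppress stemming                    */
      notagging                  /* suppress part-of-speech tagging      */
      showdroppedterms           /* keep dropped terms in OUTTERMS=      */
      outchild=mycas.outchild    /* term-document frequency matrix       */
      outterms=mycas.outterms;   /* term table, including dropped terms  */
run;
```

With SHOWDROPPEDTERMS in effect, the OUTTERMS= table should make it possible to see whether the words from the one-word and two-word documents were parsed but then dropped, for example because they are in the stop list.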

BR, Koen
