Term Filters - What Exactly Does the Minimum Number of Documents Do?

I think the default Minimum Number of Documents is 4 when I create a Text Filter node. That is clearly too low when I am dealing with millions of documents, so I am experimenting to find what this number should ideally be.

I have also become hooked on the Text Rule Builder node. What I mainly hope to get from the Text Miner are words and/or phrases that I can dichotomize and use as predictors in an eventual Decision Tree. I fully understand that the Text Miner may be most useful for creating factors or clusters, but I am hoping to use it as a stepping stone toward my ultimate Decision Tree.

As part of my output from the Text Rule Builder I get:

  • The rule
  • Target values - which is very helpful in my work
  • Precision & recall
  • F1 score (I do not know what this is)
  • Ratio of True Positives to Total
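
For reference, F1 is conventionally defined as the harmonic mean of precision and recall. The following is a minimal Python sketch of that standard definition, not of how the Text Rule Builder computes its columns internally; the function name and the counts are hypothetical and chosen only for illustration.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) from raw classification counts."""
    predicted_pos = true_pos + false_pos   # documents the rule matched
    actual_pos = true_pos + false_neg      # documents that truly have the target value
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    denom = precision + recall
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical rule: matches 30 documents, 24 of them correctly,
# and misses 16 documents that have the target value.
p, r, f1 = precision_recall_f1(true_pos=24, false_pos=6, false_neg=16)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
# prints: precision=0.80 recall=0.60 F1=0.69
```

Because it is a harmonic mean, F1 is pulled toward the lower of the two values, so a rule scores well only when precision and recall are both reasonably high.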

I do not understand how the Total denominator in the last column of my output can be less than my minimum number of documents. I do see a direct association between the number of rules produced and the minimum number of documents, but I am looking for a more precise understanding of what the setting actually does. Anything that helps me and the community here is highly valued and appreciated.
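
One way to frame the question is with a generic document-frequency filter. The Python sketch below is an assumption-laden illustration, not the Text Filter or Text Rule Builder implementation: it assumes the minimum applies to individual terms, that a rule is a conjunction of terms, and that Total counts the documents a rule matches. All function names and the toy documents are hypothetical. Under those assumptions, a rule can match fewer documents than the minimum even though every term in it passes the filter.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents it appears in."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # set(): count each term once per document
    return df


def filter_terms(docs, min_docs):
    """Keep only terms that occur in at least min_docs documents."""
    df = document_frequency(docs)
    return {term for term, n in df.items() if n >= min_docs}


def rule_matches(docs, required_terms):
    """Number of documents containing every term in the (conjunctive) rule."""
    required = set(required_terms)
    return sum(1 for doc in docs if required <= set(doc.lower().split()))


# Toy corpus, invented for illustration only.
docs = [
    "late payment fee",
    "late delivery complaint",
    "late payment reminder",
    "payment received thanks",
    "late fee waived",
    "payment plan request",
]

kept = filter_terms(docs, min_docs=4)
print(sorted(kept))                              # ['late', 'payment']
print(rule_matches(docs, ["late", "payment"]))   # 2
```

Here "late" and "payment" each appear in 4 documents and survive a minimum of 4, yet the rule requiring both matches only 2 documents. Whether the actual nodes behave this way would need to be confirmed against the product documentation.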
