Text mining and content categorization

Seven tricky sentences for NLP and text mining algorithms

Seven tricky sentences for NLP and text mining algorithms - AnalyticBridge

Posted by Mirko Krivanek on Text Mining - AnalyticBridge

I thought that these were very interesting.

  1. "A land of milk and honey" becomes "A land of Milken Honey" (the algorithm was trained on the Wall Street Journal from the 1980s, where Michael Milken was mentioned far more often than milk)
  2. "She threw up her dinner" vs. "She threw up her hands"
  3. "I ate a tomato with salt" vs. "I ate a tomato with my mother" or "I ate a tomato with a fork"
  4. Words ending with -ing, e.g. "They were entertaining people"
  5. "He washed and dried the dishes" vs. "He drank and smoked cigars" (in the latter case he did not drink cigars)
  6. "The lamb was ready to eat" (was the lamb cooked, or was it hungry and wanting some grass?)
  7. Words with multiple meanings (e.g. a bay can be a color, a type of window, or a body of water)
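
To make #7 concrete, here is a minimal sketch of overlap-based sense picking for "bay", loosely in the spirit of simplified Lesk. The sense inventory and cue words are assumptions made up for illustration, not taken from any real lexicon, and the function name `disambiguate_bay` is hypothetical.

```python
import re

# Hypothetical sense inventory: the cue words are illustrative only.
SENSES = {
    "color": {"horse", "mare", "coat", "brown", "reddish"},
    "window": {"window", "room", "glass", "house", "curtains"},
    "water": {"water", "boat", "coast", "harbor", "sailed", "sea"},
}

def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def disambiguate_bay(sentence):
    """Pick the sense whose cue words overlap most with the sentence.

    Returns None when no cue word appears at all, which is exactly
    where a rule like this runs out of evidence.
    """
    tokens = tokenize(sentence)
    scores = {sense: len(tokens & cues) for sense, cues in SENSES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

if __name__ == "__main__":
    for s in ["The boat sailed across the bay",
              "She rode a bay mare",
              "The room has a bay window",
              "I like the bay"]:
        print(s, "->", disambiguate_bay(s))
```

The last sentence has no context cues, so the sketch gives up on it. A real system would need a much richer sense inventory or a trained model.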

I would add an eighth case to the seven above: words that are often used interchangeably but are intended to mean two different things. For example, in the Development Experience Clearinghouse, "evaluations" are meant to be documents that analyze either the performance of a project or the impact a project has made on a sector or geographic location. "Assessments" are supposed to be documents that analyze the conditions of a particular sector or geographic location before a project or program takes place. And yet the terms are often used indiscriminately within the documents themselves. A human can look at a document and discern whether it is an assessment or an evaluation, but it's very difficult to write rules for the SAS Content Categorization Studio that capture the difference.
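
As a rough illustration of why the label a document uses for itself is unreliable, here is a small Python sketch comparing two toy rules. This is not SAS Content Categorization Studio rule syntax. The cue lists, function names, and sample sentence are all assumptions invented to mirror the definitions above.

```python
import re

# Hypothetical cue lists, chosen only to mirror the definitions in the post.
EVALUATION_CUES = {"performance", "impact", "results", "achieved", "outcomes"}
ASSESSMENT_CUES = {"baseline", "conditions", "needs", "prior", "before"}

def words(text):
    """Lowercase the text and return its alphabetic tokens in order."""
    return re.findall(r"[a-z]+", text.lower())

def pick(n_eval, n_assess):
    """Turn two counts into a label, or 'unknown' on a tie."""
    if n_eval == n_assess:
        return "unknown"
    return "evaluation" if n_eval > n_assess else "assessment"

def classify_by_label(text):
    """Naive rule: trust whichever term the document uses more often."""
    toks = words(text)
    n_eval = sum(t.startswith("evaluat") for t in toks)
    n_assess = sum(t.startswith("assess") for t in toks)
    return pick(n_eval, n_assess)

def classify_by_cues(text):
    """Alternative rule: count cue words about what the document analyzes."""
    toks = set(words(text))
    return pick(len(toks & EVALUATION_CUES), len(toks & ASSESSMENT_CUES))

# Made-up sentence that calls itself an assessment but describes an evaluation.
doc = ("This assessment reviews the performance of the project "
       "and the impact it achieved in the health sector.")

if __name__ == "__main__":
    print("by label:", classify_by_label(doc))
    print("by cues: ", classify_by_cues(doc))
```

On this made-up sentence the two rules disagree, which is the problem in miniature. The cue rule is only as good as its word lists, so it would need tuning against real documents.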

What linguistic challenges do others face when writing profile rules or text mining algorithms?

Re: Seven tricky sentences for NLP and text mining algorithms

That's very interesting.  Cases like these are why a good training corpus is necessary.

It's funny how #3 and #6 seem fixable with a slight word change: "I ate tomato with salt" and "The lamb was cooked and ready to eat".  The others are not as easily fixed.

If the classification rules are difficult to write, perhaps the corpus can be modified.  This is rarely possible, but some projects can entertain it as an option.
