JuliaM

Seven tricky sentences for NLP and text mining algorithms - AnalyticBridge


Posted by Mirko Krivanek on Text Mining - AnalyticBridge

I thought that these were very interesting.

  1. "A land of milk and honey" becomes "A land of Milken Honey" (the algorithm was trained on Wall Street Journal text from the 1980s, where Michael Milken was mentioned far more often than milk)
  2. "She threw up her dinner" vs. "She threw up her hands"
  3. "I ate a tomato with salt" vs. "I ate a tomato with my mother" or "I ate a tomato with a fork"
  4. Words ending with -ing, e.g. "They were entertaining people"
  5. "He washed and dried the dishes" vs. "He drank and smoked cigars" (in the latter case he did not drink cigars)
  6. "The lamb was ready to eat" vs. "Was the lamb hungry and wanting some grass?"
  7. Words with multiple meanings (e.g. a bay can be a color, a type of window, or a body of water)
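
To make item 7 concrete, here is a minimal Python sketch of one naive way to pick a sense for "bay": count how many hand-picked cue words for each sense appear in the sentence. The sense names, the cue words, and the function names are all assumptions made up for illustration; this is not how any particular product works, and real systems use far richer context.

```python
import re

# Hypothetical sense inventory: each sense of "bay" is paired with a few
# hand-picked cue words. These lists are illustrative assumptions only.
SENSES = {
    "color": {"horse", "mare", "coat", "brown", "reddish"},
    "window": {"window", "room", "house", "glass", "seat"},
    "water": {"boat", "harbor", "coast", "sea", "anchored", "shore"},
}

def tokenize(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def guess_sense(sentence):
    """Pick the sense whose cue words overlap most with the sentence.

    Returns None when no cue word appears at all. Ties go to whichever
    sense is listed first in SENSES.
    """
    words = tokenize(sentence)
    scores = {sense: len(words & cues) for sense, cues in SENSES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_sense("The boat anchored in the bay"))   # water
print(guess_sense("She rode a bay mare"))            # color
print(guess_sense("I like the bay"))                 # None
```

The last call shows the weakness: with no cue words nearby, the sketch has nothing to go on, which is exactly the situation where a human reader would still lean on wider context.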

To the seven above I would add words that are often used interchangeably but are intended to mean two different things. For example, in the Development Experience Clearinghouse, evaluations are documents that analyze either the performance of a project or the impact a project has had on a sector or geographic location. Assessments are documents that analyze the conditions of a particular sector or geographic location before a project or program takes place. Yet the terms are often used indiscriminately within the documents themselves. A human can read a document and tell whether it is an assessment or an evaluation, but it is very difficult to write rules for SAS Content Categorization Studio that capture the difference.
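
As a rough illustration of the problem, the following Python sketch ignores the words "assessment" and "evaluation" entirely and instead scores a document on cue phrases tied to the two definitions above. It is not SAS Content Categorization Studio rule syntax, and the cue lists and function names are hypothetical choices made for this example only.

```python
import re

# Hypothetical cue phrases derived loosely from the two definitions:
# evaluations look at performance or impact, assessments look at
# conditions before a project starts. Real rules would need many more.
EVALUATION_CUES = ["impact", "performance", "results achieved", "lessons learned"]
ASSESSMENT_CUES = ["baseline", "current conditions", "prior to", "needs"]

def count_cues(text, cues):
    """Count whole-phrase, case-insensitive occurrences of each cue."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(cue) + r"\b", lowered))
        for cue in cues
    )

def classify(text):
    """Label a document by whichever cue list matches more often."""
    evaluation_score = count_cues(text, EVALUATION_CUES)
    assessment_score = count_cues(text, ASSESSMENT_CUES)
    if evaluation_score > assessment_score:
        return "evaluation"
    if assessment_score > evaluation_score:
        return "assessment"
    return "undetermined"

# The document calls itself an assessment but reads like an evaluation.
doc = ("This assessment reviews the performance of the project "
       "and its impact on the health sector.")
print(classify(doc))  # evaluation
```

Even this toy version shows why the rules are hard to write: a document that mixes both kinds of language, or uses neither cue list, ends up "undetermined" or mislabeled.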

What linguistic challenges do others have when writing profile rules or text mining algorithms?

jaredp

That's very interesting.  Cases like these are why a good training corpus is necessary.

It's funny how #3 and #6 seem fixable with a slight word change.  "I ate tomato with salt" and "The lamb was cooked and ready to eat".  The others are not as easily fixed.

If the classification rules are difficult to make, perhaps the corpus can be modified.  While this is rarely possible, some projects can entertain it as an option.
