03-03-2013 07:27 PM
Posted by Mirko Krivanek on Text Mining - AnalyticBridge
I thought that these were very interesting.
To the seven above, I would add words that are often used interchangeably but are intended to mean two different things. For example, in the Development Experience Clearinghouse, evaluations are meant to be documents that analyze either the performance of a project or the impact a project has made on a sector or geographic location. Assessments are meant to be documents that analyze the conditions of a particular sector or geographic location before a project or program takes place. And yet the terms are often used indiscriminately within the documents themselves. A human can look at a document and discern whether it is an assessment or an evaluation, but it's very difficult to write rules for the SAS Content Categorization Studio that capture the difference.
What linguistic challenges do others have when writing profile rules or text mining algorithms?
03-04-2013 11:00 AM
That's very interesting. Cases like these are why a good training corpus is necessary.
It's funny how #3 and #6 seem fixable with a slight word change: "I ate tomato with salt" and "The lamb was cooked and ready to eat." The others are not as easily fixed.
If the classification rules are difficult to write, perhaps the corpus itself can be modified. That is rarely possible, but some projects can entertain it as an option.
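To make the training-corpus point a bit more concrete, here is a rough Python sketch, not anything SAS Content Categorization Studio actually does. It compares a naive keyword rule (label a document by whichever term it uses) with a tiny naive Bayes classifier trained on labeled examples. The four training sentences, the labels, and every function name are invented for illustration; a real corpus would need to be far larger.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def keyword_rule(text):
    # Naive rule: trust whichever term the document itself uses more often.
    counts = Counter(tokenize(text))
    if counts["evaluation"] > counts["assessment"]:
        return "evaluation"
    if counts["assessment"] > counts["evaluation"]:
        return "assessment"
    return "unknown"

def train(labeled_docs):
    # Count words per label and documents per label.
    word_counts, doc_counts = {}, Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, doc_counts

def classify(text, model):
    # Multinomial naive Bayes with add-one smoothing.
    word_counts, doc_counts = model
    vocab = set().union(*word_counts.values())
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(word_counts):
        counts = word_counts[label]
        total = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for tok in tokenize(text):
            score += math.log((counts[tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical, hand-labeled toy corpus. Note that the authors' own wording
# ("assessment" / "evaluation") does not always match the human label.
TOY_CORPUS = [
    ("This assessment reviews project performance and impact after completion", "evaluation"),
    ("Evaluation of results achieved and impact on the sector after the program ended", "evaluation"),
    ("Assessment of sector conditions before the program begins", "assessment"),
    ("This evaluation describes baseline conditions prior to project start", "assessment"),
]

model = train(TOY_CORPUS)
doc = "This assessment measures the impact and results achieved after the project ended"
print(keyword_rule(doc))     # follows the author's wording
print(classify(doc, model))  # follows the surrounding context words
```

On this toy data the keyword rule takes the author's word for it, while the trained classifier leans on context words such as "impact", "results" and "after". That is the only claim here: the labels do the work, so the quality of the corpus matters more than the cleverness of the rule.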