Extracting Information from Text Documents in Business Contexts

In part 1 of this two-part series, I will concentrate on the Concept rule types. In the second part of the series, I will focus on the Fact rule types to address these questions:

In a series of documents, how can I find specific information, patterns or relationships between a series of words?
In a sentence, can I extract ALL the information that is between two important words, independently of the pattern between those two words?

These are some examples of those questions applied to a business context:

What are the most common complaints/comments customers have on a product?
What are the side effects patients report from a group of medical drugs?
What is the opinion that people have about a government official?
What are the specific times when auto accidents occurred?
If you are in the hospitality business, is there a set of features that your customers prefer, and which features do they complain about?
In quality control: what were the conditions that caused defective material?
What equipment is failing, and what are the types of failure?
For Veteran Claims: what is the time the claim was initiated and what is the nature of the claim itself (medical, loans, etc.)?

Extracting this type of information requires subject matter expertise and knowledge of how those patterns and relationships appear in the language. SAS Visual Text Analytics (VTA) helps you develop linguistic rules that combine the power of Natural Language Processing (NLP), Machine Learning (ML) and human insight. LITI is the SAS proprietary programming language through which these linguistic rules are expressed, allowing for highly customized models.

In my previous articles, I have given several examples of LITI rules.

SAS’s LITI rules are so powerful and flexible that one can develop rules for a great variety of business situations. In this article, I will concentrate on the Concept rule types, which include CLASSIFIER, CONCEPT, C_CONCEPT and CONCEPT_RULE. In my next article, I will write about the Fact rule types, which include PREDICATE_RULE and SEQUENCE.

Concepts and Facts in VTA

Concepts are useful for analyzing information in context, identifying recurring themes and extracting useful information from documents. Examples of concepts are: a book title, last name, city, gender, and so on. VTA provides at least seven* out-of-the-box predefined concepts such as Date (nlpDate), Money (nlpMoney), Organization (nlpOrganization), Percent (nlpPercent), Person (nlpPerson), Place (nlpPlace), Time (nlpTime) whose LITI rules are already defined in VTA to save development time. These predefined concepts can also be used to create Custom Concepts.

*Note: There are nine predefined concepts in this screenshot because the selected language is English and there are a few additional supported predefined concepts for English.

Examples of predefined concepts would return matches such as:

nlpDate: March 12
nlpMoney: $5.20
nlpOrganization: SAS Institute, Inc
nlpPercent: 5%
nlpPerson: Martin Luther King Jr.
nlpPlace: Seattle
nlpTime: 8:20 AM

The screenshot below shows some of the matches for the nlpNounGroup predefined concept.

[Screenshot: sample matches for the nlpNounGroup predefined concept]

Facts are related pieces of information in text that are located and matched together and there are specific LITI rules to identify them. I will cover Facts in more detail in the second post.

Custom concepts, on the other hand, are created by writing LITI rules to recognize items in context so that we can match and extract only the pieces of the document that match the rule. For example, you can specify that the concept kitchen is identified when the terms refrigerator, sink, and countertop are encountered in the document.

LITI syntax includes Boolean and distance operators and can reference part-of-speech tags. It is easier to understand LITI rules by providing examples; for most examples in this article I will use a real-world data set with customer reviews of a hospitality firm.

For the rule types CONCEPT, C_CONCEPT and CONCEPT_RULE, one can use morphological expansion operators to return inflected forms of a word, as well as part-of-speech (POS) tags, which let you locate matches by the part of speech of the searched item rather than by a specific term.
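As a rough sketch of both features, the rules below use illustrative terms that are not taken from the hospitality data set. The @V operator expands a word to its verb forms, and a tag such as :A stands in for any adjective:

```
# Hypothetical rules, for illustration only
# @V expands "recommend" to its verb forms (recommends, recommended, ...)
CONCEPT:recommend@V this hotel
# :A matches any adjective, e.g. "room was clean", "room was spacious"
CONCEPT:room was :A
```

Related operators (@, @N, @A) expand a word to all of its forms, its noun forms, or its adjective forms.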

Add a New Concept

This screenshot shows the steps to create a new concept. This example shows the LITI syntax for the HotelAmenities custom concept.

Alternatively you could use the New concept button:

[Screenshot: the New concept button]

Rule Type CLASSIFIER

This rule type is the initial building block to defining custom concepts. It identifies single terms or strings that you want matched in context.

Example #1

Define the concept myKitchen as
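A minimal sketch of what such a concept could contain, assuming the three kitchen terms mentioned earlier in this article (the actual term list may differ):

```
# Sketch only: one CLASSIFIER rule per term to match
CLASSIFIER:refrigerator
CLASSIFIER:sink
CLASSIFIER:countertop
```

Each CLASSIFIER line matches its term literally, so every variant you care about (for example, a plural form) needs its own line.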

Example #2

Define the concept RoomAmenities as

[Screenshot: definition of the RoomAmenities concept]

Rule Type CONCEPT

This rule type identifies related information by referencing other concepts.

Example #3

Using the custom concept RoomAmenities defined in Example #2, I define the custom concept NiceRoom as

These rules read:

Return matches for the string “hotels with” followed by a match to the HotelAmenities concept
Return matches for the string “room with” followed by a match to the RoomAmenities concept
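Written out in LITI, the two rules just described would look roughly like this. This is a sketch reconstructed from the description, assuming the HotelAmenities and RoomAmenities concepts already exist:

```
# Rules inside the NiceRoom concept
# A literal string followed by a reference to another concept
CONCEPT:hotels with HotelAmenities
CONCEPT:room with RoomAmenities
```

The elements of a CONCEPT rule must appear in the text in the order written, so “room with” has to come directly before the RoomAmenities match.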

Example #4

Assume you previously defined the custom concepts HotelAmenities, RoomAmenities and ConvenientLocation. By referring to those previously defined custom concepts, I will define the concept BestOptions as

The custom concept ConvenientLocation is defined in Example #9 later.

Example #5

The hospitality firm dataset had documents where the owner of the facility is referred to as host. A simple rule type CONCEPT can be used to define the custom concept myHost as

Notice the ::PN syntax. This will help match proper nouns. Matches like the last one shown above indicate that we need to refine the concept myHost further: we are getting matches for terms such as URL and Nespresso, and we want to narrow the matches down to people’s names.

Example #6

As I mentioned above, we can further refine the previous Example #5 and find better matches for good hosts, by defining the concept wonderfulHost as

Notice the addition of :PN for proper noun, :V for verb, and :A for adjective.
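To show how these three tags can be chained, here is a hypothetical rule. It is not the exact wonderfulHost definition, and it assumes the tagger labels the name as a proper noun:

```
# Hypothetical rule, for illustration only
# "host" + proper noun + verb + adjective, e.g. "host Maria was wonderful"
CONCEPT:host :PN :V :A
```

Requiring a verb and an adjective after the name is one way to filter out matches where the word after “host” is capitalized but is not a person.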

Rule Type C_CONCEPT

This rule type is used to extract information that occurs in a specific context or pattern. The part of the match to extract is enclosed in _c{ }; the rest of the rule only supplies the context. As with the rule type CONCEPT, one can use morphological expansion operators and POS tags.

Example #7

This example doesn’t use the hospitality dataset but illustrates this rule type very well.

Assume these are concepts already defined:

concept governmentOfficial that contains words such as president, governor, majority leader, senator, and senators, and
concept stateUS that includes names of USA states

By defining the concept officialAndState as

C_CONCEPT: governmentOfficial _c{_cap _cap} :Prep stateUS

A match for this concept will be this phrase:

Governor Roy Cooper of North Carolina has proclaimed May 7-11, 2018 as National Teacher Appreciation Week

In this rule we use _cap, which represents any capitalized word; to extract two capitalized words we use _c{_cap _cap}. The string extracted is “Roy Cooper”. Also, notice that we use the POS tag :Prep to match the preposition “of”.

However, the following text would not produce a match, because the word “and” that follows the first two capitalized words is not a preposition: Senators Phineas Craymoor and Garrett Garcia from North Carolina pushed the bill through.
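One way to cover that sentence is to add a second rule to the same concept. The rule below is a hypothetical sketch, not part of the original example. It steps over the first name and the word “and”, then extracts the second name:

```
# Hypothetical companion rule for officialAndState
# Context: official + two capitalized words + "and"
# Extracted: the next two capitalized words (here "Garrett Garcia")
C_CONCEPT:governmentOfficial _cap _cap and _c{_cap _cap} :Prep stateUS
```

Only the text inside _c{ } is returned, so a further rule with _c{ } around the first pair of _cap elements would be needed to extract “Phineas Craymoor” as well.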

Rule Type CONCEPT_RULE

This rule type is more powerful than the previous ones, and therefore uses more computational resources. It uses Boolean and proximity operators to determine matches.

The most frequently used Boolean and proximity operators are AND, OR, NOT, SENT (sentence) and DIST (distance). Here are some useful examples of how to use them.

Example #8

NOT is used to find matches if the argument does not occur in the whole document. NOT must be used with the AND operator. Let’s define the concept trainStationInWalkingDistance as
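A hypothetical sketch of how NOT combines with AND follows. The terms and the distance are illustrative and are not the actual trainStationInWalkingDistance definition:

```
# Hypothetical rule, for illustration only
# Match "train station" within 10 tokens of "walking distance",
# but only in documents that never mention "taxi"
CONCEPT_RULE:(AND, (DIST_10, "_c{train station}", "walking distance"), (NOT, "taxi"))
```

In a CONCEPT_RULE each argument is enclosed in double quotation marks, and _c{ } marks the part of the match that is returned.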

Example #9

SENT matches if all arguments occur in the same sentence, regardless of their order; SENT_n matches if all arguments occur within n sentences of each other.

DIST_n matches if all arguments occur within n (or fewer) tokens of each other, regardless of their order.
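The difference between a sentence-level and a token-level constraint can be sketched with two hypothetical rules. The terms are illustrative and are not taken from the concepts in this article:

```
# Hypothetical rules, for illustration only
# SENT: both arguments must appear in the same sentence, in any order
CONCEPT_RULE:(SENT, "_c{location}", "convenient")
# DIST_5: both arguments must appear within 5 tokens of each other
CONCEPT_RULE:(DIST_5, "_c{location}", "convenient")
```

The DIST rule is usually the stricter of the two in a long sentence, but it can also match across a sentence boundary when the two terms are close together.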

In Example #4 above, I used the concept ConvenientLocation, which is defined as

This last example shows how important it is to catch misspellings, and how challenging and interesting Text Analytics can be!

As I mentioned in my previous post, “Analysis of Movie Reviews using Visual Text Analytics”, the discovery process in Text Analytics projects, as in all analytical projects, requires several iterations in which the insights found in one iteration feed the next. For the linguistic rules, one must determine whether the new rules improve on those used in previous iterations by counting how many true positives and false positives the new rules match. This process should be repeated until one obtains the desired precision, that is, the share of all matches that are true positives.

My next article will cover the rule types for Facts. Be sure to check it out!

Additional Resources

Check out this new book: SAS Text Analytics for Business Applications.
Check out these three articles on NLP, Artificial Intelligence and using Text Analytics to analyze descriptions of side effects or adverse events that patients have reported following a vaccination:
- Natural Language Processing
- Artificial intelligence, machine learning, deep learning and beyond machine learning, deep learning ...