Extracting Information from Text Documents in Business Contexts - Part Two

This is the second part of a two-part series. In Part 1, I focused on the Concept rule types. In this article, I will concentrate on the Fact rule types.

Facts are related pieces of information in the text that are located and matched together.

Concept and Fact rule types are used to address these questions:

In a series of documents, how can I find specific information, patterns or relationships between a series of words?
In a sentence, can I extract ALL the information that lies between two important words, regardless of the pattern between those two words?

Extracting this type of information requires subject matter expertise and knowledge of how those patterns and relationships appear in the language. SAS Visual Text Analytics (VTA) is a tool designed for this task: it combines Natural Language Processing (NLP), Machine Learning (ML) and human insight to build linguistic rules.

In Part 1, I wrote about the rule types CLASSIFIER, CONCEPT, C_CONCEPT and CONCEPT_RULE, which are very useful in exploratory analysis. In this article, I write about the Fact rule types, which include PREDICATE_RULE and SEQUENCE. The order in which I address these rules corresponds to their complexity. Remember: always use the simplest rule that gets the job done.

I will continue using the hospitality data and Custom Concepts introduced in Part 1.

Facts can be identified within a Custom Concept. For example, suppose you want to identify fun activities for families with children. First, we build one custom concept that identifies families and a separate custom concept that identifies fun activities. Then we build a Fact named Fact_family_fun using those concepts to find reviews where those two concepts are related.

Another example appears in Christina Hsiao’s article, Automatically extracting key information from textual data, which illustrates how to find the relationships between patients’ reactions and where on the body they occurred. In that article, the concept BodyParts (muscle, arm, head, etc.) and the concept LocalizedSymptoms (pain, redness, fever, etc.) are used to build the Fact named PatientSymptoms.

Further examples can be found in the book SAS Text Analytics for Business Applications (by Jade, Belamaric Wilsey and Wallis), published by SAS. The authors describe how to conduct an Information Extraction process efficiently, and the book provides a wealth of information on LITI rules: best practices, tips, how to avoid common mistakes, etc. From its section What does this book Cover: “… because real-world examples are essential for increased relevance to users; the book presents best practices from seasoned practitioners through realistic use cases and real data as much as possible.”

Rule Type SEQUENCE

SEQUENCE is used when the order of the items in the Fact is important, and you want to extract several terms in highly structured text, where you know that the pattern specified in the rule exists. A sequence rule can detect a structure so that each term in the Fact matches in the order that you specify with no intervening items.

Example #1

To illustrate this LITI rule, I will define the concept Sequence_funActivitiesType. This concept uses two helper concepts, funActivities and funActivitiesType:

Step 1: Define the helper custom concepts

Define the concept funActivities as

Define the concept funActivitiesType as
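As a minimal sketch of what two helper concepts of this kind might contain, assume each is built from simple CLASSIFIER rules (covered in Part 1). The terms listed here are hypothetical placeholders chosen for illustration, not the definitions used in this project. Each group of rules would be entered in its own concept.

```
# Concept: funActivities (hypothetical terms)
CLASSIFIER:museum
CLASSIFIER:zoo
CLASSIFIER:aquarium

# Concept: funActivitiesType (hypothetical terms)
CLASSIFIER:art
CLASSIFIER:science
CLASSIFIER:history
```

Keeping the helper concepts this simple makes it easier to see which part of a Fact rule is responsible when a match is missed.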

Step 2: Now I will use the Sequence rule which combines the two custom concepts I just defined above. This Fact named Sequence_funActivitiesType is defined as

To best illustrate how the sequence rule works, I will use the Test Sample Text tab. This tab is used to test the rule that is defined in the Edit Concept window. Only the text that matches the defined rule will be highlighted.

In the screenshot below, the rule SEQUENCE:(type,activity):_type{funActivitiesType} _activity{funActivities} is defined. Notice that a match is found for the first paragraph only; it is highlighted in blue. I’ve used yellow to highlight similar text, but notice that it doesn’t match the SEQUENCE being defined.
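The same rule is annotated here to show how its parts line up. The two sample sentences in the comments are hypothetical, and they assume that funActivitiesType matches "art" and funActivities matches "museum"; they are not the text from the hospitality data.

```
# Concept: Sequence_funActivitiesType
# The labels (type, activity) name the two pieces returned by the Fact.
# Each label is bound to a helper concept, and the elements must appear
# in exactly this order.
SEQUENCE:(type,activity):_type{funActivitiesType} _activity{funActivities}

# Hypothetical test text, under the assumptions above:
#   "We visited the art museum."     type=art, activity=museum
#   "We visited the Museum of Arts." no match: the activity comes first
#                                    and "of" sits between the two terms
```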

To extract “Museum of Arts” or “Museum of Fine Arts”, we must write a sequence rule different from the one in Sequence_funActivitiesType. The concept Sequence_FineArts shows how to find those matches.
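One way such a concept could be written is sketched here, as an assumption rather than the actual definition of Sequence_FineArts. It reverses the order of the two helper concepts and spells out the intervening words as literal tokens. It also assumes that funActivitiesType is able to match "Arts", which depends on how that concept and the project's case settings are defined.

```
# Concept: Sequence_FineArts (hypothetical rules)
# Activity first, then the literal token "of", then the type.
SEQUENCE:(activity,type):_activity{funActivities} of _type{funActivitiesType}
# Same pattern with the extra literal token "Fine" before the type.
SEQUENCE:(activity,type):_activity{funActivities} of Fine _type{funActivitiesType}
```

Because every token between the labeled elements has to be stated in the rule, each variation of the phrase needs its own line, which is why SEQUENCE suits highly structured text.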

A possible question at this point could be: Should the rules used to define Sequence_FineArts be combined with the concept Sequence_funActivitiesType? Or, more generally, should I build custom concepts with several rule definitions? In this example, yes, you might want to combine these SEQUENCE concepts; however, this might not apply for other projects.

A project whose concept definitions mix many different rule types will be hard to maintain. In general, I find it easier to develop each LITI rule separately, using the Test Sample Text tab to verify that it works as expected, and to decide how to combine rules only after everything is working.

Rule Type PREDICATE_RULE

PREDICATE_RULE is used to extract several pieces of information using Boolean, distance, and morphological operators and part-of-speech tags. If you need a brief review of these items, see Part 1. The PREDICATE_RULE type should be used where a CONCEPT_RULE cannot achieve the same results. The elements to be extracted may appear in the sentence in any order; this rule type does not restrict it. The rule also extracts the text between the two elements.
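To make the difference from CONCEPT_RULE concrete, here is a hedged side-by-side sketch. The search term "kid@" (the @ modifier expands a word to its inflected forms) and the labels activity and who are hypothetical choices for illustration.

```
# CONCEPT_RULE returns only the single element marked with _c{}.
CONCEPT_RULE:(SENT, "_c{funActivities}", "kid@")

# PREDICATE_RULE returns each labeled element, plus the text
# between them as the overall match.
PREDICATE_RULE:(activity,who):(SENT, "_activity{funActivities}", "_who{kid@}")
```

If only one piece of the sentence is needed, the CONCEPT_RULE is the simpler choice; the PREDICATE_RULE earns its place when both pieces and their relationship must be captured.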

Example #2

We can use predefined concepts in a PREDICATE_RULE; in this example, I use nlpPlace. In the first paragraph, we can see that the order in which the terms are extracted is not necessarily the order in which they are written in the rule. Also, notice that all the terms between concept1 and concept2 are extracted, where concept2 is the custom concept funActivities defined earlier in this article.
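A rule along these lines could be sketched as follows. The labels concept1 and concept2 come from the description above; the SENT operator (both elements within the same sentence) and the pairing of concept1 with nlpPlace are assumptions made for illustration.

```
# concept1 is bound to the predefined concept nlpPlace,
# concept2 to the custom concept funActivities.
# SENT requires both to occur in the same sentence, in either order.
PREDICATE_RULE:(concept1,concept2):(SENT, "_concept1{nlpPlace}", "_concept2{funActivities}")
```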

Example #3

We can also extract whole sentences. For example, suppose we want to know which room amenities or hotel amenities are mentioned together with wonderful hosts. In Part 1, I defined the concepts roomAmenities and hotelAmenities. We can now use these custom concepts to define the concept Fact_host_amenities as
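As a rough sketch of how a Fact of this kind could be written, the rules here assume that "wonderful" must appear within two tokens of a form of "host" and that the amenity appears in the same sentence. The labels host and amenity, the DIST_2 distance, and the use of one rule per amenity concept are all assumptions made for illustration.

```
# Concept: Fact_host_amenities (hypothetical rules)
# "host@" expands to the inflected forms of "host" (host, hosts, ...).
PREDICATE_RULE:(host,amenity):(SENT, (DIST_2, "wonderful", "_host{host@}"), "_amenity{roomAmenities}")
PREDICATE_RULE:(host,amenity):(SENT, (DIST_2, "wonderful", "_host{host@}"), "_amenity{hotelAmenities}")
```

Widening the operator, for example to a multi-sentence scope, would return more of the surrounding text at the cost of looser relationships between the host and the amenity.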

Action Items

For further learning about these topics, a helpful reference is the book SAS Text Analytics for Business Applications.

Thanks

To my SAS Asia Pacific colleagues for attending the VTA workshop and generating the questions that steered this two-part series.

SAS Communities Library