In part 1 of this two-part series, I will concentrate on the Concept rule types. In the second part of the series, I will focus on the Fact rule types to address these questions:
These are some examples of those questions applied to a business context:
Extracting this type of information requires subject matter expertise and knowledge of how those patterns and relationships appear in the language. SAS Visual Text Analytics (VTA) helps you develop linguistic rules that combine the power of Natural Language Processing (NLP), Machine Learning (ML) and human insight. LITI is the proprietary SAS programming language through which these linguistic rules are expressed, allowing for highly customized models.
In my previous articles, I have given several examples of LITI rules:
SAS’s LITI rules are so powerful and flexible that one can develop rules for a great variety of business situations. In this article, I will concentrate on the Concept rule types, which include CLASSIFIER, CONCEPT, C_CONCEPT and CONCEPT_RULE. In my next article, I will write about the Fact rule types, which include PREDICATE_RULE and SEQUENCE.
Concepts are useful for analyzing information in context, identifying recurring themes and extracting useful information from documents. Examples of concepts are: a book title, last name, city, gender, and so on. VTA provides at least seven* out-of-the-box predefined concepts such as Date (nlpDate), Money (nlpMoney), Organization (nlpOrganization), Percent (nlpPercent), Person (nlpPerson), Place (nlpPlace), Time (nlpTime) whose LITI rules are already defined in VTA to save development time. These predefined concepts can also be used to create Custom Concepts.
*Note: There are nine predefined concepts in this screenshot because the selected language is English and there are a few additional supported predefined concepts for English.
Examples of predefined concepts would return matches such as:
The screenshot below shows some of the matches for the nlpNounGroup predefined concept.
Facts are related pieces of information in text that are located and matched together and there are specific LITI rules to identify them. I will cover Facts in more detail in the second post.
Custom concepts, on the other hand, are created by writing LITI rules to recognize items in context so that we can match and extract only the pieces of the document that match the rule. For example, you can specify that the concept kitchen is identified when the terms refrigerator, sink, and countertop are encountered in the document.
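To make the kitchen example concrete, here is a minimal sketch in plain Python (not LITI, and not how VTA works internally) of what matching a fixed list of terms in context looks like. The names `KITCHEN_TERMS` and `match_terms` are hypothetical and exist only for this illustration.

```python
import re

# Terms taken from the kitchen example in the text; the names used here
# are illustrative only and are not part of VTA or LITI.
KITCHEN_TERMS = ["refrigerator", "sink", "countertop"]

def match_terms(text, terms):
    """Return (term, start, end) for each whole-word, case-insensitive hit."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(0).lower(), m.start(), m.end()) for m in pattern.finditer(text)]

review = "The sink was spotless, but the refrigerator hummed all night."
hits = match_terms(review, KITCHEN_TERMS)
for term, start, end in hits:
    print(term, start, end)
```

Each hit carries its character offsets, which mirrors the idea that a concept match points at a specific piece of the document rather than at the document as a whole.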
LITI syntax includes Boolean and distance operators and can reference part-of-speech tags. LITI rules are easiest to understand through examples; for most examples in this article I will use a real-world data set of customer reviews for a hospitality firm.
For the rule types CONCEPT, C_CONCEPT and CONCEPT_RULE, one can use morphological expansion operators to return inflected forms of a word.
One can also use part-of-speech (POS) tags, which enable you to locate matches by the part of speech that the searched item belongs to rather than by a specific term.
This screenshot shows the steps to create a new concept, along with the LITI syntax for the HotelAmenities custom concept.
Alternatively you could use the New concept button:
The CLASSIFIER rule type is the initial building block for defining custom concepts. It identifies single terms or strings that you want matched in context.
Define the concept myKitchen as
Define the concept RoomAmenities as
The CONCEPT rule type identifies related information by referencing other concepts.
Using the custom concept RoomAmenities defined in Example #2, I define the custom concept NiceRoom as
These rules read:
Assume you previously defined the custom concepts HotelAmenities, RoomAmenities and ConvenientLocation. By referring to those previously defined custom concepts, I will define the concept BestOptions as
The custom concept ConvenientLocation is defined in Example #9 later.
The hospitality firm dataset had documents where the owner of the facility is referred to as host. A simple rule type CONCEPT can be used to define the custom concept myHost as
Notice the :PN syntax, which helps match proper nouns. Matches like the last one shown above indicate that we need to further refine the concept myHost: we are getting matches such as URL and Nespresso, and we want to narrow the results down to people’s names.
As I mentioned above, we can further refine the previous Example #5 and find better matches for good hosts, by defining the concept wonderfulHost as
Notice the addition of :PN for proper noun, :V for verb, and :A for adjective.
The C_CONCEPT rule type is used to extract information that occurs in a specific context or pattern. The text to extract is enclosed in “_c{ }”. As with the rule type CONCEPT, one can use morphological expansion operators and POS tags.
This example doesn’t use the hospitality dataset but illustrates this rule type very well.
Assume these are concepts already defined:
By defining the concept officialAndState as
C_CONCEPT: governmentOfficial _c{_cap _cap} :Prep stateUS
A match for this concept will be this phrase:
Governor Roy Cooper of North Carolina has proclaimed May 7-11, 2018 as National Teacher Appreciation Week
In this rule we use _cap, which represents any capitalized word; to extract two capitalized words we use _c{_cap _cap}. The string extracted is “Roy Cooper”. Also, notice that we use the POS tag :Prep.
However, the following text would not produce a match because the word “and” is not a preposition: “Senators Phineas Craymoor and Garrett Garcia from North Carolina pushed the bill through.”
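The behavior of the officialAndState rule can be approximated in plain Python to show which part of the text must match and which part is extracted. This is only a sketch under stated assumptions: the word lists standing in for governmentOfficial, stateUS and the :Prep tag are hypothetical, and a regular expression is a rough substitute for the concept lookups and part-of-speech tagging that LITI relies on.

```python
import re

# Hypothetical stand-ins for the governmentOfficial and stateUS concepts and
# for the :Prep POS tag. Real LITI uses concept definitions and a tagger.
GOVERNMENT_OFFICIAL = ["Governor", "Senator", "Senators"]
STATE_US = ["North Carolina", "Virginia"]
PREPOSITIONS = ["of", "from", "in"]

def alt(words):
    """Build a regex alternation from a list of literal strings."""
    return "|".join(re.escape(w) for w in words)

CAP = r"[A-Z][a-z]+"  # rough stand-in for _cap: one capitalized word

PATTERN = re.compile(
    rf"\b(?:{alt(GOVERNMENT_OFFICIAL)})\s+"   # context: an official title
    rf"(?P<extracted>{CAP}\s+{CAP})\s+"       # plays the role of _c{{_cap _cap}}
    rf"(?:{alt(PREPOSITIONS)})\s+"            # context: a preposition
    rf"(?:{alt(STATE_US)})\b"                 # context: a US state
)

def official_and_state(text):
    """Return the two capitalized words if the full pattern matches, else None."""
    m = PATTERN.search(text)
    return m.group("extracted") if m else None

match = official_and_state(
    "Governor Roy Cooper of North Carolina has proclaimed May 7-11, 2018 "
    "as National Teacher Appreciation Week"
)
no_match = official_and_state(
    "Senators Phineas Craymoor and Garrett Garcia from North Carolina "
    "pushed the bill through."
)
print(match, no_match)
```

The whole pattern has to match, but only the named group is returned, which is the same split between context and extracted text that _c{ } expresses in the rule.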
The CONCEPT_RULE rule type is more powerful than the previous ones, and therefore uses more computational resources. It uses Boolean and proximity operators to determine matches.
The most frequently used Boolean and proximity operators are AND, OR, NOT, SENT (sentence) and DIST (distance). Here are useful examples of how to use them.
NOT is used to find matches if the argument does not occur in the whole document. NOT must be used with the AND operator. Let’s define the concept trainStationInWalkingDistance as
SENT_n matches if all arguments occur within n sentences of each other, regardless of their order.
DIST_n matches if all arguments occur within n (or fewer) tokens of each other, regardless of their order.
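The logic behind the distance operator and the AND/NOT combination can be sketched in plain Python. This is an illustration of the idea, not LITI and not VTA's implementation: the tokenizer is deliberately crude, only single-word arguments are handled, and distance is assumed to be the difference between token positions, which may not match exactly how VTA counts tokens. All function names are hypothetical.

```python
import re

def tokenize(text):
    # Crude tokenizer: lowercase word-like chunks. VTA's tokenization differs.
    return re.findall(r"[a-z0-9']+", text.lower())

def within_distance(tokens, terms, n):
    """True if all terms occur, in any order, inside a window where the first
    and last matching tokens are at most n positions apart (assumed counting)."""
    wanted = set(terms)
    positions = [(i, t) for i, t in enumerate(tokens) if t in wanted]
    for start, (i, _) in enumerate(positions):
        seen = set()
        for j, t in positions[start:]:
            if j - i > n:
                break
            seen.add(t)
            if seen == wanted:
                return True
    return False

def and_not(tokens, required, excluded):
    """AND over the required terms, combined with NOT over the whole document."""
    return all(t in tokens for t in required) and not any(t in tokens for t in excluded)

tokens = tokenize("The train station is a short walk from the hotel.")
close_enough = within_distance(tokens, {"station", "walk"}, 4)
too_strict = within_distance(tokens, {"station", "walk"}, 3)
no_taxi = and_not(tokens, ["station", "walk"], ["taxi"])
print(close_enough, too_strict, no_taxi)
```

Note that the NOT check looks at every token in the document, while the distance check only looks at a local window; that difference in scope is what makes the two operators useful together.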
In Example #4 above, I used the concept ConvenientLocation, which is defined as
This last example shows how important it is to catch misspellings, and how challenging and interesting Text Analytics can be!
As I mentioned in my previous post “Analysis of Movie Reviews using Visual Text Analytics”, the discovery process in Text Analytics projects, as in all analytical projects, requires several iterations in which the insights found in one iteration are used in the next. With regard to the linguistic rules, one must determine whether the new rules are an improvement over the ones used in previous iterations by finding how many true positives and false positives the new rules match. This process should be repeated until one obtains the desired precision.
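As a small worked example of that comparison, precision can be computed as true positives divided by all matches (true positives plus false positives). The sketch below uses made-up counts purely for illustration; they are not results from the hospitality data set.

```python
def precision(true_positives, false_positives):
    """Share of a rule's matches that are correct: TP / (TP + FP)."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("the rule produced no matches")
    return true_positives / total

# Hypothetical counts from manually reviewing matches in two iterations.
old_rule = precision(true_positives=60, false_positives=40)
new_rule = precision(true_positives=72, false_positives=18)
improved = new_rule > old_rule
print(old_rule, new_rule, improved)
```

Precision alone does not show whether a stricter rule is now missing good matches, so it helps to keep an eye on the raw number of true positives from one iteration to the next as well.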
My next article will cover the rule types for Facts. Be sure to check it out!