Tips and Tricks for Power Users of SAS® Visual Text Analytics: Part 1 of 3 (Structuring Concepts)

Structuring Concepts in SAS® Visual Text Analytics (VTA)

SAS® Visual Text Analytics contains a trove of features that empower the user to explore and understand unstructured text by categorizing documents, extracting information, and engineering relevant features that serve downstream predictive models. Such breadth raises the question of how to begin. To help answer that question, this series of articles reviews suggestions for organizing, reviewing, and managing a VTA model, gleaned from years of experience building projects of varying complexity, including one that processes millions of pages per day. For a more detailed guide to using the product, check out the excellent book by Teresa Jade et al., SAS Text Analytics for Business Applications: Concept Rules for Information Extraction Models. This first article of a three-part series focuses on how to approach structuring an information extraction (IE) model using the VTA Concepts node.

Structuring Components of a VTA Model

When different people work on multiple projects, it’s important to follow shared standards for taxonomy structure, custom concept names, and concept definition rules (CDRs). The strategies described here have been successfully implemented on multiple projects but are meant to serve only as a guideline.

Text Casing and Naming Strategy (Custom Concepts, CDR Components)

Write concept names in all caps to make them easier to recognize when they appear in a complex CDR. Conversely, write any non-concept strings included in a rule in lowercase.

If you’re new to VTA, ensure that no concept name is a recognized word that might appear in a document you’re analyzing. For example, don’t name a custom concept “BASEBALL” or VTA will accidentally match the word “baseball” when it appears in a sample of text. Also, create concept names that clearly communicate how the concept will be used (e.g., COMPANY_NAME). This makes it easier for other analysts to intuit each concept’s purpose.
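The naming advice above can be checked mechanically before concepts are added to a project. The following is a minimal Python sketch, not part of VTA: the function name, the pattern, and the small vocabulary set are assumptions made for illustration, and in practice the vocabulary would come from your own documents.

```python
import re

# Assumed convention: capital letters and digits, with underscores between words.
NAME_PATTERN = re.compile(r"^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$")


def check_concept_name(name, vocabulary):
    """Return a list of problems found with a proposed custom concept name.

    `vocabulary` is a set of lowercase words known to occur in the documents.
    """
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append("use all caps, with underscores between words")
    if name.lower() in vocabulary:
        problems.append("name is also a word in the documents")
    return problems


# Hypothetical vocabulary drawn from a sample of documents.
vocabulary = {"baseball", "company", "name"}

for candidate in ["BASEBALL", "COMPANY_NAME", "companyName"]:
    print(candidate, check_concept_name(candidate, vocabulary))
```

An empty list means the name passes both checks. A name such as BASEBALL is flagged because its lowercase form occurs in the vocabulary.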

The VTA user interface (UI) allows users to nest custom concepts and create useful hierarchies. One approach is to create three groups of custom concepts: helper, target, and disambiguation. These organize concepts by functionality and make it easier to interpret the CDRs they contain.

Helper Custom Concepts

These concepts include lexicons made of CLASSIFIER CDRs and/or other rule types designed to be building blocks for more complex rules found in target concepts. It’s easiest to prefix these concept names with a letter like “H”. For example, the H_COLORS concept would contain a list of rules like CLASSIFIER:red, CLASSIFIER:green, CLASSIFIER:blue. Adding a prefix differentiates the concept name from a word like “colors” and assigns the concept to one of the logical groupings described above.

Figure 1: Concepts are organized into hierarchical groups based on their intended use. The H_COLORS concept is an example of a helper concept containing a lexicon of CLASSIFIER rules.
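When a helper lexicon starts as a word list in a spreadsheet or text file, the CLASSIFIER rules can be generated rather than typed by hand. This is a minimal Python sketch under that assumption. The function name is hypothetical, and the output follows the CLASSIFIER:red form shown above.

```python
def classifier_rules(terms):
    """Build one CLASSIFIER rule per unique term, lowercased and sorted."""
    unique_terms = sorted({term.strip().lower() for term in terms if term.strip()})
    return [f"CLASSIFIER:{term}" for term in unique_terms]


# A raw word list with inconsistent casing, stray whitespace, and a duplicate.
colors = ["Red", "green", "blue", "red "]

for rule in classifier_rules(colors):
    print(rule)
```

The resulting lines can be pasted into a helper concept such as H_COLORS. Lowercasing follows the casing convention described earlier. Skip that step if your terms must keep their original case.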

Target Custom Concepts

These concepts contain rules that target strings you’re trying to extract contingent on specific contextual cues. Contextual cues are anchored by helper concepts or other strings added to a CDR. For example, as shown in Figure 2, the target concept T_COLOR extracts blue as the color of the product’s X32 model.

Figure 2: The target concept T_COLOR and match results for a text string in the Test Sample Text window.

Disambiguation Custom Concepts

Like the helper concepts, disambiguation concepts are a type of supporting concept. When needed, you can set the behavior of any of these concepts to “Supporting” (see Figure 3) so VTA suppresses any results from that concept in the scored output. To make model development and debugging easier, only make this change when the model is complete and ready to be deployed. If you added canonical information to CLASSIFIER rules (e.g., CLASSIFIER: azul, blue) in a concept, or use matches for a concept in a post-processing step, don’t change its behavior to “Supporting”.

Figure 3: Right click on the concept name to set the concept behavior to “Supporting” and suppress its results in the scored output of your VTA model.

As the name implies, disambiguation concepts are used for contextual, or word-sense, disambiguation. They improve the precision of an IE model by preventing extraction of inappropriate text strings. Disambiguation concepts target the same strings as either helper or target concepts, but only when they appear in an irrelevant context. Preventing a match for a helper concept used in CDRs for a target concept is an efficient way to eliminate false positive matches in a target concept, but often the approach is too specific and fails in many instances.

Figure 4: The disambiguation concept D_COLORS is used to prevent a false positive match for the helper concept H_COLORS when one of its CDRs matches text in a sentence that would also match a rule in the T_COLOR concept.

Figure 5: The disambiguation concept D_COLOR matches the same string as T_COLOR, but only when the X32 model is not available.

Use a REMOVE_ITEM rule to leverage these disambiguation concepts in appropriate helper or target concepts. Although this type of CDR can be included at any point in the concept, it’s best to put REMOVE_ITEM rules at the top of the list of CDRs because there are typically fewer of them. This makes it easier for someone reviewing, debugging, or editing a model to quickly recognize that a REMOVE_ITEM rule exists within the concept.

Figure 6: Place REMOVE_ITEM rules first in the list of CDRs to make them easier to find. In this example the REMOVE_ITEM rule leverages the disambiguation concept D_COLOR to prevent a false positive match for T_COLOR.
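If a concept’s rules are kept in a text file outside the UI, the ordering convention can be applied with a stable sort. This is a minimal Python sketch. The function name is hypothetical, and the two rule strings are illustrative stand-ins, not the rules shown in the figures.

```python
def remove_item_first(rules):
    """Return the rules with REMOVE_ITEM rules moved to the top.

    sorted() is stable, so the relative order within each group is kept.
    """
    return sorted(rules, key=lambda rule: not rule.lstrip().startswith("REMOVE_ITEM:"))


# Illustrative rule text only; substitute the CDRs from your own concept.
rules = [
    'CONCEPT_RULE:(SENT, "_c{H_COLORS}", "X32")',
    'REMOVE_ITEM:(ALIGNED, "_c{T_COLOR}", "D_COLOR")',
]

for rule in remove_item_first(rules):
    print(rule)
```

This sketch treats every line as a rule, so a comment line written above a REMOVE_ITEM rule would not move with it. Group comments with their rules before sorting if your concepts use them.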

Use Concept Headers and Comments

Computer programs usually contain a header that imparts key information to anyone who must understand, use, or debug the program. Concepts written for a VTA model use a form of coded language (i.e., LITI syntax, where LITI stands for language interpretation for textual information) that also requires explanation. Whether it is recorded in the header or in comments, thorough documentation prevents misinterpretation and makes future issues easier to resolve.

At a minimum, the header for each concept should contain the concept’s name, the date it was created, and its original author. Other useful information includes its purpose or what type of strings it’s targeting, which concept(s) it supports, and any relevant copyright language.

Figure 7: Adding information to a concept header shares important details with analysts who use or edit the VTA model.
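A consistent header is easier to maintain when it is produced from a template. This is a minimal Python sketch that assumes comment lines begin with #. The field labels, function name, author, and date are placeholders chosen for illustration, not a prescribed format.

```python
from datetime import date


def concept_header(name, author, created, purpose="", supports=()):
    """Return a comment header holding the minimum fields plus optional ones."""
    lines = [
        f"# Concept: {name}",
        f"# Created: {created.isoformat()}",
        f"# Author:  {author}",
    ]
    if purpose:
        lines.append(f"# Purpose: {purpose}")
    if supports:
        lines.append(f"# Supports: {', '.join(supports)}")
    return "\n".join(lines)


# Placeholder values for illustration only.
print(concept_header(
    "H_COLORS",
    "A. Analyst",
    date(2024, 1, 15),
    purpose="Lexicon of color terms",
    supports=["T_COLOR"],
))
```

Paste the generated text at the top of the concept, then add any copyright language your organization requires.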

Concept definition rules can be difficult to interpret without context. Therefore, it’s useful to provide examples of the type of string(s) a rule is meant to target. This provides a key to unlock the code and an example for unit or regression testing.

When new CDRs are added to a mature VTA project, it’s useful to include a comment recording the date and the author’s name. Should this work introduce a breaking change, the date provides a recovery point for version control and the name identifies a contact who can explain the intended effect of the change.

Figure 8: Add examples in the comments and include the date and author of any changes to a mature VTA project.
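Examples recorded in comments can double as test inputs if they follow a predictable pattern. This is a minimal Python sketch that assumes a hypothetical convention in which each example sits on its own comment line beginning with "# Example:". The marker, the function name, and the sample concept text are all illustrative.

```python
def collect_examples(concept_text, marker="# Example:"):
    """Return the example strings recorded in comment lines of a concept."""
    examples = []
    for line in concept_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(marker):
            examples.append(stripped[len(marker):].strip())
    return examples


# Illustrative concept text; the rule, date, and author are placeholders.
concept_text = """\
# Example: The X32 model is available in blue.
CONCEPT_RULE:(SENT, "_c{H_COLORS}", "X32")
# Added 2024-01-15 by A. Analyst
"""

for example in collect_examples(concept_text):
    print(example)
```

The collected strings can then be scored against the model after each change to confirm that the rules still match what they were written to target.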

Conclusion: Part 1

In this article I’ve shared information I hope will help you organize the structure of future Visual Text Analytics information extraction projects. This is by no means the only – or even best – way to approach their development, but it has served to organize and maintain many projects over the years. Please use these suggestions as you see fit and share any comments or suggestions you’ve found useful in your Visual Text Analytics practice. In the meantime, view Part 2 of this series focused on manipulating the content of SAS VTA projects and Part 3 of this series focused on tracking changes to a SAS VTA project.