Using REGEX Rules to Extract Recurring Patterns of Information from Textual Data


The purpose of this post is to illustrate how the REGEX LITI rule can be used to extract recurring patterns of information from textual data using SAS Visual Text Analytics (VTA) through the Model Studio interface. This post assumes readers have a basic familiarity with VTA software running in Model Studio.

The Concepts node is a powerful node in a VTA pipeline whose purpose is to extract information from unstructured, textual data. The node has 9 predefined concepts: nlpDate, nlpMeasure, nlpMoney, nlpNounGroup, nlpOrganization, nlpPercent, nlpPerson, nlpPlace, and nlpTime. These predefined concepts use natural language processing to extract different types of common information from a document collection. But what if these predefined concepts do not extract the exact information an analyst needs? Well, in addition to the predefined concepts, an analyst can also write custom concepts using the LITI (Language Interpretation for Textual Information) language. The LITI language for extracting information from a document collection is powerful, comprehensive, and contains far too many rules for a single post. In fact, entire books have been written on the LITI language¹.

This post will feature the REGEX rule. REGEX, short for Regular Expressions, is a type of coding rule used to extract information in the form of recurring patterns. The term “Regular Expressions” first appeared in a 1951 publication titled Representation of Events in Nerve Nets and Finite Automata, written by mathematician Stephen Kleene. Perl Regular Expressions are a well-known variation that has been used for decades²,³ and was even incorporated into SAS 9 software circa 2000.

REGEX rules are used to extract recurring patterns in textual data such as telephone numbers, email addresses, birthdates, social security numbers, serial numbers, and part numbers, just to name a few. This post will not teach you how to write REGEX rules from the ground up. Rather, it will provide examples and insights that will help you get started in using them within SAS VTA software. All examples provided in this post are performed within the Concepts node of a VTA pipeline.

Social Security Numbers:

Let’s get started with some basic functionality and operators of REGEX rules in the context of a simple example. Through the example I’ll provide some useful and common operators (also called markers, special characters, or metacharacters) and explain what they do. As stated above, it may be necessary to read several textbooks⁴ and take specific training classes⁵ to become proficient with writing Regular Expressions and other LITI rules. In this example, we want to extract American social security numbers from a document collection. We’ll define social security numbers such that they have 9 total digits in a specific pattern: three digits followed by a dash (-), followed by two digits and a dash (-), and ending in four digits. We want the REGEX rule to return only social security numbers, and not other numbers such as 10-digit phone numbers (which also may contain dashes) or dates.

Here’s the REGEX code written for a custom concept within the Concepts node in Model Studio:


In this code each \d marker returns a single digit, 0-9, and each \- marker returns a single dash (-). Writing three \d markers in a row returns exactly three digits, each 0-9, writing two in a row returns exactly two digits, and writing four in a row returns exactly four digits. We can test this rule on the following sentence, which contains a fake social security number as well as a fake phone number: “This rule will match a social security like 123-45-6789 but not a phone number like 123-456-7890”. We want the rule to return a match on the social security number only and not the telephone number. Model Studio also has a Test Sample Text feature which allows the analyst to test rules on sample documents, which may be as simple as a sentence or just a few words. Using the Test Sample Text feature rather than running the rule against the entire document collection can help with debugging and fine-tuning rules and can also save time, for example, when document collections are very large. Here’s the sample sentence in the Test Sample Text window of the Concepts node after the rule has been applied to it:
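The rule itself was shown as a screenshot. Going by the description above, the pattern is presumably \d\d\d\-\d\d\-\d\d\d\d (in LITI, written after a REGEX: prefix). As a rough stand-in for the Test Sample Text window, the same pattern can be tried with Python's re module. This is only a sketch: Python is an assumption made for illustration, and the LITI engine may differ in details.

```python
import re

# Pattern reconstructed from the description: three digits, a dash,
# two digits, a dash, and four digits.
SSN_PATTERN = re.compile(r"\d\d\d\-\d\d\-\d\d\d\d")

# The sample sentence quoted in the post.
sample = ("This rule will match a social security like 123-45-6789 "
          "but not a phone number like 123-456-7890")

# findall returns every non-overlapping match as a string.
ssn_matches = SSN_PATTERN.findall(sample)
print(ssn_matches)  # ['123-45-6789']
```

The phone number is skipped because, after any three digits and a dash, the pattern demands exactly two digits followed by another dash, and 123-456-7890 never satisfies that.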

Returned matches in Model Studio are colored in blue. The rule is verified to work in that it returns a match only on the (fake) social security number and not the (fake) phone number. REGEX rules often have multiple markers that can be used to extract the same information. For extracting single numerical digits, the marker [0-9] could also be used in place of \d. So, an alternate way to write the rule for social security numbers is:
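The alternate rule was also shown as a screenshot. From the description, it presumably replaces each \d with [0-9]. A minimal Python sketch (Python's re module is used here purely as an illustrative stand-in for the LITI engine) can check that both spellings return the same match on the sample sentence:

```python
import re

# The \d form and the [0-9] form of the social security number pattern.
SSN_BACKSLASH_D = re.compile(r"\d\d\d\-\d\d\-\d\d\d\d")
SSN_BRACKETS = re.compile(
    r"[0-9][0-9][0-9]\-[0-9][0-9]\-[0-9][0-9][0-9][0-9]"
)

sample = ("This rule will match a social security like 123-45-6789 "
          "but not a phone number like 123-456-7890")

d_matches = SSN_BACKSLASH_D.findall(sample)
bracket_matches = SSN_BRACKETS.findall(sample)

# Both forms should agree on ordinary ASCII digits.
print(d_matches == bracket_matches)  # True
```

One caveat that applies to Python specifically: for text strings, \d also matches non-ASCII decimal digits, while [0-9] matches only the ten ASCII digits. Whether the LITI engine draws the same distinction is not something this sketch assumes.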

Part Numbers:

In this example, let’s suppose the information to extract is part numbers from a document collection consisting of written descriptions of parts and their uses in a manufacturing scenario. The part numbers always consist of two groups of digits separated by either a decimal point (.) or a comma (,), with a varying number of digits before and after the separator. This example will require a rule that’s more flexible than the rule used for social security numbers. One example of a part number is 234242.12342143. Here is a hypothetical example of a document, in this case a single sentence, in the data to be analyzed: “The required ball bearing type to be used within the axle has part number 234242.12342143.”

Here’s the REGEX code written within the Concepts node in Model Studio:

This is what each operator does:

\d matches any single digit (0-9)
+ matches the previous character one or more times (the previous character is a digit)
[\,\.] matches either a comma (,) or a decimal (.)
\d matches any single digit (0-9)
+ matches the previous character one or more times (the previous character is a digit)

So, when the sample text is tested, the match is on the part number including all digits before and after the decimal:

Even if the part number is written with a comma, the part number is still matched:

And of course, the rule would still match part numbers that have a single digit before and after the decimal.
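The three cases above (decimal point, comma, and single digits on either side) can be checked with a short Python sketch. The pattern \d+[\,\.]\d+ is assembled from the operator list given earlier, since the rule itself was shown as a screenshot. Python's re module and the third sample sentence are assumptions made for illustration.

```python
import re

# One or more digits, a comma or period, then one or more digits.
PART_PATTERN = re.compile(r"\d+[\,\.]\d+")

samples = [
    # Sentence from the post, with a decimal point.
    "The required ball bearing type to be used within the axle "
    "has part number 234242.12342143.",
    # Same sentence with a comma as the separator.
    "The required ball bearing type to be used within the axle "
    "has part number 234242,12342143.",
    # Hypothetical sentence with a single digit on each side.
    "The washer has part number 2.5.",
]

part_matches = [PART_PATTERN.findall(text) for text in samples]
for found in part_matches:
    print(found)
# ['234242.12342143']
# ['234242,12342143']
# ['2.5']
```

Note that the period ending each sentence is not included in the match, because the pattern requires at least one digit after the separator.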

To add a wrinkle to the example, suppose that the part number may or may not have a decimal point or comma. Thus, we may want to extract a part number that is simply “37”. This can easily be handled by adding a question mark (?) to the rule. The question mark is a marker that makes the preceding element optional. In this case, the adjusted REGEX rule would be:

The question mark makes the previous character, here a decimal or comma, optional. This adjusted rule returns two matches in the following sample text:
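The adjusted rule and its sample text were shown as screenshots. Assuming the question mark is placed directly after the bracketed separator, giving \d+[\,\.]?\d+, the behavior can be sketched in Python as follows. The sample sentences here are hypothetical and are not the ones from the original post.

```python
import re

# The separator is now optional: digits, an optional comma or period, digits.
OPTIONAL_SEP_PATTERN = re.compile(r"\d+[\,\.]?\d+")

# Hypothetical sample text with one plain and one decimal part number.
sample = "Install part 37 next to part 234242.12342143."
optional_matches = OPTIONAL_SEP_PATTERN.findall(sample)
print(optional_matches)  # ['37', '234242.12342143']

# A lone digit is not matched, because each \d+ still needs one digit.
single_digit_matches = OPTIONAL_SEP_PATTERN.findall("Use part 7 here.")
print(single_digit_matches)  # []
```

Under this assumed form of the rule, a part number needs at least two digits in total: "37" matches because one digit can satisfy each \d+, but a one-digit part number such as "7" would be missed.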

Email Addresses:

The final example will require the most flexible rule yet. Suppose we want to extract email addresses from a document collection. Email addresses can contain a variety of punctuation marks, for example periods or underscores, can end in different top-level domains, such as .com, .gov, or .edu, and always contain the @ symbol. There is no standard length for email addresses. Here’s a rule that would match most common email addresses:

Here is how the rule works:

[_a-zA-Z\d\_\-.] matches any lower-case letter (a-z), upper case letter (A-Z), digit (0-9), underscore (_), dash (-), or period (.)
+ matches the previous character one or more times (the previous character is defined above)
@ matches an @ symbol
[_a-zA-Z\d\-] matches any underscore (_), lower-case letter (a-z), upper case letter (A-Z), digit (0-9), or dash (-)
+ matches the previous character one or more times (the previous character is defined above)
\. matches a period (.)
[_a-zA-Z\d\-] matches any underscore (_), lower-case letter (a-z), upper case letter (A-Z), digit (0-9), or dash (-)
+ matches the previous character one or more times (the previous character is defined above)
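Putting the pieces listed above together gives the pattern [_a-zA-Z\d\_\-.]+@[_a-zA-Z\d\-]+\.[_a-zA-Z\d\-]+ (the rule itself was shown as a screenshot, so this is a reconstruction). A minimal Python sketch, with a hypothetical sample sentence rather than the one from the original post, shows how it behaves:

```python
import re

# Local part, an @ symbol, a domain name, a period, and a final domain label.
EMAIL_PATTERN = re.compile(
    r"[_a-zA-Z\d\_\-.]+@[_a-zA-Z\d\-]+\.[_a-zA-Z\d\-]+"
)

# Hypothetical sample text containing three made-up addresses.
sample = ("Write to jane_doe@example.com, j.smith@agency.gov, "
          "or bob-3@school.edu for details.")

email_matches = EMAIL_PATTERN.findall(sample)
print(email_matches)
# ['jane_doe@example.com', 'j.smith@agency.gov', 'bob-3@school.edu']

# Only one period is allowed after the @, so longer domains are cut short.
truncated = EMAIL_PATTERN.findall("a@mail.example.co.uk")
print(truncated)  # ['a@mail.example']
```

The second call illustrates why the rule matches "most common" addresses rather than all of them: because the characters after the @ exclude the period, an address with more than one period in its domain is only partially matched.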

In the following sample document, three email addresses are returned:

The LITI language is used to write custom concepts in the Concepts node using SAS VTA in Model Studio. It is a powerful natural language processing coding language for extracting information from unstructured, textual data. The REGEX rule is only one small part of the LITI language, and it is specifically used to extract recurring patterns. This post is far from a complete guide to the REGEX LITI rule, but it provides and explains some basic examples as a starting point for understanding and using the rule. For more examples using the REGEX rule and a more complete list of markers (a.k.a. special characters or metacharacters), see the VTA product documentation⁶.

References

(1.) Jade, Teresa, Biljana Belamaric Wilsey, and Michael Wallis. 2019. SAS® Text Analytics for Business Applications: Concept Rules for Information Extraction Models. Cary, NC: SAS Institute Inc.

(2.) Schwartz, Randal L. 1993. Learning Perl. Sebastopol, CA: O’Reilly and Associates, Inc.

(3.) Quigley, Ellie. 1998. PERL by Example. Upper Saddle River, New Jersey: Prentice Hall PTR.

(4.) Windham, K. Matthew. 2014. Introduction to Regular Expressions in SAS®. Cary, NC: SAS Institute Inc.

(5.) SAS Course: SAS Visual Text Analytics in SAS Viya.

(6.) SAS Documentation: Using Regular Expressions (Regex).

Find more articles from SAS Global Enablement and Learning here.
