
Detailing your Data in SAS Studio Part 3: Parsing the Pertinent Information


Data almost never arrives in the format you need. As a result, data preparation often requires splitting large text values into distinct variables that are more meaningful for analysis. Whether you want to parse an address into its parts or extract a phone number from a chunk of contact information, the Parse Data step is here to help!

 

In this post, I'll demonstrate how to use the point-and-click Parse Data step to parse and extract pertinent information to make the most of your data. Click Read more to continue our data quality journey with SAS Studio Flows!

 

The Parse Data Step

 

Much like the Clean Data step (the subject of my last post), the Parse Data step enables common data quality operations based on the SAS Quality Knowledge Base (QKB). In particular, this step supports parsing and extraction, which are used to retrieve specific portions of information (called tokens) from text strings. Both operations can be performed up to 10 times each in one node. Visit the documentation for more information on step capabilities.

 

The Parse Data step (like the other data quality steps) is available with the SAS Studio Engineer license.

 

Scenario

 

In this series, I’ll be using CONTACTS, which is a sample table in the SASDQREF library. CONTACTS contains contact information like name, company, address, phone, and more.

 

01_grbarn_parse_1.png


 

Note: In my last post, I used the Clean Data step to standardize and case several variables in CONTACTS. For this post, I'll be starting with the raw CONTACTS table again to ensure that the following demonstration is easily replicable. 

 

Parsing

 

Parse definitions separate text data strings into tokens, which are the smallest meaningful portions of a full value. For example, a full address contains tokens like the Recipient, Building, Street Address, City, State/Province, Postal Code, and more.

 

02_grbarn_parse_2.png

 

Parse definitions are available for several semantic types and variations of data. The ENUSA locale has parse definitions for multiple types of addresses, names, dates, and more, while also handling data like dimensions and websites.

 

For more information on parsing, visit the QKB documentation.

 

I’ll parse the CONTACT variable from SASDQREF.CONTACTS. Most contacts appear to include a first and last name, and a few include additional elements like middle initials.

 

03_grbarn_parse_3.png

 

In my flow, I’ve connected a Parse Data node to CONTACTS. As with the Clean Data step, you must select a QKB locale and enable the desired operation before proceeding. I’ll use ENUSA (the default locale for my environment) and enable parsing.

 

Note that configuration involves an additional step here. I’ll select the CONTACT column and the Name definition, and then select the tokens I want in my output: Family Name, Given Name, Middle Name, and Prefix. Keep in mind that you may have to scroll in the Select tokens menu to see all available tokens. You can also filter to view only your selected or unselected tokens.

 

04_grbarn_parse_4.png

 

An output table column will be generated for each selected token. The default name will be InputColumnName_Token.
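
If you prefer code to point-and-click, the parsing operation can be approximated with the SAS data quality functions in a DATA step. The sketch below is a minimal approximation, not the step's generated code: it assumes a QKB with the ENUSA locale is available to your session, and the output column names simply mirror the InputColumnName_Token default.

/* Minimal sketch: parse CONTACT with the Name definition (ENUSA).   */
/* Definition and token names should match what the step displays.   */
data work.contacts_parsed;
   set sasdqref.contacts;
   length parsed $ 200
          CONTACT_FamilyName CONTACT_GivenName
          CONTACT_MiddleName CONTACT_Prefix $ 64;

   /* dqParse marks the token boundaries in the input string */
   parsed = dqParse(CONTACT, 'Name', 'ENUSA');

   /* dqParseTokenGet pulls one named token from the parsed value */
   CONTACT_FamilyName = dqParseTokenGet(parsed, 'Family Name', 'Name', 'ENUSA');
   CONTACT_GivenName  = dqParseTokenGet(parsed, 'Given Name',  'Name', 'ENUSA');
   CONTACT_MiddleName = dqParseTokenGet(parsed, 'Middle Name', 'Name', 'ENUSA');
   CONTACT_Prefix     = dqParseTokenGet(parsed, 'Prefix',      'Name', 'ENUSA');

   drop parsed;
run;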

 

After running the step, we can compare our original columns to the new token columns.

 

05_grbarn_parse_5-1024x328.png

 

It looks like Family Name, Given Name, and existing Middle Name tokens were parsed correctly. However, the Prefix token column looks empty based on these first few records.

 

You can use additional nodes to examine variables like these. Here, I’ll use the Characterize Data step to see the frequency of Middle Name and Prefix tokens.
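
If you'd rather check the frequencies in code, a quick PROC FREQ does the same job. This sketch assumes the token columns from the parsing sketch above (named per the InputColumnName_Token default).

/* Frequency counts for the Middle Name and Prefix token columns.   */
/* The MISSING option counts blank tokens as a level as well.       */
proc freq data=work.contacts_parsed order=freq;
   tables CONTACT_MiddleName CONTACT_Prefix / missing;
run;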

 

06_grbarn_parse_6.png

 

In the results, I can see that a handful of CONTACT values included prefixes or middle names. Prefixes appeared as different versions of “Mr” and “Ms”, while middle name tokens were mostly initials.

 

07_grbarn_parse_7.png

 

The biggest takeaway is that most records (more than 3,200) did not include these tokens, so we may want to exclude these elements from any future name analysis.

 

Extraction

 

Extraction definitions retrieve specific tokens from text data strings. For example, you could extract a person’s name, phone number, address, or any other valid token from a string that contains contact information. Note that these tokens can be larger than those returned by parse definitions (e.g., you can extract a full name, but not a given name by itself).

 

08_grbarn_parse_8.png

 

Be mindful that an extraction definition can only pull out the existing information in the input string, not correct or enhance it. For example, if my string is missing a street address but includes the city, state, and postal code, extracting the “Address” token will only include the city, state, and postal code. Additionally, if the string includes invalid information in a valid format, like an email with a misspelled domain, that information will be extracted as-is.

 

For more information on extraction, visit the QKB documentation.

 

Extraction definitions are more specialized than other definition types. ENUSA has three definitions: Brand/Manufacturer, Contact Info, and Product Attributes. These definitions are based on semantic value types that are often grouped together. For example, a generic string of contact information will likely include details like Name, Address, Email, Phone, or more. Once these values are extracted, you can parse them into even smaller tokens.

 

To meaningfully demonstrate extraction, I’m going to create a temporary table, CONTACTS_EXTRACT, using the following code. This query puts most of the contact information from SASDQREF.CONTACTS into one string.

 

proc sql;
   /* Combine most of the contact details into one slash-delimited string */
   create table CONTACTS_EXTRACT as
      select ID,
             catx("/", COMPANY, CONTACT,
                  catx(', ', ADDRESS, CITY, STATE), PHONE) as CONTACT_INFO
         from SASDQREF.CONTACTS
         where input(ID, 5.) <= 100   /* keep contacts with ID 100 or less */
         order by input(ID, 5.);
quit;

 

09_grbarn_parse_9.png

 

I’ll add the new data to my flow, connect a new Parse Data node, select the ENUSA locale, and enable extraction.

 

Next, I’ll select the new CONTACT_INFO column and the Contact Info definition, then select my tokens: Address, Name, Organization, and Phone.

 

10_grbarn_parse_10.png

 

As with the parsing operation, the default output column name will be InputColumnName_Token.
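
For completeness, here is a hedged code sketch of the same extraction. It assumes your release includes the dqExtract and dqExtTokenGet data quality functions (verify against the SAS Data Quality documentation for your release) and that the ENUSA locale is loaded.

/* Minimal sketch: extract tokens from CONTACT_INFO with the        */
/* Contact Info definition. Function availability is an assumption. */
data work.contacts_extracted;
   set work.contacts_extract;
   length extracted $ 500
          CONTACT_INFO_Address CONTACT_INFO_Name
          CONTACT_INFO_Organization CONTACT_INFO_Phone $ 128;

   /* dqExtract tags the tokens it finds in the input string */
   extracted = dqExtract(CONTACT_INFO, 'Contact Info', 'ENUSA');

   /* dqExtTokenGet retrieves one named token from the extracted value */
   CONTACT_INFO_Address      = dqExtTokenGet(extracted, 'Address',      'Contact Info', 'ENUSA');
   CONTACT_INFO_Name         = dqExtTokenGet(extracted, 'Name',         'Contact Info', 'ENUSA');
   CONTACT_INFO_Organization = dqExtTokenGet(extracted, 'Organization', 'Contact Info', 'ENUSA');
   CONTACT_INFO_Phone        = dqExtTokenGet(extracted, 'Phone',        'Contact Info', 'ENUSA');

   drop extracted;
run;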

 

Once I run the step, I can see that my extraction results are good, but not perfect. Some details ended up in the wrong spots.

 

11_grbarn_parse_11-1024x278.png

 

Instead of the full name “Mr James Brigs”, only “Mr James” was extracted, and “Brigs” was placed in the Organization token (since it isn’t recognized as a common last name). In another row, an address that shortened “Suite” to “Ste” (without a period to mark the abbreviation) caused the extraction to skip over the city and state information, which landed in the Name token instead.

 

These issues can be fixed in a few ways. When possible, standardize your data prior to extraction, which makes it easier for the QKB to recognize which text belongs to which token. You can also standardize the tokens after extraction, re-parse or re-extract to identify them accurately, and then do some data preparation to put values back in the right place (a sketch of this approach follows below). Finally, you can customize your QKB if you find particular values that it consistently mischaracterizes.
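
As a minimal sketch of the post-extraction approach, dqStandardize can clean up an extracted token before you re-parse it. The 'Address' standardization definition name is an assumption here; check the definitions shipped with your QKB locale.

/* Standardize the extracted address token so abbreviations like    */
/* "Ste" are expanded before re-parsing or re-extracting.           */
data work.contacts_std;
   set work.contacts_extracted;   /* from the extraction sketch above */
   length CONTACT_INFO_Address_STD $ 128;
   CONTACT_INFO_Address_STD = dqStandardize(CONTACT_INFO_Address, 'Address', 'ENUSA');
run;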

 

What if all of the addresses had been extracted correctly from CONTACT_INFO? Then, I could use another Parse Data step to parse the full address tokens.

 

12_grbarn_parse_12.png

 

I’d use the CONTACT_INFO_ADDRESS column as my input, then select the Address (Full) definition. Note that this definition can parse ten individual tokens. Here, I've selected City, State/Province, and Street.

 

13_grbarn_parse_13.png
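
In code, that follow-on parse might look like the sketch below, assuming the definition and token names shown by the step (Address (Full); City, State/Province, Street) match your QKB locale, and reusing the extracted table from the earlier sketch. The state variable name is shortened to stay within SAS's 32-character limit.

/* Minimal sketch: parse the extracted full-address token into      */
/* smaller tokens with the Address (Full) definition.               */
data work.address_tokens;
   set work.contacts_extracted;
   length parsed $ 200
          CONTACT_INFO_Address_City
          CONTACT_INFO_Address_State
          CONTACT_INFO_Address_Street $ 64;

   parsed = dqParse(CONTACT_INFO_Address, 'Address (Full)', 'ENUSA');
   CONTACT_INFO_Address_City   = dqParseTokenGet(parsed, 'City',           'Address (Full)', 'ENUSA');
   CONTACT_INFO_Address_State  = dqParseTokenGet(parsed, 'State/Province', 'Address (Full)', 'ENUSA');
   CONTACT_INFO_Address_Street = dqParseTokenGet(parsed, 'Street',         'Address (Full)', 'ENUSA');

   drop parsed;
run;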

 

In the results, address tokens were parsed correctly despite differences in representation and appearance.

 

14_grbarn_parse_14-1024x342.png

 

Summary

 

In this post, I’ve demonstrated how to parse and extract tokens from your data using the Parse Data step. If you want to learn more about improving your data quality, check out the first two posts in this series on analyzing your data for issues and correcting imperfections with standardization and case definitions. In the next part, I’ll show you how to generate match codes using the Match Codes step (and what to do with them after they're created).

 

Do you often need to extract portions of text data for analysis? Would the Parse Data step replace any parts of your data preparation methodology? Share your thoughts, questions, and feedback below!

 

 

Find more articles from SAS Global Enablement and Learning here.
