Detailing your Data in SAS Studio Part 1: Identifying the Issues

2 Likes

Big data sometimes leads to big messes. What can you do when qualitative values in your data are inconsistently formatted, incorrectly parsed, or needlessly repeated? Furthermore, how can you identify these issues to begin with?

The Clean Data step is a multi-faceted point-and-click tool for data analysis and data cleaning in SAS Studio. In this post, I’ll demonstrate how to profile your data with the analysis features. If you’re interested in exploring data quality features in SAS Viya or you’re looking for an alternative to SAS Data Studio, click Read more to follow along!

The Clean Data Step and the QKB

The SAS Quality Knowledge Base (QKB) is a collection of files that store rules, logic, and reference data which are used to perform data quality operations. The Clean Data step is one of several flow steps that enables users to take advantage of the QKB.

This step enables five operation types: standardization, casing, gender analysis, identification analysis, and pattern analysis. In this post, I’ll use the three analysis operations to profile my sample data. For reference, all five operations can be performed up to 10 times each in one Clean Data node. Visit the documentation for more information on step capabilities.

Scenario

In this series, I’ll be using CONTACTS, which is a sample table in the SASDQREF library. CONTACTS contains contact information like name, company, address, phone, and more.

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

Observing the first few rows reveals that the data is imperfect. What issues do you notice? Some problems include:

Casing is inconsistent in several columns
Company names are written with many variations
State values switch between full names and abbreviations

We'll use the various analysis definitions to get a clear picture on the quality of this data.

Pattern Analysis

Pattern analysis identifies patterns in character strings by substituting text with pattern symbols. Patterns can be identified either by every character or every word (where "word" means grouping of the same type of character) in the string. For example, a product code BCD-1234 would have the character pattern AAA*9999 and the word pattern A*9.

This can be useful to quickly profile columns where we’d expect values to follow a distinct pattern, like phone numbers. Phone numbers in the United States are ten digits long and commonly stored as two sets of three digits and one set of four digits, with each set being separated by a symbol. If the column values are inconsistently formatted, we can identify the most common pattern to use for standardization.

For more information on pattern analysis, visit the QKB documentation.

I’ll analyze patterns in STATE and PHONE from CONTACTS.

STATE appears to be lacking a standard – some state names are fully spelled out, while others appear in two-letter abbreviations. PHONE appears to follow the expected standard, but I’d like to check all values in the column.

In my flow, I’ve connected a Clean Data step to CONTACTS, selected the ENUSA locale, and enabled pattern analysis. I’ll analyze STATE using the Character definition, so I can identify the distinct abbreviation formats (or lack thereof). Then, I’ll analyze PHONE with the Word definition, so I can see if the values appear as three groups of numbers separated by two symbols.

Note that the default output column names will be InputColumnName_PATTERN.

I’ll also add a Characterize Data step to quickly summarize the output columns. To learn about the Characterize Data step, visit the documentation.

The output shows that 96% of STATE values appear as two-letter abbreviations, but the remaining values vary greatly in representation.

PHONE values appear much more consistent – all but one value fits the expected standard.

Identification Analysis

Identification analysis is used to identify or validate the semantic type of information stored in a string. For example, the string 10/18/2007 could be identified as a DATE, or it could be recognized as a VALID date in MDY format.

For more information on identification analysis, visit the QKB documentation.

I’ll analyze CONTACT column values. Currently, I'm expecting all values to represent individual people but there are visible inconsistencies in the format of names, which makes me want to validate this column. Additionally, some records may list companies, department names, or job titles as contacts.

I’ll update the flow to add identification analysis on the CONTACT column, using the Field Content definition.

After updating the Characterize Data step to include the new output variable, we see that most names were identified as INDIVIDUAL, but some received other output.

I’ll investigate further by adding a Query node and filtering to retrieve all rows where CONTACT_ID was NOT equal to INDIVIDUAL.

The results show that these CONTACT values contained elements like uncommon names, last names that sound like road names, or non-letter symbols. These discrepancies could be corrected by customizing the QKB to account for uncommon names and processing this column to remove symbols.

Gender Analysis

Now that we’ve determined that all CONTACT values appear to represent individuals, we can perform gender analysis on this column. Gender analysis uses algorithms to identify an individual’s gender based on their full name. These algorithms consider historically gendered names, titles, and suffixes. For example, Mrs. is a common female title, while Jr. is a common male suffix.

For more information on gender analysis, visit the QKB documentation.

I’ll update my flow to add gender analysis on the CONTACT column. Notice that there’s only one valid definition for gender analysis – Name.

After adding CONTACT_GNDR to the Characterize Data step, the results show that the percentage of F and M results were close. However, roughly 20% of the names were listed as U for unknown. Like I did previously, I’ll run a query to filter only for the unknown output.

The results simply show that many of these names did not have enough information to identify gender. Most CONTACT values with unknown gender use a first initial instead of a full first name. Others use names which are not recognized as gendered by the QKB.

Summary

In this post, I’ve demonstrated part of the power of the Clean Data step and the QKB by using analysis definitions to profile the CONTACTS data set. In the next part, I’ll discuss how to use the rest of this step to fix some of these issues we’ve identified by standardizing and casing. Interested in more content on data quality? Check out these posts on the QKB:

Which analysis definitions would you find the most useful for your data preparation process? Share your thoughts, questions, and feedback below!

Find more articles from SAS Global Enablement and Learning here.

SAS Communities Library