Detailing your Data in SAS Studio Part 2: Cleaning up the Clutter
Let's be honest: if you work with data in any capacity, you've found data quality errors before. You've seen a variable that uses several different formats, a row that is fully uppercased when others are proper cased, or a proper noun that is spelled fifteen different ways throughout the data set. Identifying the issues is half of the battle - but what do you do now?
In my last post, I demonstrated how to use the Clean Data step in SAS Studio to analyze your data and quickly uncover data quality issues. This time, I’ll show you how to use the exact same step to effortlessly clean up your data in a few clicks with the casing and standardization operations.
Click Read more to continue our data quality journey with SAS Studio Flows!
The Clean Data Step (Review) and Scenario
Recall that the Clean Data step (available with the SAS Studio Engineer license) enables five operations based on the SAS Quality Knowledge Base (QKB): standardization, casing, gender analysis, identification analysis, and pattern analysis. Visit the documentation for a full description of step capabilities.
Last time, I used the analysis operations to quickly identify issues and summarize data in the CONTACTS table from the SASDQREF library.
The findings include:
- Casing is inconsistent in several columns (COMPANY, CONTACT, ADDRESS, STATE)
- COMPANY names are written with many variations
- STATE values switch between full names and abbreviations
- PHONE values use the same standard pattern except for a single record
- ADDRESS values lack a standard for expressing road names (e.g., Street vs. St)
- CITY lacks a standard for commonly abbreviated prefixes (e.g., South vs. S)
We’ll use the standardization and casing definitions to clean up these issues.
Casing
Case definitions convert text to uppercase, lowercase, or proper case. This may sound like an overly simple set of transformations. However, the QKB also includes specific proper case definitions for various data types, including names, organizations, and addresses, where special capitalization rules may apply to certain values.
For example, proper casing the value BEN MCDONALD would return Ben Mcdonald, but applying the Proper (Name) definition would return Ben McDonald, which is standard for names with a Mc- prefix.
Review some casing examples below:
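To make the comparison concrete, here is a minimal sketch using the dqCase function from the SAS Data Quality function library. It assumes the Data Quality functions are licensed and the ENUSA locale is loaded (for example, via %DQLOAD on SAS 9); the expected outputs in the comments follow the BEN MCDONALD example above.

```sas
/* Sketch only: assumes DQ functions are licensed and the ENUSA locale is loaded */
data _null_;
   length upper lower proper name $ 40;
   upper  = dqCase('ben mcdonald', 'Upper',         'ENUSA');  /* BEN MCDONALD */
   lower  = dqCase('BEN MCDONALD', 'Lower',         'ENUSA');  /* ben mcdonald */
   proper = dqCase('BEN MCDONALD', 'Proper',        'ENUSA');  /* Ben Mcdonald */
   name   = dqCase('BEN MCDONALD', 'Proper (Name)', 'ENUSA');  /* Ben McDonald */
   put upper= lower= proper= name=;
run;
```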
For more information on casing, visit the QKB documentation.
I’ll use the Proper (Name) case definition on CONTACT values. Some names are uppercased, which I want to correct.
Simply connect a Clean Data step to CONTACTS in your flow, select the ENUSA locale, and enable casing. Then, select the CONTACT column and the Proper (Name) case definition. I’ll create a new column to store my results.
The default output column name will be InputColumnName_CASE.
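Outside of the flow, the same transformation can be sketched as a DATA step with dqCase. The $100 length is an assumption on my part, and the output column name mirrors the step's CONTACT_CASE default:

```sas
/* Sketch of the casing operation (assumes the ENUSA locale is loaded) */
data work.contacts_cased;
   set sasdqref.contacts;
   length contact_case $ 100;                                /* assumed length */
   contact_case = dqCase(contact, 'Proper (Name)', 'ENUSA');
run;
```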
The results show that CONTACT names have been appropriately cased.
Notice that names with special casing rules (like Mc- prefixes) are also appropriately cased.
We’ll correct the remaining issues using standardization definitions.
Standardization
Standardization definitions apply QKB rules and logic to the individual tokens in an input text string.
For example, say I want to apply the Address standardization definition to the street address 100 SAS CAMPUS DR. The value would be split into the street number (100), street name (SAS CAMPUS), and street type (DR). Then, the tokens are transformed as necessary: the street name becomes SAS Campus, and the street type becomes Dr. The tokens are combined again to make the standardized value 100 SAS Campus Dr.
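As a quick illustration of this tokenize-transform-rejoin flow, the dqStandardize function applies the same Address definition to a single value (again assuming the ENUSA locale is loaded):

```sas
data _null_;
   length std $ 50;
   std = dqStandardize('100 SAS CAMPUS DR', 'Address', 'ENUSA');
   put std=;   /* expected: 100 SAS Campus Dr */
run;
```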
Standardization definitions are available for other data quality tasks, including masking data or removing specific characters. For more information on standardization, visit the QKB documentation.
I’ll update my flow to enable standardization, then start with applying the State/Province (Abbreviation) definition to STATE.
Under the Additional Standardize heading, I’ll configure settings for my other columns as follows (see the code sketch after this list):
- PHONE variable -> Phone definition
- ADDRESS variable -> Address definition
- CITY variable -> City definition
- COMPANY variable -> Organization definition
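For reference, here is a minimal DATA step sketch of the same configuration using dqStandardize. The _STND output names are my own illustration rather than the step's defaults, and the code assumes the ENUSA locale is loaded:

```sas
data work.contacts_std;
   set work.contacts_cased;   /* output of the casing sketch above */
   length state_stnd $ 2 phone_stnd address_stnd city_stnd company_stnd $ 100;
   state_stnd   = dqStandardize(state,   'State/Province (Abbreviation)', 'ENUSA');
   phone_stnd   = dqStandardize(phone,   'Phone',        'ENUSA');
   address_stnd = dqStandardize(address, 'Address',      'ENUSA');
   city_stnd    = dqStandardize(city,    'City',         'ENUSA');
   company_stnd = dqStandardize(company, 'Organization', 'ENUSA');
run;
```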
After running this step, I can compare the original columns to their new standardized output in the table viewer.
The STATE values now appear only as two-letter abbreviations, a marked improvement over the wide variety of patterns found in my previous post. PHONE values are also uniformly standardized - notice that the Phone standardization definition uses a different format than the original column did.
ADDRESS and CITY have also been standardized nicely. The Address definition shortens directional prefixes (like East, West, North, and South) and road types (like Street, Road, Circle, or Avenue). The City definition does the opposite by spelling out directional prefixes in CITY names.
However, the Organization definition didn’t quite work for the COMPANY values. While the values now fit typical company name standards, several variations of the same company name remain.
This issue can be solved in a few ways. If you don’t mind coding, check out my series on creating and applying custom schemes with PROC DQSCHEME. Alternatively, the user-friendly, point-and-click SAS QKB Definition Editor lets you customize the Organization definition to include these values or create a new custom scheme or definition without any coding. Customized QKBs can be saved and used in SAS Viya. To learn more, see the workshops Understanding the SAS Quality Knowledge Base and Creating a New Data Type in the SAS Quality Knowledge Base.
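For a taste of the coding route, a scheme could be built and applied with PROC DQSCHEME along the lines below. This is a sketch under assumptions: the Organization (Scheme Build) match definition from the QKB for Contact Information, the COMPANY_STND column from the earlier sketch, and a generated scheme that you would review and edit before applying.

```sas
/* Build a scheme from the standardized company values (names assumed for illustration) */
proc dqscheme data=work.contacts_std nobfd;
   create matchdef='Organization (Scheme Build)'
          var=company_stnd
          scheme=work.company_scheme
          locale='ENUSA';
run;

/* After reviewing/editing WORK.COMPANY_SCHEME, apply it to the data */
proc dqscheme data=work.contacts_std out=work.contacts_final nobfd;
   apply var=company_stnd scheme=work.company_scheme;
run;
```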
Summary
In this post, I’ve demonstrated how to fix data quality problems with the Clean Data step. If you missed my previous post, give it a read to learn how this step can help you uncover your DQ issues. In the next part, I’ll show you how to retrieve specific tokens from your text data using the Parse Data step.
What semantic type of data do you spend the most time cleaning, if any? Does the Clean Data step help to accelerate your data preparation processes? Share your thoughts, questions, and feedback below!
Find more articles from SAS Global Enablement and Learning here.