SAS Communities Library


Detailing your Data in SAS Studio Part 2: Cleaning up the Clutter

Published ‎04-07-2025

Let's be honest: if you work with data in any capacity, you've found data quality errors before. You've seen a variable that uses several different formats, a row that is fully uppercased when others are proper cased, or a proper noun that is spelled fifteen different ways throughout the data set. Identifying the issues is half of the battle - but what do you do now?

 

In my last post, I demonstrated how to use the Clean Data step in SAS Studio to analyze your data and quickly uncover data quality issues. This time, I’ll show you how to use the exact same step to effortlessly clean up your data in a few clicks with the casing and standardization operations.

 


 

The Clean Data Step (Review) and Scenario

 

Recall that the Clean Data step (available with the SAS Studio Engineer license) enables five operations based on the SAS Quality Knowledge Base (QKB): standardization, casing, gender analysis, identification analysis, and pattern analysis. Visit the documentation for a full description of step capabilities.

 

Last time, I used the analysis operations to quickly identify issues and summarize data in the CONTACTS table from the SASDQREF library.

 

01_grbarn_clean_1.png


 

The findings include:

 

  • Casing is inconsistent in several columns (COMPANY, CONTACT, ADDRESS, STATE)
  • COMPANY names are written with many variations
  • STATE values switch between full names and abbreviations
  • PHONE values follow the same standard pattern except for a single record
  • ADDRESS values lack a standard for expressing road names (e.g., Street vs. St)
  • CITY lacks a standard for commonly abbreviated prefixes (e.g., South vs. S)

 

We’ll use the standardization and casing definitions to clean up these issues.

 

Casing

 

Case definitions convert text to uppercase, lowercase, or proper case. That may sound like an overly simple set of transformations, but the QKB also includes proper case definitions tailored to specific data types, such as names, organizations, and addresses, where special capitalization rules apply to certain values.

 

For example, proper casing the value BEN MCDONALD would return Ben Mcdonald, but applying the Proper (Name) definition would return Ben McDonald, which is standard for names with a Mc- prefix.
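The difference can be illustrated with a short sketch. This is a toy illustration of the rule, not the QKB's actual implementation; the real Proper (Name) definition covers many more naming conventions (Mac-, O'-, and so on):

```python
def proper_name(value: str) -> str:
    """Proper-case a name, applying a special rule for the Mc- prefix.

    A toy illustration of what a name-aware casing definition does;
    the real QKB Proper (Name) definition handles many more rules.
    """
    words = []
    for word in value.split():
        word = word.capitalize()  # plain proper casing first
        if word.startswith("Mc") and len(word) > 2:
            # Names like McDonald capitalize the letter after "Mc"
            word = "Mc" + word[2:].capitalize()
        words.append(word)
    return " ".join(words)

print(proper_name("BEN MCDONALD"))  # -> Ben McDonald
print("BEN MCDONALD".title())       # -> Ben Mcdonald (plain proper casing)
```

The first call models the Proper (Name) definition; the second shows what generic proper casing would do to the same value.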

 

Review some casing examples below:

 

02_grbarn_clean_2.png

 

For more information on casing, visit the QKB documentation.

 

I’ll use the Proper (Name) case definition on CONTACT values. Some names are uppercased, which I want to correct.

 

03_grbarn_clean_3.png

 

Simply connect a Clean Data step to CONTACTS in your flow, select the ENUSA locale, and enable casing. Then, select the CONTACT column and the Proper (Name) case definition. I’ll create a new column to store my results.

 

04_grbarn_clean_4.png

 

The default output column name will be InputColumnName_CASE.

 

The results show that CONTACT names have been appropriately cased.

 

05_grbarn_clean_5.png

 

Notice that names with special casing rules (like Mc- prefixes) are also appropriately cased.

 

06_grbarn_clean_6.png

 

We’ll correct the remaining issues using standardization definitions.

 

Standardization

 

Standardization definitions apply QKB rules and logic to the individual tokens in an input text string.

 

For example, say I want to apply the Address standardization definition to the street address 100 SAS CAMPUS DR. The value would be split into the street number (100), street name (SAS CAMPUS), and street type (DR). Then, the tokens are transformed as necessary: the street name becomes SAS Campus, and the street type becomes Dr. The tokens are combined again to make the standardized value 100 SAS Campus Dr.
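The tokenize-transform-reassemble flow can be sketched in a few lines. This is a minimal illustration of the logic, with a tiny hypothetical street-type lookup; the real Address definition relies on the QKB's full parsing grammars and vocabularies (which is also how it knows to keep an acronym like SAS uppercase in a street name, something this sketch does not attempt):

```python
# Hypothetical, tiny lookup of street-type standards.
STREET_TYPES = {"STREET": "St", "ST": "St", "DRIVE": "Dr", "DR": "Dr",
                "ROAD": "Rd", "RD": "Rd", "AVENUE": "Ave", "AVE": "Ave"}

def standardize_address(value: str) -> str:
    """Split an address into tokens, transform each, and reassemble."""
    tokens = value.split()
    number = tokens[0]  # street number passes through unchanged
    street_type = STREET_TYPES.get(tokens[-1].upper(), tokens[-1].title())
    name = " ".join(t.title() for t in tokens[1:-1])  # proper-case street name
    return " ".join([number, name, street_type])

print(standardize_address("100 MAIN STREET"))  # -> 100 Main St
print(standardize_address("25 ELM DR"))        # -> 25 Elm Dr
```

Each token is handled according to its role (number, name, type), which is the core idea behind token-based standardization.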

 

07_grbarn_clean_7.png

 

Standardization definitions are available for other data quality tasks, including masking data or removing specific characters. For more information on standardization, visit the QKB documentation.

 

I’ll update my flow to enable standardization, then start by applying the State/Province (Abbreviation) definition to STATE.

 

08_grbarn_clean_8.png

 

Under the Additional Standardize heading, I’ll configure settings for my other columns as follows:

 

  • PHONE variable -> Phone definition
  • ADDRESS variable -> Address definition
  • CITY variable -> City definition
  • COMPANY variable -> Organization definition

 

After running this step, I can compare the original columns to their new standardized output in the table viewer.

 

09_grbarn_clean_9.png

 

The STATE values now appear only as two-letter abbreviations, which is excellent compared to the wide variety of patterns found in my previous post. PHONE values are also uniformly standardized - notice that the Phone standardization definition uses a different format than the original column did.

 

10_grbarn_clean_10.png

 

ADDRESS and CITY have also been standardized nicely. The Address definition shortens directional prefixes (like East, West, North, and South) and road types (like Street, Road, Circle, or Avenue). The City definition does the opposite by spelling out directional prefixes in CITY names.

 

11_grbarn_clean_11.png

 

However, the Organization definition didn’t quite work for the COMPANY values. While the values now fit typical company name standards, we can still see several variations of the same company name.

 

This issue can be solved in a few ways. If you don’t mind coding, check out my series on creating and applying custom schemes with PROC DQSCHEME. You can also check out the user-friendly point-and-click SAS QKB Definition Editor. In the Definition Editor, you can customize the Organization definition to include these values or create a new custom scheme or definition (without any coding). Customized QKBs can be saved and used in SAS Viya. To learn more, check out the workshops Understanding the SAS Quality Knowledge Base and Creating a New Data Type in the SAS Quality Knowledge Base.

 

Summary

 

In this post, I’ve demonstrated how to fix data quality problems with the Clean Data step. If you missed my previous post, give it a read to learn how this step can help you uncover your DQ issues. In the next part, I’ll show you how to retrieve specific tokens from your text data using the Parse Data step.

 

What semantic type of data do you spend the most time cleaning, if any? Does the Clean Data step help to accelerate your data preparation processes? Share your thoughts, questions, and feedback below!

 

 

Find more articles from SAS Global Enablement and Learning here.
