Coding for Data Quality in SAS Viya Part 1 – Fundamentals

3 Likes

In my 30 years of SAS programming experience, I’ve spent an enormous amount of time writing, testing, and refining data-cleaning code. I was aware that SAS had a data quality (DQ) product, but it wasn’t available in my company’s SAS installations, so I just ignored SAS DQ and did it all myself. And don’t get me wrong – you can really clean some data using Base SAS tools! But writing custom code for fuzzy matching and data standardization can be an onerous and repetitive chore. Recently, while kicking the tires on a new version of SAS Viya, I made an amazing discovery – SAS Data Quality ships with all Viya configurations – and has done so all along! The purpose of this blog is to introduce seasoned SAS programmers like me to the amazing, time-saving SAS Data Quality tools available in Viya.

SAS DQ provides pre-packaged elements for fuzzy matching and standardization, making it easy to produce higher-quality data. Many SAS Viya applications, such as SAS Data Preparation and Event Stream Processing, use SAS DQ elements in their processing. For coders, SAS DQ language elements greatly simplify parsing, identification analysis, element extraction, data standardization and casing, fuzzy matching, and more.

At the heart of it all is the Quality Knowledge Base, or QKB. The QKB currently being distributed with SAS Viya is the “SAS Quality Knowledge Base for Contact Information 33”. A QKB contains a collection of rules and reference data which provide powerful tools for parsing text, identifying the contents of parsed sub-strings, and applying appropriate rules for casing, standardization, and matching.

In the following simple example, I have some names that have been entered in an inconsistent order, and are not properly cased:

boaty b. mcboatface
von boatface, boatie boat

The goal is to produce a usable dataset containing properly cased first name, middle initial, and last name columns. This would be a daunting task using only base SAS techniques, but SAS Data Quality makes the task almost trivial. Here I’ve used the DQPARSE function to automatically identify the first name, last name, and middle initial from the input text and the DQCASE function to properly capitalize the extracted values, all in a few lines of code:

data Names;
	/* Set lengths for the new desired columns */
	length Last First $50 MI $1;
	input;
	/* Let the QKB determine the parts of each name */
	_parsedName=dqParse(_infile_, 'Name');
	/* Grab the last name and properly case the value */
	Last =dqCase(dqParseTokenGet(_parsedName, 'Family Name'
                  , 'Name'), 'Proper (Name)');
	/* Grab the first name and properly case the value */
	First=dqCase(dqParseTokenGet(_parsedName, 'Given Name'
                  , 'Name'), 'Proper (Name)');
	/* Grab the middle initial and upper case the value*/
	MI=upcase(dqParseTokenGet(_parsedName, 'Given Name', 'Name'));
	drop _:;
datalines4;
von boatface, boatie boat
boaty b. mcboatface
;;;;

Results:

Obs	Last	First	MI
1	von Boatface	Boatie	B
2	McBoatface	Boaty	B

Notice the proper parsing and formatting of “von Boatface”? The QKB made that so easy to code!

To do its work, the QKB needs to know what language to expect and from what geographic region. For example, the Portuguese spoken in Brazil differs significantly from the Portuguese spoken in Portugal. The combination of language and geography is called a locale, and a QKB usually contains information for myriad different locales. Notice that I didn’t specify a locale in my code, yet everything processed beautifully. But I’m curious - what QKB and locale was used to process my text? PROC OPTIONS with the GROUP= option will provide that information. I’ll submit this code:

proc options group=dataquality;
run;

And see the results in the Log:

DQLOCALE=(ENUSA)  Specifies the Data Quality Server ordered list of locales for data cleansing.
...
DQSETUPLOC=QKB CI 33
Specifies the location of the Quality Knowledge Base root directory.

So, the QKB name is QKB CI 33, and the locale in use is ENUSA. I can find documentation for the QKB online in the documentation for the "SAS Quality Knowledge Base for Contact Information 33". In the appendix titled “QKB Locale ISO Codes,” I find that “ENUSA” is the code for “English, United States”. This is so easy that I find it quite exciting! The latest SAS Data Quality documentation is definitely worth a read! Have you been using SAS DQ functionality in your SAS code? What's the SAS DQ feature you find most compelling? In the next blog, I think I’ll dive into data standardization in SAS Viya, including working with SAS DQ in CAS. Until then, may the SAS be with you!

Read the rest of the series:

Part 2 – Standardization

Part 3 – Fuzzy Matching with Match Codes

Part 4 – Extracting Data from Text Files

Quentin · ‎08-23-2023

Wow, I also have never looked at SAS DQ. This is a great example of how Viya has been designed for SAS programmers! Can't wait to see more SAS programming in Viya at SAS Explore.

SASJedi · ‎08-23-2023

@Quentin - I know, it was quite a revelation to me, too! Glad you found the example useful. And there should be tons more of things like this at SAS Explore 😁

Coding for Data Quality in SAS Viya Part 1 – Fundamentals

Read the rest of the series:

Free course: Data Literacy Essentials

Get Started