Coding for Data Quality in SAS Viya Part 2 - Standardization

3 Likes

In my long SAS programming career, I’ve found the 80-20 rule applies when it comes to working with data: 80% of your time is spent finding, cleaning, munging, and otherwise prepping the data, and the other 20% is spent doing the actual reporting, analysis, etc. The SAS Data Quality features in SAS Viya make that "80%" part much easier and faster. After acquiring some data, often the first step required is to standardize text values to make munging and reporting easier. This might involve rearranging words, like re-writing individual’s names in first name, last name order, changing individual words to abbreviations or vice-versa, and applying complex casing rules. The transformations applied to text often vary with the semantic type of the data. For example, if the string "Virginia" is a US state name, then you probably want it transformed to "VA". But if it’s an individual’s name instead, for example, "DOE, VIRGINIA", then we would want "Virginia Doe" as the result instead. And speaking of casing – the SAS PROPCASE function is excellent, but how do you code it to handle ‘McDonald’ or ‘von Trapp’? Fortunately, there are SAS Data Quality standardization definitions tailored for specific semantic types of the input text which can apply complex, pre-defined rules to standardize values with very little code required.

First, let’s have a peek at the data we’ve got to work with:

Row	Name	Address	City	State	Zip Code
1	CAT THOMAS	3005 BROOKSIDE DRIVE	GUIN	ALABAMA	35563
2	VON NEUMANN, JOHN	4026 MAPLE LANE	HUNTSVILLE	ALABAMA	35802
3	SAM TENNYSON	2715 MULBERRY AVE	DONALDSON	ARKANSAS	71941
4	CURRY, CHRISTY	399 MASONIC HILL ROAD	LITTLE ROCK	ARKANSAS	72212
5	WANDA SMITH	2190 CEDAR ST	PINE BLUFF	ARKANSAS	71601

I notice that the values are all upper case. Other problems include having people’s names entered inconsistently, addresses that don’t use standard abbreviations, and the State column values include the whole state name instead of just the two-letter postal code. This data could really use some standardization!

Before we start, let’s check the existing Data Quality System options.

proc options group=dataquality;
run;

Partial Log:

DQLOCALE=(ENUSA)
Specifies the Data Quality Server ordered list of locales for data cleansing.
DQSETUPLOC=QKB CI 33
Specifies the location of the Quality Knowledge Base root directory.

I’ll use the %DQLOAD autocall macro to make the locales I want to work with available in memory.

%DQLOAD(DQSETUPLOC="QKB CI 33", DQLOCALE=(ENUSA PTBRA), DQINFO=1);

Partial Log:

DQINFO: DQLOCALE parm that was passed in to the DQLOAD
DQINFO: macro: (ENUSA PTBRA)
DQINFO: DQSETUPLOC parm that was passed in to the DQLOAD
DQINFO: macro: “QKB CI 33”
DQINFO: Need to set the DQLOCALE option
…
DQINFO: Before = (ENUSA)
DQINFO: After = (ENUSA PTBRA)
DQINFO: Need to set the DQSETUPLOC option
…
DQINFO: Before = QKB CI 33
DQINFO: After = QKB CI 33
DQINFO: Need to load the locale(s)
…
DQINFO: Done loading locale(s).

Next, I’ll use the DQLOCALEINFOLIST function to get the names of the standardization definitions available in the ENUSA locale.

data _null_;
rc=DQLOCALEINFOLIST('STANDARDIZATION', 'ENUSA');
run;

The list is long! For now, I’ll focus on the ones associated with names and addresses.

Partial Log:

OPERATION: Definition Name
—————————
STANDARDIZATION: Address
…
STANDARDIZATION: City
STANDARDIZATION: City – State/Province – Postal Code
…
STANDARDIZATION: Country
STANDARDIZATION: Country (ISO 2 Char)
STANDARDIZATION: Country (ISO 3 Char)
…
STANDARDIZATION: Name
…
STANDARDIZATION: State/Province (Abbreviation)
STANDARDIZATION: State/Province (Full Name)

We can use the DQSTANDARDIZE function to apply standardization definitions to our data for quick and easy cleanup. Let’s use the Name definition on the values in our Name column :

proc FedSQL number;
select Name
,DQSTANDARDIZE(Name,'Name', 'ENUSA') as StdName
from cfdq2.mailinglist limit 5;
quit;

Results:

Row	Name	STDNAME
1	CAT THOMAS	Cat Thomas
2	VON NEUMANN, JOHN	John von Neumann
3	SAM TENNYSON	Sam Tennyson
4	CURRY, CHRISTY	Christy Curry
5	WANDA SMITH	Wanda Smith

Wow - that fixed the casing and put the values in First Name, Last Name format, all at once! I wonder how well the Address definition will work on our Address column values.

proc FedSQL number;
select Address
,DQSTANDARDIZE(Address,'Address', 'ENUSA') as StdAddress
from cfdq2.mailinglist limit 5;
quit;

Results:

Row	Address	STDADDRESS
1	3005 BROOKSIDE DRIVE	3005 Brookside Dr
2	4026 MAPLE LANE	4026 Maple Ln
3	2715 MULBERRY AVE	2715 Mulberry Ave
4	399 MASONIC HILL ROAD	399 Masonic Hill Rd
5	2190 CEDAR ST	2190 Cedar St

Very nice! Everything is properly cased, and standard abbreviations were applied for lane, drive, avenue, etc. Finally, let's test using the City definition to clean up the City values, and the 'State/Province (Abbreviation)’ to produce the appropriate two-letter state abbreviations from the State values.

proc FedSQL number;
select City
,DQSTANDARDIZE(City,'City', 'ENUSA') as StdCity
,State
,DQSTANDARDIZE(State,'State/Province (Abbreviation)', 'ENUSA') as StdState
from cfdq2.mailinglist limit 5;
quit;

Results:

Row	City	STDCITY	State	STDSTATE
1	GUIN	Guin	ALABAMA	AL
2	HUNTSVILLE	Huntsville	ALABAMA	AL
3	DONALDSON	Donaldson	ARKANSAS	AR
4	LITTLE ROCK	Little Rock	ARKANSAS	AR
5	PINE BLUFF	Pine Bluff	ARKANSAS	AR

That looks great! OK, I’m ready to clean up the whole data set:

data cfdq2.mailinglist_clean;
/* Use LENGTH to control variable length and logical order */
length Name $50 Address $30 City $20 State $2 Zip 8 Country $3;
/*Rename Country and State (need new lengths for cleaned values)*/
set cfdq2.mailinglist (rename=(Country=_c State=_s));
/* Drop variables with names that start with underscore */
drop _:;
/* Standardize values with the appropriate definition */
Name=DQSTANDARDIZE(Name,'Name', 'ENUSA');
Address=DQSTANDARDIZE(Address,'Address', 'ENUSA');
City=DQSTANDARDIZE(City,'City', 'ENUSA');
State=DQSTANDARDIZE(_s,'State/Province (Abbreviation)', 'ENUSA');
Country=DQSTANDARDIZE(_c,'Country (ISO 3 Char)', 'ENUSA');
run;

Results:

Obs	Name	Address	City	State	Zip Code	Country
1	Cat Thomas	3005 Brookside Dr	Guin	AL	35563	USA
2	John von Neumann	4026 Maple Ln	Huntsville	AL	35802	USA
3	Sam Tennyson	2715 Mulberry Ave	Donaldson	AR	71941	USA
4	Christy Curry	399 Masonic Hill Rd	Little Rock	AR	72212	USA
5	Wanda Smith	2190 Cedar St	Pine Bluff	AR	71601	USA
6	Jim Hoadley	1078 Davis Ave	Concord	CA	94520	USA
7	Calvin Wenzel	1007 Thompson Dr	Dublin	CA	94568	USA
8	Jennifer Ryan	4974 Water St	Fremont	CA	94539	USA
9	bleep Reyes	4819 Kerry Way	Los Angeles	CA	90017	USA
10	James Attwood	3238 Heritage Rd	San Joaquin	CA	93660	USA
11	Echo Kaye	3215 Thompson Dr	San Leandro	CA	94578	USA
12	Anna Christian	3323 Morgan St	Fort Walton Beach	FL	32548	USA
13	Linda Cheeks	3385 Rinehart Rd	Miami	FL	33128	USA
14	Agnes Yates	4334 Chestnut St	Polk City	FL	33868	USA
15	Regina Snider	3697 Rinehart Rd	Sunrise	FL	33323	USA
16	Nichole Robinson	622 Chestnut St	Tampa	FL	33610	USA
17	Kevin Ramirez	1813 Flint St	Atlanta	GA	30303	USA
18	Kris Carolan	3935 Lakeland Park Dr	Duluth	GA	30097	USA
19	Thomas Jenkins	2565 Pin Oak Dr	Davenport	IA	52803	USA
20	Michael Tillman	4854 Poplar St	Calumet City	IL	60409	USA

And my data is beautifully standardized, using just a few lines of code.

But wait! You say you have process data that is in CAS instead of the Compute Server? Well, no worries. The DATA step code you need is almost identical - you just have to make sure the DATA step reads from and writes to a SAS library that uses the CAS engine.

/* Connect to CAS & load the source data to the CASUSER caslib */
cas conn;
proc casutil sessref=conn;
load data=cfdq2.mailinglist
casout="mailinglist" outcaslib="casuser" replace;
list tables incaslib="casuser";
run;
/* Process in CAS with the DATA step */
/* Change the LIBREF to point to a CAS engine library (casuser)*/
libname casuser cas caslib="casuser" sessref=conn;
data casuser.mailinglist_clean;
/* Use LENGTH to control variable length and logical order */
length Name $50 Address $30 City $20 State $2 Zip 8 Country $3;
/*Rename Country and State (need new lengths for cleaned values)*/
set casuser.mailinglist (rename=(Country=_c State=_s));
/* Drop variables with names that start with underscore */
drop _:;
/* Standardize values with the appropriate definition */
Name=DQSTANDARDIZE(Name,'Name', 'ENUSA');
Address=DQSTANDARDIZE(Address,'Address', 'ENUSA');
City=DQSTANDARDIZE(City,'City', 'ENUSA');
State=DQSTANDARDIZE(_s,'State/Province (Abbreviation)', 'ENUSA');
Country=DQSTANDARDIZE(_c,'Country (ISO 3 Char)', 'ENUSA');
run;

Partial Log:

NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.

Success - our DATA step ran multi-threaded in CAS! So, there you have it - data standardization is made so much easier by SAS Viya's SAS Data Quality features. Until the next time, may the SAS be with you!

Read the rest of the series:

Part 1 – Fundamentals

Part 2 – Standardization - This one!

Part 3 – Fuzzy Matching with Match Codes

Part 4 – Extracting Data from Text Files

Find more articles from SAS Global Enablement and Learning here.

Coding for Data Quality in SAS Viya Part 2 - Standardization

Read the rest of the series:

SAS Innovate 2025: Call for Content

Free course: Data Literacy Essentials

Get Started