BookmarkSubscribeRSS Feed

Coding for Data Quality in SAS Viya Part 2 - Standardization

Started ‎09-12-2023 by
Modified ‎12-07-2023 by
Views 1,021

In my long SAS programming career, I’ve found the 80-20 rule applies when it comes to working with data: 80% of your time is spent finding, cleaning, munging, and otherwise prepping the data, and the other 20% is spent doing the actual reporting, analysis, etc. The SAS Data Quality features in SAS Viya make that "80%" part much easier and faster. After acquiring some data, often the first step required is to standardize text values to make munging and reporting easier. This might involve rearranging words, like re-writing individual’s names in first name, last name order, changing individual words to abbreviations or vice-versa, and applying complex casing rules. The transformations applied to text often vary with the semantic type of the data. For example, if the string "Virginia" is a US state name, then you probably want it transformed to "VA". But if it’s an individual’s name instead, for example, "DOE, VIRGINIA", then we would want "Virginia Doe" as the result instead. And speaking of casing – the SAS PROPCASE function is excellent, but how do you code it to handle ‘McDonald’ or ‘von Trapp’? Fortunately, there are SAS Data Quality standardization definitions tailored for specific semantic types of the input text which can apply complex, pre-defined rules to standardize values with very little code required.

 

First, let’s have a peek at the data we’ve got to work with:

 

Row Name Address City State Zip Code
1 CAT THOMAS 3005 BROOKSIDE DRIVE GUIN ALABAMA 35563
2 VON NEUMANN, JOHN 4026 MAPLE LANE HUNTSVILLE ALABAMA 35802
3 SAM TENNYSON 2715 MULBERRY AVE DONALDSON ARKANSAS 71941
4 CURRY, CHRISTY 399 MASONIC HILL ROAD LITTLE ROCK ARKANSAS 72212
5 WANDA SMITH 2190 CEDAR ST PINE BLUFF ARKANSAS 71601

 

I notice that the values are all upper case. Other problems include having people’s names entered inconsistently, addresses that don’t use standard abbreviations, and the State column values include the whole state name instead of just the two-letter postal code. This data could really use some standardization!

 

Before we start, let’s check the existing Data Quality System options.

  

proc options group=dataquality;
run;

 

Partial Log:

DQLOCALE=(ENUSA)
Specifies the Data Quality Server ordered list of locales for data cleansing.
DQSETUPLOC=QKB CI 33
Specifies the location of the Quality Knowledge Base root directory.

 

I’ll use the %DQLOAD autocall macro to make the locales I want to work with available in memory. 

 

%DQLOAD(DQSETUPLOC="QKB CI 33", DQLOCALE=(ENUSA PTBRA), DQINFO=1);

 

Partial Log:

DQINFO: DQLOCALE parm that was passed in to the DQLOAD
DQINFO: macro: (ENUSA PTBRA)
DQINFO: DQSETUPLOC parm that was passed in to the DQLOAD
DQINFO: macro: “QKB CI 33”
DQINFO: Need to set the DQLOCALE option
…
DQINFO: Before = (ENUSA)
DQINFO: After = (ENUSA PTBRA)
DQINFO: Need to set the DQSETUPLOC option
…
DQINFO: Before = QKB CI 33
DQINFO: After = QKB CI 33
DQINFO: Need to load the locale(s)
…
DQINFO: Done loading locale(s).

 

Next, I’ll use the DQLOCALEINFOLIST function to get the names of the standardization definitions available in the ENUSA locale.

 

data _null_;
rc=DQLOCALEINFOLIST('STANDARDIZATION', 'ENUSA');
run;

 

The list is long! For now, I’ll focus on the ones associated with names and addresses.

 

Partial Log:

OPERATION: Definition Name
—————————
STANDARDIZATION: Address
…
STANDARDIZATION: City
STANDARDIZATION: City – State/Province – Postal Code
…
STANDARDIZATION: Country
STANDARDIZATION: Country (ISO 2 Char)
STANDARDIZATION: Country (ISO 3 Char)
…
STANDARDIZATION: Name
…
STANDARDIZATION: State/Province (Abbreviation)
STANDARDIZATION: State/Province (Full Name)

 

We can use the DQSTANDARDIZE function to apply standardization definitions to our data for quick and easy cleanup. Let’s use the Name definition on the values in our Name column :

 

proc FedSQL number;
select Name
,DQSTANDARDIZE(Name,'Name', 'ENUSA') as StdName
from cfdq2.mailinglist limit 5;
quit;

 

Results:

Row Name STDNAME
1 CAT THOMAS Cat Thomas
2 VON NEUMANN, JOHN John von Neumann
3 SAM TENNYSON Sam Tennyson
4 CURRY, CHRISTY Christy Curry
5 WANDA SMITH Wanda Smith

 

Wow - that fixed the casing and put the values in First Name, Last Name format, all at once! I wonder how well the Address definition will work on our Address column values.

 

proc FedSQL number;
select Address
,DQSTANDARDIZE(Address,'Address', 'ENUSA') as StdAddress
from cfdq2.mailinglist limit 5;
quit;

 

Results:

Row Address STDADDRESS
1 3005 BROOKSIDE DRIVE 3005 Brookside Dr
2 4026 MAPLE LANE 4026 Maple Ln
3 2715 MULBERRY AVE 2715 Mulberry Ave
4 399 MASONIC HILL ROAD 399 Masonic Hill Rd
5 2190 CEDAR ST 2190 Cedar St

 

Very nice! Everything is properly cased, and standard abbreviations were applied for lane, drive, avenue, etc. Finally, let's test using the City definition to clean up the City values, and the 'State/Province (Abbreviation)’ to produce the appropriate two-letter state abbreviations from the State values.

 

proc FedSQL number;
select City
,DQSTANDARDIZE(City,'City', 'ENUSA') as StdCity
,State
,DQSTANDARDIZE(State,'State/Province (Abbreviation)', 'ENUSA') as StdState
from cfdq2.mailinglist limit 5;
quit;

 

Results:

Row City STDCITY State STDSTATE
1 GUIN Guin ALABAMA AL
2 HUNTSVILLE Huntsville ALABAMA AL
3 DONALDSON Donaldson ARKANSAS AR
4 LITTLE ROCK Little Rock ARKANSAS AR
5 PINE BLUFF Pine Bluff ARKANSAS AR

 

That looks great! OK, I’m ready to clean up the whole data set:

 

data cfdq2.mailinglist_clean;
/* Use LENGTH to control variable length and logical order */
length Name $50 Address $30 City $20 State $2 Zip 8 Country $3;
/*Rename Country and State (need new lengths for cleaned values)*/
set cfdq2.mailinglist (rename=(Country=_c State=_s));
/* Drop variables with names that start with underscore */
drop _:;
/* Standardize values with the appropriate definition */
Name=DQSTANDARDIZE(Name,'Name', 'ENUSA');
Address=DQSTANDARDIZE(Address,'Address', 'ENUSA');
City=DQSTANDARDIZE(City,'City', 'ENUSA');
State=DQSTANDARDIZE(_s,'State/Province (Abbreviation)', 'ENUSA');
Country=DQSTANDARDIZE(_c,'Country (ISO 3 Char)', 'ENUSA');
run;

 

Results:

Obs Name Address City State Zip Code Country
1 Cat Thomas 3005 Brookside Dr Guin AL 35563 USA
2 John von Neumann 4026 Maple Ln Huntsville AL 35802 USA
3 Sam Tennyson 2715 Mulberry Ave Donaldson AR 71941 USA
4 Christy Curry 399 Masonic Hill Rd Little Rock AR 72212 USA
5 Wanda Smith 2190 Cedar St Pine Bluff AR 71601 USA
6 Jim Hoadley 1078 Davis Ave Concord CA 94520 USA
7 Calvin Wenzel 1007 Thompson Dr Dublin CA 94568 USA
8 Jennifer Ryan 4974 Water St Fremont CA 94539 USA
9 **bleep** Reyes 4819 Kerry Way Los Angeles CA 90017 USA
10 James Attwood 3238 Heritage Rd San Joaquin CA 93660 USA
11 Echo Kaye 3215 Thompson Dr San Leandro CA 94578 USA
12 Anna Christian 3323 Morgan St Fort Walton Beach FL 32548 USA
13 Linda Cheeks 3385 Rinehart Rd Miami FL 33128 USA
14 Agnes Yates 4334 Chestnut St Polk City FL 33868 USA
15 Regina Snider 3697 Rinehart Rd Sunrise FL 33323 USA
16 Nichole Robinson 622 Chestnut St Tampa FL 33610 USA
17 Kevin Ramirez 1813 Flint St Atlanta GA 30303 USA
18 Kris Carolan 3935 Lakeland Park Dr Duluth GA 30097 USA
19 Thomas Jenkins 2565 Pin Oak Dr Davenport IA 52803 USA
20 Michael Tillman 4854 Poplar St Calumet City IL 60409 USA

 

And my data is beautifully standardized, using just a few lines of code.

 

But wait! You say you have process data that is in CAS instead of the Compute Server? Well, no worries. The DATA step code you need is almost identical - you just have to make sure the DATA step reads from and writes to a SAS library that uses the CAS engine.

 

/* Connect to CAS & load the source data to the CASUSER caslib */
cas conn;
proc casutil sessref=conn;
load data=cfdq2.mailinglist
casout="mailinglist" outcaslib="casuser" replace;
list tables incaslib="casuser";
run;
/* Process in CAS with the DATA step */
/* Change the LIBREF to point to a CAS engine library (casuser)*/
libname casuser cas caslib="casuser" sessref=conn;
data casuser.mailinglist_clean;
/* Use LENGTH to control variable length and logical order */
length Name $50 Address $30 City $20 State $2 Zip 8 Country $3;
/*Rename Country and State (need new lengths for cleaned values)*/
set casuser.mailinglist (rename=(Country=_c State=_s));
/* Drop variables with names that start with underscore */
drop _:;
/* Standardize values with the appropriate definition */
Name=DQSTANDARDIZE(Name,'Name', 'ENUSA');
Address=DQSTANDARDIZE(Address,'Address', 'ENUSA');
City=DQSTANDARDIZE(City,'City', 'ENUSA');
State=DQSTANDARDIZE(_s,'State/Province (Abbreviation)', 'ENUSA');
Country=DQSTANDARDIZE(_c,'Country (ISO 3 Char)', 'ENUSA');
run;
 
Partial Log:
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
 
Success - our DATA step ran multi-threaded in CAS! So, there you have it - data standardization is made so much easier by SAS Viya's SAS Data Quality features. Until the next time, may the SAS be with you!
 

Read the rest of the series:

Part 1 – Fundamentals

Part 2 – Standardization - This one!

Part 3 – Fuzzy Matching with Match Codes

Part 4 – Extracting Data from Text Files

 

 

 

Find more articles from SAS Global Enablement and Learning here.

Version history
Last update:
‎12-07-2023 08:52 AM
Updated by:
Contributors

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags