In my long SAS programming career, I’ve found the 80-20 rule applies when it comes to working with data: 80% of your time is spent finding, cleaning, munging, and otherwise prepping the data, and the other 20% is spent doing the actual reporting, analysis, etc. The SAS Data Quality features in SAS Viya make that "80%" part much easier and faster. After acquiring some data, often the first step required is to standardize text values to make munging and reporting easier. This might involve rearranging words, like re-writing individual’s names in first name, last name order, changing individual words to abbreviations or vice-versa, and applying complex casing rules. The transformations applied to text often vary with the semantic type of the data. For example, if the string "Virginia" is a US state name, then you probably want it transformed to "VA". But if it’s an individual’s name instead, for example, "DOE, VIRGINIA", then we would want "Virginia Doe" as the result instead. And speaking of casing – the SAS PROPCASE function is excellent, but how do you code it to handle ‘McDonald’ or ‘von Trapp’? Fortunately, there are SAS Data Quality standardization definitions tailored for specific semantic types of the input text which can apply complex, pre-defined rules to standardize values with very little code required.
First, let’s have a peek at the data we’ve got to work with:
Row | Name | Address | City | State | Zip Code |
1 | CAT THOMAS | 3005 BROOKSIDE DRIVE | GUIN | ALABAMA | 35563 |
2 | VON NEUMANN, JOHN | 4026 MAPLE LANE | HUNTSVILLE | ALABAMA | 35802 |
3 | SAM TENNYSON | 2715 MULBERRY AVE | DONALDSON | ARKANSAS | 71941 |
4 | CURRY, CHRISTY | 399 MASONIC HILL ROAD | LITTLE ROCK | ARKANSAS | 72212 |
5 | WANDA SMITH | 2190 CEDAR ST | PINE BLUFF | ARKANSAS | 71601 |
I notice that the values are all upper case. Other problems include having people’s names entered inconsistently, addresses that don’t use standard abbreviations, and the State column values include the whole state name instead of just the two-letter postal code. This data could really use some standardization!
Before we start, let’s check the existing Data Quality System options.
proc options group=dataquality;
run;
Partial Log:
DQLOCALE=(ENUSA)
Specifies the Data Quality Server ordered list of locales for data cleansing.
DQSETUPLOC=QKB CI 33
Specifies the location of the Quality Knowledge Base root directory.
I’ll use the %DQLOAD autocall macro to make the locales I want to work with available in memory.
%DQLOAD(DQSETUPLOC="QKB CI 33", DQLOCALE=(ENUSA PTBRA), DQINFO=1);
Partial Log:
DQINFO: DQLOCALE parm that was passed in to the DQLOAD
DQINFO: macro: (ENUSA PTBRA)
DQINFO: DQSETUPLOC parm that was passed in to the DQLOAD
DQINFO: macro: “QKB CI 33”
DQINFO: Need to set the DQLOCALE option
…
DQINFO: Before = (ENUSA)
DQINFO: After = (ENUSA PTBRA)
DQINFO: Need to set the DQSETUPLOC option
…
DQINFO: Before = QKB CI 33
DQINFO: After = QKB CI 33
DQINFO: Need to load the locale(s)
…
DQINFO: Done loading locale(s).
Next, I’ll use the DQLOCALEINFOLIST function to get the names of the standardization definitions available in the ENUSA locale.
data _null_;
rc=DQLOCALEINFOLIST('STANDARDIZATION', 'ENUSA');
run;
The list is long! For now, I’ll focus on the ones associated with names and addresses.
Partial Log:
OPERATION: Definition Name
—————————
STANDARDIZATION: Address
…
STANDARDIZATION: City
STANDARDIZATION: City – State/Province – Postal Code
…
STANDARDIZATION: Country
STANDARDIZATION: Country (ISO 2 Char)
STANDARDIZATION: Country (ISO 3 Char)
…
STANDARDIZATION: Name
…
STANDARDIZATION: State/Province (Abbreviation)
STANDARDIZATION: State/Province (Full Name)
We can use the DQSTANDARDIZE function to apply standardization definitions to our data for quick and easy cleanup. Let’s use the Name definition on the values in our Name column :
proc FedSQL number;
select Name
,DQSTANDARDIZE(Name,'Name', 'ENUSA') as StdName
from cfdq2.mailinglist limit 5;
quit;
Results:
Row | Name | STDNAME |
1 | CAT THOMAS | Cat Thomas |
2 | VON NEUMANN, JOHN | John von Neumann |
3 | SAM TENNYSON | Sam Tennyson |
4 | CURRY, CHRISTY | Christy Curry |
5 | WANDA SMITH | Wanda Smith |
Wow - that fixed the casing and put the values in First Name, Last Name format, all at once! I wonder how well the Address definition will work on our Address column values.
proc FedSQL number;
select Address
,DQSTANDARDIZE(Address,'Address', 'ENUSA') as StdAddress
from cfdq2.mailinglist limit 5;
quit;
Results:
Row | Address | STDADDRESS |
1 | 3005 BROOKSIDE DRIVE | 3005 Brookside Dr |
2 | 4026 MAPLE LANE | 4026 Maple Ln |
3 | 2715 MULBERRY AVE | 2715 Mulberry Ave |
4 | 399 MASONIC HILL ROAD | 399 Masonic Hill Rd |
5 | 2190 CEDAR ST | 2190 Cedar St |
Very nice! Everything is properly cased, and standard abbreviations were applied for lane, drive, avenue, etc. Finally, let's test using the City definition to clean up the City values, and the 'State/Province (Abbreviation)’ to produce the appropriate two-letter state abbreviations from the State values.
proc FedSQL number;
select City
,DQSTANDARDIZE(City,'City', 'ENUSA') as StdCity
,State
,DQSTANDARDIZE(State,'State/Province (Abbreviation)', 'ENUSA') as StdState
from cfdq2.mailinglist limit 5;
quit;
Results:
Row | City | STDCITY | State | STDSTATE |
1 | GUIN | Guin | ALABAMA | AL |
2 | HUNTSVILLE | Huntsville | ALABAMA | AL |
3 | DONALDSON | Donaldson | ARKANSAS | AR |
4 | LITTLE ROCK | Little Rock | ARKANSAS | AR |
5 | PINE BLUFF | Pine Bluff | ARKANSAS | AR |
That looks great! OK, I’m ready to clean up the whole data set:
data cfdq2.mailinglist_clean;
/* Use LENGTH to control variable length and logical order */
length Name $50 Address $30 City $20 State $2 Zip 8 Country $3;
/*Rename Country and State (need new lengths for cleaned values)*/
set cfdq2.mailinglist (rename=(Country=_c State=_s));
/* Drop variables with names that start with underscore */
drop _:;
/* Standardize values with the appropriate definition */
Name=DQSTANDARDIZE(Name,'Name', 'ENUSA');
Address=DQSTANDARDIZE(Address,'Address', 'ENUSA');
City=DQSTANDARDIZE(City,'City', 'ENUSA');
State=DQSTANDARDIZE(_s,'State/Province (Abbreviation)', 'ENUSA');
Country=DQSTANDARDIZE(_c,'Country (ISO 3 Char)', 'ENUSA');
run;
Results:
Obs | Name | Address | City | State | Zip Code | Country |
1 | Cat Thomas | 3005 Brookside Dr | Guin | AL | 35563 | USA |
2 | John von Neumann | 4026 Maple Ln | Huntsville | AL | 35802 | USA |
3 | Sam Tennyson | 2715 Mulberry Ave | Donaldson | AR | 71941 | USA |
4 | Christy Curry | 399 Masonic Hill Rd | Little Rock | AR | 72212 | USA |
5 | Wanda Smith | 2190 Cedar St | Pine Bluff | AR | 71601 | USA |
6 | Jim Hoadley | 1078 Davis Ave | Concord | CA | 94520 | USA |
7 | Calvin Wenzel | 1007 Thompson Dr | Dublin | CA | 94568 | USA |
8 | Jennifer Ryan | 4974 Water St | Fremont | CA | 94539 | USA |
9 | **bleep** Reyes | 4819 Kerry Way | Los Angeles | CA | 90017 | USA |
10 | James Attwood | 3238 Heritage Rd | San Joaquin | CA | 93660 | USA |
11 | Echo Kaye | 3215 Thompson Dr | San Leandro | CA | 94578 | USA |
12 | Anna Christian | 3323 Morgan St | Fort Walton Beach | FL | 32548 | USA |
13 | Linda Cheeks | 3385 Rinehart Rd | Miami | FL | 33128 | USA |
14 | Agnes Yates | 4334 Chestnut St | Polk City | FL | 33868 | USA |
15 | Regina Snider | 3697 Rinehart Rd | Sunrise | FL | 33323 | USA |
16 | Nichole Robinson | 622 Chestnut St | Tampa | FL | 33610 | USA |
17 | Kevin Ramirez | 1813 Flint St | Atlanta | GA | 30303 | USA |
18 | Kris Carolan | 3935 Lakeland Park Dr | Duluth | GA | 30097 | USA |
19 | Thomas Jenkins | 2565 Pin Oak Dr | Davenport | IA | 52803 | USA |
20 | Michael Tillman | 4854 Poplar St | Calumet City | IL | 60409 | USA |
And my data is beautifully standardized, using just a few lines of code.
But wait! You say you have process data that is in CAS instead of the Compute Server? Well, no worries. The DATA step code you need is almost identical - you just have to make sure the DATA step reads from and writes to a SAS library that uses the CAS engine.
/* Connect to CAS & load the source data to the CASUSER caslib */
cas conn;
proc casutil sessref=conn;
load data=cfdq2.mailinglist
casout="mailinglist" outcaslib="casuser" replace;
list tables incaslib="casuser";
run;
/* Process in CAS with the DATA step */
/* Change the LIBREF to point to a CAS engine library (casuser)*/
libname casuser cas caslib="casuser" sessref=conn;
data casuser.mailinglist_clean;
/* Use LENGTH to control variable length and logical order */
length Name $50 Address $30 City $20 State $2 Zip 8 Country $3;
/*Rename Country and State (need new lengths for cleaned values)*/
set casuser.mailinglist (rename=(Country=_c State=_s));
/* Drop variables with names that start with underscore */
drop _:;
/* Standardize values with the appropriate definition */
Name=DQSTANDARDIZE(Name,'Name', 'ENUSA');
Address=DQSTANDARDIZE(Address,'Address', 'ENUSA');
City=DQSTANDARDIZE(City,'City', 'ENUSA');
State=DQSTANDARDIZE(_s,'State/Province (Abbreviation)', 'ENUSA');
Country=DQSTANDARDIZE(_c,'Country (ISO 3 Char)', 'ENUSA');
run;
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
Part 2 – Standardization - This one!
Part 3 – Fuzzy Matching with Match Codes
Part 4 – Extracting Data from Text Files
Find more articles from SAS Global Enablement and Learning here.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.