Introduction
I’ve frequently had to extract data from messy text files produced by a process beyond my control. There is valuable data in the file, but it's such a jumbled-up mess! Before I had access to SAS Data Quality, this would have been a daunting task: I'd have written a bunch of regular expressions and used several SAS functions to repeatedly process each line of text to extract my data.
Here's an example raw text file much like those I've encountered:
P.Abbott@example.Com,paul j abbott,252.492.5915,Henderson, NC
7211447436,ballwin, mo,j.rochford@example.com,JOSE ROCHFORD
7189220353,centerville, oh,c.lapp@example.com,CATHY LAPP
FAYETTEVILLE, ME,910-488-5752,AARON4@example.com,Richard Aaron DBA AARON LLC
2524925912,henderson, nc,d.abbott@example.com,DAVID B ABBOTT
BUTNER, NC,919-575-4862,ABBOTT7@example.com,Edward C Abbott DBA ABBOTT AND SONS
CLEMMONS, NC,336-766-5691,ABBOTT8@example.com,J Abbott DBA ABBOTT LLC
9193626118,apex, nc,k.abbott@example.com,KEN ABBOTT
WINSTON SALEM, NC,336-703-1141,FAWKY10@example.com,Mr. Abdallah Fawky DBA FAWKY AND SONS
M.Abdi@example.Com,mohammed j abdi,704.393.5726,Charlotte, NC
A.Phd@example.Com,abdul rahman, phd,919.870.9963,Raleigh, NC
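For contrast, here's roughly what my old regex-and-functions approach would look like just to pull out the e-mail address and phone number. This is only a sketch, with patterns and names of my own invention, and every other field would need similar special-case handling:

/* A rough sketch of the pre-Data Quality approach: hand-written Perl   */
/* regular expressions to fish out just the e-mail and phone values.    */
/* Patterns, names, and the literal file path are illustrative only.    */
data RegexAttempt;
   length Email $50 Phone $14;
   retain reEmail rePhone;
   drop reEmail rePhone;
   infile "~/blogs/cfdq/data/rawinfo.txt";
   input;
   /* Compile the patterns once, on the first iteration */
   if _n_=1 then do;
      reEmail=prxparse('/([\w.\-]+@[\w.\-]+\.\w+)/');
      rePhone=prxparse('/(\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4})/');
   end;
   /* Keep the first e-mail-like and phone-like substring in each record */
   if prxmatch(reEmail,_infile_) then Email=prxposn(reEmail,1,_infile_);
   if prxmatch(rePhone,_infile_) then Phone=prxposn(rePhone,1,_infile_);
run;

It works, more or less, but every new field or format variation means another pattern to write, test, and maintain.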
In SAS Viya, I can use this simple two-step plan to extract and clean up the data:
1. Process the raw text using the dqExtract function to extract the Contact Information tokens and create a new variable containing the token values as delimited text. Then use the dqExtTokenGet function to extract the tokens I want from that delimited text to populate the Name, Address, Email, and Phone variables.
2. Use the lowcase and dqStandardize functions to clean up the extracted values and produce my final, clean data set.
Step 1 - Extracting values from the text
Here’s the code I'll use for Step 1:
%let path=~/blogs/cfdq;
filename raw "&path/data/rawinfo.txt";

data Extracted;
   /* Set up new variables in the Program Data Vector (PDV) */
   length DelimitedText $200 Name $50 Company $50 Address $150 Email $50 Phone $14;
   drop DelimitedText;
   /* Designate the raw text file I want to read */
   infile raw;
   /* Read a raw text file record into the input buffer */
   input;
   /* Use DQEXTRACT to parse the input buffer using the 'Contact Info'
      extraction definition and produce a delimited text value containing
      the appropriate tokens */
   DelimitedText=dqExtract(_infile_,'Contact Info');
   /* Use DQEXTTOKENGET to extract the Organization token from the delimited
      text using the 'Contact Info' extraction definition */
   Company=dqExtTokenGet(DelimitedText,'Organization','Contact Info');
   /* Continue using DQEXTTOKENGET to extract the Name, Address, E-mail, and
      Phone token values from the delimited text */
   Name=dqExtTokenGet(DelimitedText,'Name','Contact Info');       /* Name token    */
   Address=dqExtTokenGet(DelimitedText,'Address','Contact Info'); /* Address token */
   Email=dqExtTokenGet(DelimitedText,'E-mail','Contact Info');    /* E-mail token  */
   Phone=dqExtTokenGet(DelimitedText,'Phone','Contact Info');     /* Phone token   */
run;
For comparison, here are the first three rows of raw data:
P.Abbott@example.Com,paul j abbott,252.492.5915,Henderson, NC
7211447436,ballwin, mo,j.rochford@example.com,JOSE ROCHFORD
7189220353,centerville, oh,c.lapp@example.com,CATHY LAPP
And the first three rows of the output data set:
Name          | Company | Address         | Email                  | Phone
paul j abbott |         | Henderson, NC   | P.Abbott@example.Com   | 252.492.5915
JOSE ROCHFORD |         | ballwin, mo     | j.rochford@example.com | 7211447436
CATHY LAPP    |         | centerville, oh | c.lapp@example.com     | 7189220353
What an amazing transformation! No matter the order of the values in the text file, the dqExtract and dqExtTokenGet functions found and extracted the appropriate values with very little coding required on my part.
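Of course, with input this messy, some records might not yield every token. A quick check like this one (my own addition, not part of the original two-step plan) counts how many records came back without a Name or an Email value:

proc sql;
   /* Count records where extraction returned a blank Name or Email */
   select count(*)            as TotalRecords,
          sum(missing(Name))  as BlankName,
          sum(missing(Email)) as BlankEmail
   from Extracted;
quit;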
Step 2 - Cleaning and standardizing the results
To finish the job, I’ll standardize the values using the lowcase and dqStandardize functions. Here’s the code for Step 2:
data CleanContacts;
   set Extracted;
   /* DQSTANDARDIZE standardizes Company using the 'Organization'
      standardization definition */
   Company=dqStandardize(Company,'Organization');
   /* Use DQSTANDARDIZE to standardize Name and Phone */
   Name=dqStandardize(Name,'Name');
   Phone=dqStandardize(Phone,'Phone');
   /* DQTOKEN extracts City from Address using the
      'City - State/Province - Postal Code' parse definition, then
      DQSTANDARDIZE standardizes the results using the 'City'
      standardization definition */
   City=dqStandardize(
           dqToken(Address,'City','City - State/Province - Postal Code')
        ,'City');
   /* DQTOKEN extracts State from Address using the
      'City - State/Province - Postal Code' parse definition, then
      DQSTANDARDIZE standardizes the results using the
      'State/Province (Abbreviation)' standardization definition */
   State=dqStandardize(
            dqToken(Address,'State','City - State/Province - Postal Code')
         ,'State/Province (Abbreviation)');
   /* LOWCASE standardizes Email */
   Email=lowcase(Email);
   /* Get rid of the original Address column */
   drop Address;
run;
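If you want to see an individual standardization definition in action before running the whole step, a quick DATA _NULL_ test like this (my own sketch, using two values from the raw file) writes the standardized results to the log. Per the output shown below, 'paul j abbott' and '252.492.5915' come back as 'Paul J Abbott' and '(252) 492 5915':

/* A quick interactive test (my own sketch): standardize two literal  */
/* values from the raw file and write the results to the SAS log.     */
data _null_;
   name=dqStandardize('paul j abbott','Name');
   phone=dqStandardize('252.492.5915','Phone');
   put name= phone=;
run;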
A quick look at the first three observations shows the data is nicely cleaned and standardized:
Name          | Company | Email                  | Phone          | City        | State
Paul J Abbott |         | p.abbott@example.com   | (252) 492 5915 | Henderson   | NC
Jose Rochford |         | j.rochford@example.com | (721) 144 7436 | Ballwin     | MO
Cathy Lapp    |         | c.lapp@example.com     | (718) 922 0353 | Centerville | OH
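It's also worth a quick frequency check (my own habit, not something the original plan requires) to confirm the standardized State values collapsed to clean two-letter abbreviations:

proc freq data=CleanContacts;
   /* One-way frequency table of the standardized state abbreviations */
   tables State;
run;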
Step 3 - Efficiency tuning
We now know the process works, but making two passes through the data would be inefficient, especially if the data is large. The entire process can be accomplished in a single DATA step. That code looks like this:
data dq.CleanContacts;
   length DelimitedText $200 Name $50 Company $50 City $50 State $2
          Phone $14 Email $50;
   drop DelimitedText;
   infile raw;
   input;
   DelimitedText=dqExtract(_infile_,'Contact Info');
   Company=dqStandardize(
              dqExtTokenGet(DelimitedText,'Organization','Contact Info'),'Organization');
   Name=dqStandardize(
           dqExtTokenGet(DelimitedText,'Name','Contact Info'),'Name');
   City=dqStandardize(
           dqToken(dqExtTokenGet(DelimitedText,'Address','Contact Info'),'City'
                  ,'City - State/Province - Postal Code'),'City');
   State=dqStandardize(
            dqToken(dqExtTokenGet(DelimitedText,'Address','Contact Info'),'State'
                   ,'City - State/Province - Postal Code'),'State/Province (Abbreviation)');
   Phone=dqStandardize(dqExtTokenGet(DelimitedText,'Phone','Contact Info'),'Phone');
   Email=lowcase(dqExtTokenGet(DelimitedText,'E-mail','Contact Info'));
run;
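One note on that version: dq.CleanContacts is a two-level name, so it assumes a libref named dq has already been assigned to a permanent library. Something along these lines would do it (the location is my own placeholder; point it at wherever your production data should live):

libname dq "&path/data";   /* assumed output location -- adjust to your own library */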
And, voilà – beautifully clean production data in a single pass:
Name              | Company       | City          | State | Phone          | Email
Paul J Abbott     |               | Henderson     | NC    | (252) 492 5915 | p.abbott@example.com
Jose Rochford     |               | Ballwin       | MO    | (721) 144 7436 | j.rochford@example.com
Cathy Lapp        |               | Centerville   | OH    | (718) 922 0353 | c.lapp@example.com
Richard Aaron     | Aaron LLC     | Fayetteville  | ME    | (910) 488 5752 | aaron4@example.com
David B Abbott    |               | Henderson     | NC    | (252) 492 5912 | d.abbott@example.com
Edward C Abbott   | Abbott & Sons | Butner        | NC    | (919) 575 4862 | abbott7@example.com
J Abbott          | Abbott LLC    | Clemmons      | NC    | (336) 766 5691 | abbott8@example.com
Ken Abbott        |               | Apex          | NC    | (919) 362 6118 | k.abbott@example.com
Abdallah Fawky    | Fawky & Sons  | Winston Salem | NC    | (336) 703 1141 | fawky10@example.com
Mohammed J Abdi   |               | Charlotte     | NC    | (704) 393 5726 | m.abdi@example.com
Abdul Rahman, PhD |               | Raleigh       | NC    | (919) 870 9963 | a.phd@example.com
Do you ever have to deal with messy data like this? What other challenges do you face that might be made easier with SAS Data Quality functions and procedures?
Grab the ZIP file containing the code and data for this blog series from my GitHub here.
Links to prior posts in this series:
Part 1 – Fundamentals
Part 2 – Standardization
Part 3 – Fuzzy Matching with Match Codes
Find more articles from SAS Global Enablement and Learning here.