
Coding for Data Quality in SAS Viya Part 4 – Extracting Data from Messy Text Files

Started ‎11-30-2023, modified ‎12-07-2023

I’ve frequently had to extract data from messy text files produced by a process beyond my control. There is valuable data in the file, but it's such a jumbled-up mess! Before I had access to SAS Data Quality, this would have been a daunting task. I'd have written a bunch of regular expressions and used several SAS functions to repeatedly process each line of text to extract my data.
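To give a feel for that manual alternative, here is a minimal Python sketch of the regex-based approach I'm describing (my own illustration, not part of the SAS workflow): classify each comma-separated field as an email, phone, or other value using hand-written patterns.

```python
import re

# Hand-rolled patterns for the value shapes seen in the raw file
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")
PHONE_RE = re.compile(r"^\d{10}$|^\d{3}[.-]\d{3}[.-]\d{4}$")

def classify_fields(line):
    """Label each comma-separated field as email, phone, or other."""
    labels = []
    for field in (f.strip() for f in line.split(",")):
        if EMAIL_RE.match(field):
            labels.append(("email", field))
        elif PHONE_RE.match(field):
            labels.append(("phone", field))
        else:
            labels.append(("other", field))
    return labels

print(classify_fields("7211447436,ballwin, mo,,JOSE ROCHFORD"))
```

Even this toy version shows the problem: every new value shape means another pattern, and distinguishing names from cities still requires more logic on top.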



Here's an example raw text file much like those I've encountered:

P.Abbott@example.Com,paul j abbott,252.492.5915,Henderson, NC
7211447436,ballwin, mo,,JOSE ROCHFORD
7189220353,centerville, oh,,CATHY LAPP
FAYETTEVILLE, ME,910-488-5752,,Richard Aaron DBA AARON LLC
2524925912,henderson, nc,,DAVID B ABBOTT
BUTNER, NC,919-575-4862,,Edward C Abbott DBA ABBOTT AND SONS
CLEMMONS, NC,336-766-5691,,J Abbott DBA ABBOTT LLC
9193626118,apex, nc,,KEN ABBOTT
WINSTON SALEM, NC,336-703-1141,,Mr. Abdallah Fawky DBA FAWKY AND SONS
M.Abdi@example.Com,mohammed j abdi,704.393.5726,Charlotte, NC
A.Phd@example.Com,abdul rahman, phd,919.870.9963,Raleigh, NC


In SAS Viya, I can use this simple two-step plan to extract and clean up the data:
1. Process the raw text using the dqExtract function to extract the Contact Information tokens and create a new variable containing the token values as delimited text. Then, I’ll use the dqExtTokenGet function to extract the tokens I want from the delimited text variable to populate the values for Name, Address, Email, and Phone variables.

2. Use the lowcase and dqStandardize functions to clean up the extracted values and produce my final, clean data set.


Step 1 - Extracting values from the text

Here’s the code I'll use for Step 1:

%let path=~/blogs/cfdq;
filename raw "&path/data/rawinfo.txt";

data Extracted;
	/* Set up new variables in the Program Data Vector (PDV) */
	length DelimitedText $ 200 Name $50 Company $50 Address $150 Email $50 Phone $14;
	drop delimitedtext;

	/* Designate the raw text file I want to read */
	infile raw;

	/* Read a raw text file record into the input buffer */
	input;

	/* Use DQEXTRACT to parse the input buffer using the 'Contact Info' 
	   extraction definition and produce a delimited text value containing 
	   the appropriate tokens */
	DelimitedText=dqExtract(_infile_,'Contact Info');

	/* Use DQEXTTOKENGET to extract the Organization token from the delimited text 
           using the 'Contact Info' extraction definition. */  
	Company=dqExtTokenGet(DelimitedText,'Organization','Contact Info'); 

	/* Continue using DQEXTTOKENGET to extract the Name, Address, E-Mail, and  
	   Phone token values from the delimited text */
	Name=dqExtTokenGet(DelimitedText,'Name','Contact Info'); /* Name token */
	Address=dqExtTokenGet(DelimitedText,'Address','Contact Info'); /* Address token */
	Email=dqExtTokenGet(DelimitedText,'E-mail','Contact Info'); /* E-mail token */
	Phone=dqExtTokenGet(DelimitedText,'Phone','Contact Info'); /* Phone token */
run;


For comparison, here are the first 3 rows of raw data:

P.Abbott@example.Com,paul j abbott,252.492.5915,Henderson, NC 
7211447436,ballwin, mo,,JOSE ROCHFORD 
7189220353,centerville, oh,,CATHY LAPP


And the first three rows of the output data set:

Name          | Company | Address         | Email                | Phone
paul j abbott |         | Henderson, NC   | P.Abbott@example.Com | 252.492.5915
JOSE ROCHFORD |         | ballwin, mo     |                      | 7211447436
CATHY LAPP    |         | centerville, oh |                      | 7189220353


What an amazing transformation! No matter the order of the values in the text file, the dqExtract and dqExtTokenGet functions found and extracted the appropriate values with very little coding required on my part.
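Conceptually, this works because extraction recognizes what each value looks like rather than where it appears in the record. The SAS extraction definitions are far more sophisticated, but a toy Python sketch of the idea (my own approximation, with hypothetical helper names, not the dqExtract implementation) might look like this:

```python
import re

def extract_contact(line):
    """Toy order-independent extraction: tag fields by pattern, not position."""
    fields = [f.strip() for f in line.split(",") if f.strip()]
    tokens = {"Email": "", "Phone": "", "Address": "", "Name": ""}
    used = set()
    for i, f in enumerate(fields):
        if "@" in f:
            tokens["Email"] = f
            used.add(i)
        elif re.fullmatch(r"\d{10}|\d{3}[.-]\d{3}[.-]\d{4}", f):
            tokens["Phone"] = f
            used.add(i)
        elif re.fullmatch(r"[A-Za-z]{2}", f) and i > 0:
            # A two-letter field is treated as a state; the city precedes it
            tokens["Address"] = fields[i - 1] + ", " + f
            used.update({i - 1, i})
    # Whatever remains unclaimed is treated as the name
    tokens["Name"] = " ".join(f for i, f in enumerate(fields) if i not in used)
    return tokens

print(extract_contact("P.Abbott@example.Com,paul j abbott,252.492.5915,Henderson, NC"))
```

The same function handles a record with the values in a completely different order, which is exactly the property that makes dqExtract so useful here.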


Step 2 - Cleaning and standardizing the results

To finish the job, I’ll standardize the values using the lowcase and dqStandardize functions. Here’s the code for Step 2:

data CleanContacts;
	set Extracted;
	length City $50 State $2;

	/* DQSTANDARDIZE standardizes Company using the 'Organization' 
	   standardization definition */
	Company=dqStandardize(Company,'Organization');

	/* Use DQSTANDARDIZE to standardize Name and Phone */
	Name=dqStandardize(Name,'Name');
	Phone=dqStandardize(Phone,'Phone');

	/* DQTOKEN extracts City from Address using the 
	   'City - State/Province - Postal Code' parse definition, then 
	   DQSTANDARDIZE standardizes the result using the 'City' 
	   standardization definition */
	City=dqStandardize(dqtoken(Address,'City','City - State/Province - Postal Code')
	                  ,'City');

	/* DQTOKEN extracts State from Address using the 
	   'City - State/Province - Postal Code' parse definition, then 
	   DQSTANDARDIZE standardizes the result using the 
	   'State/Province (Abbreviation)' standardization definition */
	State=dqStandardize(dqtoken(Address,'State','City - State/Province - Postal Code')
	                   ,'State/Province (Abbreviation)');

	/* LOWCASE standardizes Email */
	Email=lowcase(Email);

	/* Get rid of the original Address column */
	drop Address;
run;


A quick look at the first 3 observations shows the data is nicely cleaned and standardized:


Name          | Company | Email                | Phone          | City        | State
Paul J Abbott |         | p.abbott@example.com | (252) 492 5915 | Henderson   | NC
Jose Rochford |         |                      | (721) 144 7436 | Ballwin     | MO
Cathy Lapp    |         |                      | (718) 922 0353 | Centerville | OH
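For readers curious what standardization does under the hood, here is a hedged Python sketch of the general idea (my own approximation, not the dqStandardize logic): strip punctuation from the phone number and reformat it, and proper-case the name.

```python
import re

def standardize_phone(phone):
    """Reformat a 10-digit US phone number as '(AAA) EEE NNNN'."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) != 10:
        return phone  # leave unrecognized values alone
    return f"({digits[:3]}) {digits[3:6]} {digits[6:]}"

def standardize_name(name):
    """Proper-case each word of a name."""
    return " ".join(w.capitalize() for w in name.split())

print(standardize_phone("252.492.5915"))  # (252) 492 5915
print(standardize_name("JOSE ROCHFORD"))  # Jose Rochford
```

The real standardization definitions are locale-aware and handle far more cases (honorifics, suffixes like PhD, organization keywords), which is why I reach for them instead of maintaining logic like this myself.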


Step 3 - Efficiency tuning

We now know the process works, but making two passes through the data would be inefficient, especially if the data is large. The entire process could have been accomplished in a single DATA step. That code looks like this:

data dq.CleanContacts;
	length DelimitedText $ 200 Name $50 Company $50 City $50 State $2
			 Phone $14 Email $50;
	drop delimitedtext;
	infile raw;
	input;
	DelimitedText=dqExtract(_infile_,'Contact Info');
	Company=dqStandardize(dqExtTokenGet(DelimitedText,'Organization','Contact Info'),'Organization');
	Name=dqStandardize(dqExtTokenGet(DelimitedText,'Name','Contact Info'),'Name');
	City=dqStandardize(dqtoken(dqExtTokenGet(DelimitedText,'Address','Contact Info'),'City'
	            ,'City - State/Province - Postal Code'),'City');
	State=dqStandardize(dqtoken(dqExtTokenGet(DelimitedText,'Address','Contact Info'),'State'
	            ,'City - State/Province - Postal Code'),'State/Province (Abbreviation)');
	Phone=dqStandardize(dqExtTokenGet(DelimitedText,'Phone','Contact Info'),'Phone');
	Email=lowcase(dqExtTokenGet(DelimitedText,'E-mail','Contact Info'));
run;


And, voilà – beautifully clean production data in a single pass:

Name              | Company       | City          | State | Phone          | Email
Paul J Abbott     |               | Henderson     | NC    | (252) 492 5915 | p.abbott@example.com
Jose Rochford     |               | Ballwin       | MO    | (721) 144 7436 |
Cathy Lapp        |               | Centerville   | OH    | (718) 922 0353 |
Richard Aaron     | Aaron LLC     | Fayetteville  | ME    | (910) 488 5752 |
David B Abbott    |               | Henderson     | NC    | (252) 492 5912 |
Edward C Abbott   | Abbott & Sons | Butner        | NC    | (919) 575 4862 |
J Abbott          | Abbott LLC    | Clemmons      | NC    | (336) 766 5691 |
Ken Abbott        |               | Apex          | NC    | (919) 362 6118 |
Abdallah Fawky    | Fawky & Sons  | Winston Salem | NC    | (336) 703 1141 |
Mohammed J Abdi   |               | Charlotte     | NC    | (704) 393 5726 | m.abdi@example.com
Abdul Rahman, PhD |               | Raleigh       | NC    | (919) 870 9963 | a.phd@example.com
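One subtlety worth noticing in the output above: records like "Edward C Abbott DBA ABBOTT AND SONS" were split into separate Name and Company values, with "AND" standardized to "&". A hedged Python sketch of that one behavior (a hypothetical helper of my own, not how the SAS definitions do it):

```python
def split_dba(value):
    """Split 'person DBA company' into (name, company); hypothetical helper."""
    name, sep, company = value.partition(" DBA ")
    if not sep:
        return value, ""  # no DBA clause: the whole value is the name
    # Mimic the Organization standardization seen in the output: '&' for 'AND'
    company = " ".join("&" if w.upper() == "AND" else w.capitalize()
                       for w in company.split())
    return name, company

print(split_dba("Edward C Abbott DBA ABBOTT AND SONS"))
```

With SAS Data Quality, none of this special-casing had to be written by hand; the 'Contact Info' extraction and 'Organization' standardization definitions handled it.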


Do you ever have to deal with messy data like this? What other challenges do you face that might be made easier with SAS Data Quality functions and procedures?

Grab the ZIP file containing the code and data for this blog series from my GitHub here 


Find more articles from SAS Global Enablement and Learning here.
