
Coding for Data Quality in SAS Viya Part 4 – Extracting Data from Messy Text Files

Started ‎11-30-2023, modified ‎12-07-2023

I’ve frequently had to extract data from messy text files produced by a process beyond my control. There is valuable data in the file, but it's such a jumbled-up mess! Before I had access to SAS Data Quality, this would have been a daunting task. I'd have written a bunch of regular expressions and used several SAS functions to repeatedly process each line of text to extract my data.
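To give a feel for that manual alternative, here is a minimal Python sketch of the regex-based approach I'm describing (my own illustration, not part of the SAS workflow): classify each comma-separated field as an email, phone, or other value using hand-written patterns.

```python
import re

# Hand-rolled patterns for the value shapes seen in the raw file
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")
PHONE_RE = re.compile(r"^\d{10}$|^\d{3}[.-]\d{3}[.-]\d{4}$")

def classify_fields(line):
    """Label each comma-separated field as email, phone, or other."""
    labels = []
    for field in (f.strip() for f in line.split(",")):
        if EMAIL_RE.match(field):
            labels.append(("email", field))
        elif PHONE_RE.match(field):
            labels.append(("phone", field))
        else:
            labels.append(("other", field))
    return labels

print(classify_fields("7211447436,ballwin, mo,,JOSE ROCHFORD"))
```

Even this toy version shows the problem: every new value shape means another pattern, and distinguishing names from cities still requires more logic on top.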



Here's an example raw text file much like those I've encountered:

P.Abbott@example.Com,paul j abbott,252.492.5915,Henderson, NC
7211447436,ballwin, mo,,JOSE ROCHFORD
7189220353,centerville, oh,,CATHY LAPP
FAYETTEVILLE, ME,910-488-5752,,Richard Aaron DBA AARON LLC
2524925912,henderson, nc,,DAVID B ABBOTT
BUTNER, NC,919-575-4862,,Edward C Abbott DBA ABBOTT AND SONS
CLEMMONS, NC,336-766-5691,,J Abbott DBA ABBOTT LLC
9193626118,apex, nc,,KEN ABBOTT
WINSTON SALEM, NC,336-703-1141,,Mr. Abdallah Fawky DBA FAWKY AND SONS
M.Abdi@example.Com,mohammed j abdi,704.393.5726,Charlotte, NC
A.Phd@example.Com,abdul rahman, phd,919.870.9963,Raleigh, NC


In SAS Viya, I can use this simple two-step plan to extract and clean up the data:
1. Process the raw text using the dqExtract function to extract the Contact Information tokens and create a new variable containing the token values as delimited text. Then, I’ll use the dqExtTokenGet function to extract the tokens I want from the delimited text variable to populate the values for Name, Address, Email, and Phone variables.

2. Use the lowcase and dqStandardize functions to clean up the extracted values and produce my final, clean data set.


Step 1 - Extracting values from the text

Here’s the code I'll use for Step 1:

%let path=~/blogs/cfdq;
filename raw "&path/data/rawinfo.txt";

data Extracted;
	/* Set up new variables in the Program Data Vector (PDV) */
	length DelimitedText $ 200 Name $50 Company $50 Address $150 Email $50 Phone $14;
	drop delimitedtext;

	/* Designate the raw text file I want to read */
	infile raw;

	/* Read a raw text file record into the input buffer */
	input;

	/* Use DQEXTRACT to parse the input buffer using the 'Contact Info' 
	   extraction definition and produce a delimited text value containing 
	   the appropriate tokens */
	DelimitedText=dqExtract(_infile_,'Contact Info');

	/* Use DQEXTTOKENGET to extract the Organization token from the delimited text 
           using the 'Contact Info' extraction definition. */  
	Company=dqExtTokenGet(DelimitedText,'Organization','Contact Info'); 

	/* Continue using DQEXTTOKENGET to extract the Name, Address, E-Mail, and  
	   Phone token values from the delimited text */
	Name=dqExtTokenGet(DelimitedText,'Name','Contact Info'); /* Name token */
	Address=dqExtTokenGet(DelimitedText,'Address','Contact Info'); /* Address token */
	Email=dqExtTokenGet(DelimitedText,'E-mail','Contact Info'); /* E-mail token */
	Phone=dqExtTokenGet(DelimitedText,'Phone','Contact Info'); /* Phone token */
run;


For comparison, here are the first 3 rows of raw data:

P.Abbott@example.Com,paul j abbott,252.492.5915,Henderson, NC 
7211447436,ballwin, mo,,JOSE ROCHFORD 
7189220353,centerville, oh,,CATHY LAPP


And the first three rows of the output data set:

Name          | Company | Address         | Email                | Phone
paul j abbott |         | Henderson, NC   | P.Abbott@example.Com | 252.492.5915
JOSE ROCHFORD |         | ballwin, mo     |                      | 7211447436
CATHY LAPP    |         | centerville, oh |                      | 7189220353


What an amazing transformation! No matter the order of the values in the text file, the dqExtract and dqExtTokenGet functions found and extracted the appropriate values with very little coding required on my part.
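Conceptually, this works because extraction recognizes what each value looks like rather than where it appears in the record. The SAS extraction definitions are far more sophisticated, but a toy Python sketch of the idea (my own approximation, with hypothetical helper names, not the dqExtract implementation) might look like this:

```python
import re

def extract_contact(line):
    """Toy order-independent extraction: tag fields by pattern, not position."""
    fields = [f.strip() for f in line.split(",") if f.strip()]
    tokens = {"Email": "", "Phone": "", "Address": "", "Name": ""}
    used = set()
    for i, f in enumerate(fields):
        if "@" in f:
            tokens["Email"] = f
            used.add(i)
        elif re.fullmatch(r"\d{10}|\d{3}[.-]\d{3}[.-]\d{4}", f):
            tokens["Phone"] = f
            used.add(i)
        elif re.fullmatch(r"[A-Za-z]{2}", f) and i > 0:
            # A two-letter field is treated as a state; the city precedes it
            tokens["Address"] = fields[i - 1] + ", " + f
            used.update({i - 1, i})
    # Whatever remains unclaimed is treated as the name
    tokens["Name"] = " ".join(f for i, f in enumerate(fields) if i not in used)
    return tokens

print(extract_contact("P.Abbott@example.Com,paul j abbott,252.492.5915,Henderson, NC"))
```

The same function handles a record with the values in a completely different order, which is exactly the property that makes dqExtract so useful here.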


Step 2 - Cleaning and standardizing the results

To finish the job, I’ll standardize the values using the lowcase and dqStandardize functions. Here’s the code for Step 2:

data CleanContacts;
	set Extracted;
	length City $50 State $2;

	/* DQSTANDARDIZE standardizes Company using the 'Organization' 
	   standardization definition */
	Company=dqStandardize(Company,'Organization');

	/* Use DQSTANDARDIZE to standardize Name and Phone */
	Name=dqStandardize(Name,'Name');
	Phone=dqStandardize(Phone,'Phone');

	/* DQTOKEN extracts City from Address using the 
	   'City - State/Province - Postal Code' parse definition, then 
	   DQSTANDARDIZE standardizes the result using the 'City' 
	   standardization definition */
	City=dqStandardize(dqtoken(Address,'City','City - State/Province - Postal Code')
	                  ,'City');

	/* DQTOKEN extracts State from Address using the 
	   'City - State/Province - Postal Code' parse definition, then 
	   DQSTANDARDIZE standardizes the result using the 
	   'State/Province (Abbreviation)' standardization definition */
	State=dqStandardize(dqtoken(Address,'State','City - State/Province - Postal Code')
	                   ,'State/Province (Abbreviation)');

	/* LOWCASE standardizes Email */
	Email=lowcase(Email);

	/* Get rid of the original Address column */
	drop Address;
run;


A quick look at the first 3 observations shows the data is nicely cleaned and standardized:


Name          | Company | Email                | Phone          | City        | State
Paul J Abbott |         | p.abbott@example.com | (252) 492 5915 | Henderson   | NC
Jose Rochford |         |                      | (721) 144 7436 | Ballwin     | MO
Cathy Lapp    |         |                      | (718) 922 0353 | Centerville | OH
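For readers curious what standardization does under the hood, here is a hedged Python sketch of the general idea (my own approximation, not the dqStandardize logic): strip punctuation from the phone number and reformat it, and proper-case the name.

```python
import re

def standardize_phone(phone):
    """Reformat a 10-digit US phone number as '(AAA) EEE NNNN'."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) != 10:
        return phone  # leave unrecognized values alone
    return f"({digits[:3]}) {digits[3:6]} {digits[6:]}"

def standardize_name(name):
    """Proper-case each word of a name."""
    return " ".join(w.capitalize() for w in name.split())

print(standardize_phone("252.492.5915"))  # (252) 492 5915
print(standardize_name("JOSE ROCHFORD"))  # Jose Rochford
```

The real standardization definitions are locale-aware and handle far more cases (honorifics, suffixes like PhD, organization keywords), which is why I reach for them instead of maintaining logic like this myself.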


Step 3 - Efficiency tuning

We now know the process works, but making two passes through the data would be inefficient, especially if the data is large. The entire process could have been accomplished in a single DATA step. That code looks like this:

data dq.CleanContacts;
	length DelimitedText $ 200 Name $50 Company $50 City $50 State $2
			 Phone $14 Email $50;
	drop delimitedtext;
	infile raw;
	input;
	DelimitedText=dqExtract(_infile_,'Contact Info');
	Company=dqStandardize(dqExtTokenGet(DelimitedText,'Organization','Contact Info'),'Organization');
	Name=dqStandardize(dqExtTokenGet(DelimitedText,'Name','Contact Info'),'Name');
	City=dqStandardize(dqtoken(dqExtTokenGet(DelimitedText,'Address','Contact Info'),'City'
	            ,'City - State/Province - Postal Code'),'City');
	State=dqStandardize(dqtoken(dqExtTokenGet(DelimitedText,'Address','Contact Info'),'State'
	            ,'City - State/Province - Postal Code'),'State/Province (Abbreviation)');
	Phone=dqStandardize(dqExtTokenGet(DelimitedText,'Phone','Contact Info'),'Phone');
	Email=lowcase(dqExtTokenGet(DelimitedText,'E-mail','Contact Info'));
run;


And, voilà – beautifully clean production data in a single pass:

Name              | Company       | City          | State | Phone          | Email
Paul J Abbott     |               | Henderson     | NC    | (252) 492 5915 | p.abbott@example.com
Jose Rochford     |               | Ballwin       | MO    | (721) 144 7436 |
Cathy Lapp        |               | Centerville   | OH    | (718) 922 0353 |
Richard Aaron     | Aaron LLC     | Fayetteville  | ME    | (910) 488 5752 |
David B Abbott    |               | Henderson     | NC    | (252) 492 5912 |
Edward C Abbott   | Abbott & Sons | Butner        | NC    | (919) 575 4862 |
J Abbott          | Abbott LLC    | Clemmons      | NC    | (336) 766 5691 |
Ken Abbott        |               | Apex          | NC    | (919) 362 6118 |
Abdallah Fawky    | Fawky & Sons  | Winston Salem | NC    | (336) 703 1141 |
Mohammed J Abdi   |               | Charlotte     | NC    | (704) 393 5726 | m.abdi@example.com
Abdul Rahman, PhD |               | Raleigh       | NC    | (919) 870 9963 | a.phd@example.com
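One subtlety worth noticing in the output above: records like "Edward C Abbott DBA ABBOTT AND SONS" were split into separate Name and Company values, with "AND" standardized to "&". A hedged Python sketch of that one behavior (a hypothetical helper of my own, not how the SAS definitions do it):

```python
def split_dba(value):
    """Split 'person DBA company' into (name, company); hypothetical helper."""
    name, sep, company = value.partition(" DBA ")
    if not sep:
        return value, ""  # no DBA clause: the whole value is the name
    # Mimic the Organization standardization seen in the output: '&' for 'AND'
    company = " ".join("&" if w.upper() == "AND" else w.capitalize()
                       for w in company.split())
    return name, company

print(split_dba("Edward C Abbott DBA ABBOTT AND SONS"))
```

With SAS Data Quality, none of this special-casing had to be written by hand; the 'Contact Info' extraction and 'Organization' standardization definitions handled it.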


Do you ever have to deal with messy data like this? What other challenges do you face that might be made easier with SAS Data Quality functions and procedures?

Grab the ZIP file containing the code and data for this blog series from my GitHub here 


Find more articles from SAS Global Enablement and Learning here.
