
Coding for Data Quality in SAS Viya Part 4 – Extracting Data from Messy Text Files

Started ‎11-30-2023, modified ‎12-07-2023

I’ve frequently had to extract data from messy text files produced by a process beyond my control. There is valuable data in the file, but it's such a jumbled-up mess! Before I had access to SAS Data Quality, this would have been a daunting task. I'd have written a bunch of regular expressions and used several SAS functions to repeatedly process each line of text to extract my data.
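To give a feel for that manual alternative, here is a minimal Python sketch of the regex-based approach I'm describing (my own illustration, not part of the SAS workflow): classify each comma-separated field as an email, phone, or other value using hand-written patterns.

```python
import re

# Hand-rolled patterns for the value shapes seen in the raw file
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")
PHONE_RE = re.compile(r"^\d{10}$|^\d{3}[.-]\d{3}[.-]\d{4}$")

def classify_fields(line):
    """Label each comma-separated field as email, phone, or other."""
    labels = []
    for field in (f.strip() for f in line.split(",")):
        if EMAIL_RE.match(field):
            labels.append(("email", field))
        elif PHONE_RE.match(field):
            labels.append(("phone", field))
        else:
            labels.append(("other", field))
    return labels

print(classify_fields("7211447436,ballwin, mo,,JOSE ROCHFORD"))
```

Even this toy version shows the problem: every new value shape means another pattern, and distinguishing names from cities still requires more logic on top.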



Here's an example raw text file much like those I've encountered:

P.Abbott@example.Com,paul j abbott,252.492.5915,Henderson, NC
7211447436,ballwin, mo,,JOSE ROCHFORD
7189220353,centerville, oh,,CATHY LAPP
FAYETTEVILLE, ME,910-488-5752,,Richard Aaron DBA AARON LLC
2524925912,henderson, nc,,DAVID B ABBOTT
BUTNER, NC,919-575-4862,,Edward C Abbott DBA ABBOTT AND SONS
CLEMMONS, NC,336-766-5691,,J Abbott DBA ABBOTT LLC
9193626118,apex, nc,,KEN ABBOTT
WINSTON SALEM, NC,336-703-1141,,Mr. Abdallah Fawky DBA FAWKY AND SONS
M.Abdi@example.Com,mohammed j abdi,704.393.5726,Charlotte, NC
A.Phd@example.Com,abdul rahman, phd,919.870.9963,Raleigh, NC


In SAS Viya, I can use this simple two-step plan to extract and clean up the data:
1. Process the raw text using the dqExtract function to extract the Contact Information tokens and create a new variable containing the token values as delimited text. Then, I’ll use the dqExtTokenGet function to extract the tokens I want from the delimited text variable to populate the values for Name, Address, Email, and Phone variables.

2. Use the lowcase and dqStandardize functions to clean up the extracted values and produce my final, clean data set.


Step 1 - Extracting values from the text

Here’s the code I'll use for Step 1:

%let path=~/blogs/cfdq;
filename raw "&path/data/rawinfo.txt";

data Extracted;
	/* Set up new variables in the Program Data Vector (PDV) */
	length DelimitedText $ 200 Name $50 Company $50 Address $150 Email $50 Phone $14;
	drop delimitedtext;

	/* Designate the raw text file I want to read */
	infile raw;

	/* Read a raw text file record into the input buffer */
	input;

	/* Use DQEXTRACT to parse the input buffer using the 'Contact Info' 
	   extraction definition and produce a delimited text value containing 
	   the appropriate tokens */
	DelimitedText=dqExtract(_infile_,'Contact Info');

	/* Use DQEXTTOKENGET to extract the Organization token from the delimited text 
           using the 'Contact Info' extraction definition. */  
	Company=dqExtTokenGet(DelimitedText,'Organization','Contact Info'); 

	/* Continue using DQEXTTOKENGET to extract the Name, Address, E-Mail, and  
	   Phone token values from the delimited text */
	Name=dqExtTokenGet(DelimitedText,'Name','Contact Info'); /* Name token */
	Address=dqExtTokenGet(DelimitedText,'Address','Contact Info'); /* Address token */
	Email=dqExtTokenGet(DelimitedText,'E-mail','Contact Info'); /* E-mail token */
	Phone=dqExtTokenGet(DelimitedText,'Phone','Contact Info'); /* Phone token */
run;


For comparison, here are the first 3 rows of raw data:

P.Abbott@example.Com,paul j abbott,252.492.5915,Henderson, NC 
7211447436,ballwin, mo,,JOSE ROCHFORD 
7189220353,centerville, oh,,CATHY LAPP


And the first three rows of the output data set:

Name          | Company | Address         | Email                | Phone
paul j abbott |         | Henderson, NC   | P.Abbott@example.Com | 252.492.5915
JOSE ROCHFORD |         | ballwin, mo     |                      | 7211447436
CATHY LAPP    |         | centerville, oh |                      | 7189220353


What an amazing transformation! No matter the order of the values in the text file, the dqExtract and dqExtTokenGet functions found and extracted the appropriate values with very little coding required on my part.
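Conceptually, this works because extraction recognizes what each value looks like rather than where it appears in the record. The SAS extraction definitions are far more sophisticated, but a toy Python sketch of the idea (my own approximation, with hypothetical helper names, not the dqExtract implementation) might look like this:

```python
import re

def extract_contact(line):
    """Toy order-independent extraction: tag fields by pattern, not position."""
    fields = [f.strip() for f in line.split(",") if f.strip()]
    tokens = {"Email": "", "Phone": "", "Address": "", "Name": ""}
    used = set()
    for i, f in enumerate(fields):
        if "@" in f:
            tokens["Email"] = f
            used.add(i)
        elif re.fullmatch(r"\d{10}|\d{3}[.-]\d{3}[.-]\d{4}", f):
            tokens["Phone"] = f
            used.add(i)
        elif re.fullmatch(r"[A-Za-z]{2}", f) and i > 0:
            # A two-letter field is treated as a state; the city precedes it
            tokens["Address"] = fields[i - 1] + ", " + f
            used.update({i - 1, i})
    # Whatever remains unclaimed is treated as the name
    tokens["Name"] = " ".join(f for i, f in enumerate(fields) if i not in used)
    return tokens

print(extract_contact("P.Abbott@example.Com,paul j abbott,252.492.5915,Henderson, NC"))
```

The same function handles a record with the values in a completely different order, which is exactly the property that makes dqExtract so useful here.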


Step 2 - Cleaning and standardizing the results

To finish the job, I’ll standardize the values using the lowcase and dqStandardize functions. Here’s the code for Step 2:

data CleanContacts;
	set Extracted;
	length City $50 State $2;

	/* DQSTANDARDIZE standardizes Company using the 'Organization' 
	   standardization definition */
	Company=dqStandardize(Company,'Organization');

	/* Use DQSTANDARDIZE to standardize Name and Phone */
	Name=dqStandardize(Name,'Name');
	Phone=dqStandardize(Phone,'Phone');

	/* DQTOKEN extracts City from Address using the 
	   'City - State/Province - Postal Code' parse definition, then 
	   DQSTANDARDIZE standardizes the result using the 'City' 
	   standardization definition */
	City=dqStandardize(dqtoken(Address,'City','City - State/Province - Postal Code')
	                  ,'City');

	/* DQTOKEN extracts State from Address using the 
	   'City - State/Province - Postal Code' parse definition, then 
	   DQSTANDARDIZE standardizes the result using the 
	   'State/Province (Abbreviation)' standardization definition */
	State=dqStandardize(dqtoken(Address,'State','City - State/Province - Postal Code')
	                   ,'State/Province (Abbreviation)');

	/* LOWCASE standardizes Email */
	Email=lowcase(Email);

	/* Get rid of the original Address column */
	drop Address;
run;


A quick look at the first 3 observations shows the data is nicely cleaned and standardized:


Name          | Company | Email                | Phone          | City        | State
Paul J Abbott |         | p.abbott@example.com | (252) 492 5915 | Henderson   | NC
Jose Rochford |         |                      | (721) 144 7436 | Ballwin     | MO
Cathy Lapp    |         |                      | (718) 922 0353 | Centerville | OH
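For readers curious what standardization does under the hood, here is a hedged Python sketch of the general idea (my own approximation, not the dqStandardize logic): strip punctuation from the phone number and reformat it, and proper-case the name.

```python
import re

def standardize_phone(phone):
    """Reformat a 10-digit US phone number as '(AAA) EEE NNNN'."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) != 10:
        return phone  # leave unrecognized values alone
    return f"({digits[:3]}) {digits[3:6]} {digits[6:]}"

def standardize_name(name):
    """Proper-case each word of a name."""
    return " ".join(w.capitalize() for w in name.split())

print(standardize_phone("252.492.5915"))  # (252) 492 5915
print(standardize_name("JOSE ROCHFORD"))  # Jose Rochford
```

The real standardization definitions are locale-aware and handle far more cases (honorifics, suffixes like PhD, organization keywords), which is why I reach for them instead of maintaining logic like this myself.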


Step 3 - Efficiency tuning

We now know the process works, but making two passes through the data would be inefficient, especially if the data is large. The entire process could have been accomplished in a single DATA step. That code looks like this:

data dq.CleanContacts;
	length DelimitedText $ 200 Name $50 Company $50 City $50 State $2
			 Phone $14 Email $50;
	drop delimitedtext;
	infile raw;
	input;
	DelimitedText=dqExtract(_infile_,'Contact Info');
	Company=dqStandardize(dqExtTokenGet(DelimitedText,'Organization','Contact Info'),'Organization');
	Name=dqStandardize(dqExtTokenGet(DelimitedText,'Name','Contact Info'),'Name');
	City=dqStandardize(dqtoken(dqExtTokenGet(DelimitedText,'Address','Contact Info'),'City'
	            ,'City - State/Province - Postal Code'),'City');
	State=dqStandardize(dqtoken(dqExtTokenGet(DelimitedText,'Address','Contact Info'),'State'
	            ,'City - State/Province - Postal Code'),'State/Province (Abbreviation)');
	Phone=dqStandardize(dqExtTokenGet(DelimitedText,'Phone','Contact Info'),'Phone');
	Email=lowcase(dqExtTokenGet(DelimitedText,'E-mail','Contact Info'));
run;


And, voilà – beautifully clean production data in a single pass:

Name              | Company       | City          | State | Phone          | Email
Paul J Abbott     |               | Henderson     | NC    | (252) 492 5915 | p.abbott@example.com
Jose Rochford     |               | Ballwin       | MO    | (721) 144 7436 |
Cathy Lapp        |               | Centerville   | OH    | (718) 922 0353 |
Richard Aaron     | Aaron LLC     | Fayetteville  | ME    | (910) 488 5752 |
David B Abbott    |               | Henderson     | NC    | (252) 492 5912 |
Edward C Abbott   | Abbott & Sons | Butner        | NC    | (919) 575 4862 |
J Abbott          | Abbott LLC    | Clemmons      | NC    | (336) 766 5691 |
Ken Abbott        |               | Apex          | NC    | (919) 362 6118 |
Abdallah Fawky    | Fawky & Sons  | Winston Salem | NC    | (336) 703 1141 |
Mohammed J Abdi   |               | Charlotte     | NC    | (704) 393 5726 | m.abdi@example.com
Abdul Rahman, PhD |               | Raleigh       | NC    | (919) 870 9963 | a.phd@example.com
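One subtlety worth noticing in the output above: records like "Edward C Abbott DBA ABBOTT AND SONS" were split into separate Name and Company values, with "AND" standardized to "&". A hedged Python sketch of that one behavior (a hypothetical helper of my own, not how the SAS definitions do it):

```python
def split_dba(value):
    """Split 'person DBA company' into (name, company); hypothetical helper."""
    name, sep, company = value.partition(" DBA ")
    if not sep:
        return value, ""  # no DBA clause: the whole value is the name
    # Mimic the Organization standardization seen in the output: '&' for 'AND'
    company = " ".join("&" if w.upper() == "AND" else w.capitalize()
                       for w in company.split())
    return name, company

print(split_dba("Edward C Abbott DBA ABBOTT AND SONS"))
```

With SAS Data Quality, none of this special-casing had to be written by hand; the 'Contact Info' extraction and 'Organization' standardization definitions handled it.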


Do you ever have to deal with messy data like this? What other challenges do you face that might be made easier with SAS Data Quality functions and procedures?

Grab the ZIP file containing the code and data for this blog series from my GitHub here 


Find more articles from SAS Global Enablement and Learning here.
