BookmarkSubscribeRSS Feed

Coding for Data Quality in SAS Viya Part 3 – Matching

Started ‎10-26-2023 by
Modified ‎12-07-2023 by
Views 1,623

Computers demand standardization and precision. So, processing text data with a computer is dicey at best. The more humans get involved in entering or generating text data, the more problematic it can be. Humans can clearly see that John Q. Publix of Williamsburg, Virginia is the same as Jon Quincey Public of Wmbg, VA. But a computer will have none of that – it knows full well that those two text strings are not the same! SAS Data Quality match codes can provide a quick and easy solution.

 

Let’s imagine you own a company that sells cat food. To increase sales, you intend to solicit new customers by mailing them a coupon for a free sample. You decide to buy the Cat Lover’s Magazine mailing list as a resource because mailings are expensive, and sending random solicitations to folks who don’t own a cat is a waste of resources. But it’s wasteful to solicit existing customers, too. How can we compare the mailing list to our existing customer list so that we can solicit just the new people? Here is a sample of the data:

 

Customers

 

Name Street Address City State
Catherine M. Thomas 3005 Brookside Drive Guin AL
János L. von Neumann 4026 Maple Lane Huntsville AL
Sam C. Tennyson 2715 Mulberry Avenue Donaldson AR
Kristi R. Curry 399 Masonic Hill Road Little Rock AR
Wanda R. Smith 2190 Cedar Street Pine Bluff AR

 

Cat Lovers Mailing List

 

Name Address City State
BEAUTIFUL KITTY 3005 BROOKSIDE DR GUIN AL
JOHN VON NEUMANN 4026 MAPLE LN HUNTSVILLE AL
SAM CHARLES 3750 MULBERRY AVE GROTON CT
CHRISTY CURRY 399 MASONIC HILL RD LITTLE ROCK AR
TIGER MAN 2190 CEDAR ST PINE BLUFF AR

 

The computer wouldn’t match any of those rows, yet I can see some matches:

  1. Customer Catherine Thomas has subscribed her cat, Beautiful Kitty.
  2. Customer János von Neumann is subscribed as John von Neumann.
  3. Customer Kristi R. Curry and subscriber Christy Curry are probably the same person.
  4. Customer Wanda Smith has also subscribed her cat, Tiger Man.

 

I think I should exclude names from the matching criteria. I’ll match households, using address, city, and state as the keys. But the mailing list is all upper case and uses standard abbreviations like “ST” for “Street” and “AVE” for “Avenue”. My customer data is in proper case and spells out those words. This could be a coding nightmare. SAS Data Quality to the rescue!

 

I’d like to use the ‘Address’ and ‘City - State/Province - Postal Code’ match codes to ensure I select only those Cat Lovers Magazine subscribers who are not in my customer data. What does a match code look like? This code will give me a peek:

 

proc SQL;
title "Address Match Codes";
select 'Customer List' as Source
       ,Street1,dqmatch(Street1,'Address') 'Match Code'
   from dq.customers (obs=1)
   union all
select 'Cat Lovers',Address,dqmatch(Address,'Address')
   from dq.CatLovers (obs=1);
title "City/State Match Codes";
select 'Customer List' as Source,City, State
       ,dqmatch(CATX(', ',City, State)
       ,'City - State/Province - Postal Code') 'Match Code'
   from dq.customers (obs=1)
   union all
select 'Cat Lovers',City, State
       ,dqmatch(CATX(', ',City, State)
       ,'City - State/Province - Postal Code') 
   from dq.CatLovers (obs=1);
quit;

 

Results:

 

 Address Match Codes

 

 Source Street Address Match Code
Customer List 3005 Brookside Drive $$$$$$$$$$$$$$K005$$$$$$$$$$MY&34&~ $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
Cat Lovers 3005 BROOKSIDE DR $$$$$$$$$$$$$$K005$$$$$$$$$$MY&34&~ $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

 

City/State Match Codes

 

Source City State Match Code
Customer List Guin AL FP$$$$$&W$$$$$$
Cat Lovers GUIN AL FP$$$$$&W$$$$$$

 

Well, that should make this easy! I won’t have to fiddle with the case or spelling of the text values - I can use the match codes and leave the data as-is. This would be easier if every row already included a match code, though. PROC DQMATCH can accomplish that pretty quickly:

 

proc dqmatch data=dq.customers out=work.customers(keep=CustomerID 
                              Name Street1 City State Zip MatchID)
   matchcode=matchID  locale='ENUSA';
   criteria var=Street1 matchdef='Address' ;
   criteria var=City matchdef='City' ;
   criteria var=State matchdef='City - State/Province - Postal Code';
run;

proc dqmatch data=dq.catlovers out=work.catlovers(keep=Name Address
                                            City State Zip MatchID)
   matchcode=matchID  locale='ENUSA';
   criteria var=Address matchdef='Address';
   criteria var=City matchdef='City';
   criteria var=State matchdef='City - State/Province - Postal Code';
run;

Finally, a join on the MatchID column should yield my coupon mailing list:

 

proc FedSQL;
select cl.Name,cl.Address,cl.City,cl.State,cl.Zip 
from work.catlovers as cl
left join work.customers as c
on c.MatchID=cl.MatchID
where CustomerID is null;
quit;

Results:

 

Name Address City State Zip Code
KRIS RENASCIMENTO 2841 LAKELAND PARK DR GROTON CT 06340
SAM CHARLES 3750 MULBERRY AVE GROTON CT 06340
DICK INDICADO 3764 KERRY WAY GROTON CT 06340
JIM FRANCÊS 4981 DAVIS AVE GROTON CT 06340
ANNA LIMÃO 4573 MORGAN ST GROTON CT 06340
TRIBBLE 4474 RINEHART RD GROTON CT 06340

 

Well, it looks like Tribble the cat is getting a coupon! At least she’s not a current customer...

 

What other fuzzy matching techniques have you tried in SAS? How do they compare to using SAS Data Quality match codes?

 

Until next time, may the SAS be with you!

Mark

 

PS: You can grab the code and data ZIP file for this blog series here. I'll update the file with each new release. 

 

Previous blogs in this series:

Part 1 – Fundamentals

Part 2 – Standardization

Part 3 – Fuzzy Matching with Match Codes - This One

Part 4 – Extracting Data from Text Files

 

 

Comments

Great walkthrough of fuzzy matching Mark! I also loved that TIGER MAN made the list. 😻

Version history
Last update:
‎12-07-2023 08:51 AM
Updated by:
Contributors

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags