Computers demand standardization and precision. So, processing text data with a computer is dicey at best. The more humans get involved in entering or generating text data, the more problematic it can be. Humans can clearly see that John Q. Publix of Williamsburg, Virginia is the same as Jon Quincey Public of Wmbg, VA. But a computer will have none of that – it knows full well that those two text strings are not the same! SAS Data Quality match codes can provide a quick and easy solution.
Let’s imagine you own a company that sells cat food. To increase sales, you intend to solicit new customers by mailing them a coupon for a free sample. You decide to buy the Cat Lover’s Magazine mailing list as a resource because mailings are expensive, and sending random solicitations to folks who don’t own a cat is a waste of resources. But it’s wasteful to solicit existing customers, too. How can we compare the mailing list to our existing customer list so that we can solicit just the new people? Here is a sample of the data:
Customers
Name | Street Address | City | State |
Catherine M. Thomas | 3005 Brookside Drive | Guin | AL |
János L. von Neumann | 4026 Maple Lane | Huntsville | AL |
Sam C. Tennyson | 2715 Mulberry Avenue | Donaldson | AR |
Kristi R. Curry | 399 Masonic Hill Road | Little Rock | AR |
Wanda R. Smith | 2190 Cedar Street | Pine Bluff | AR |
Cat Lovers Mailing List
Name | Address | City | State |
BEAUTIFUL KITTY | 3005 BROOKSIDE DR | GUIN | AL |
JOHN VON NEUMANN | 4026 MAPLE LN | HUNTSVILLE | AL |
SAM CHARLES | 3750 MULBERRY AVE | GROTON | CT |
CHRISTY CURRY | 399 MASONIC HILL RD | LITTLE ROCK | AR |
TIGER MAN | 2190 CEDAR ST | PINE BLUFF | AR |
The computer wouldn’t match any of those rows, yet I can see some matches:
I think I should exclude names from the matching criteria. I’ll match households, using address, city, and state as the keys. But the mailing list is all upper case and uses standard abbreviations like “ST” for “Street” and “AVE” for “Avenue”. My customer data is in proper case and spells out those words. This could be a coding nightmare. SAS Data Quality to the rescue!
I’d like to use the ‘Address’ and ‘City - State/Province - Postal Code’ match codes to ensure I select only those Cat Lovers Magazine subscribers who are not in my customer data. What does a match code look like? This code will give me a peek:
proc SQL;
title "Address Match Codes";
select 'Customer List' as Source
,Street1,dqmatch(Street1,'Address') 'Match Code'
from dq.customers (obs=1)
union all
select 'Cat Lovers',Address,dqmatch(Address,'Address')
from dq.CatLovers (obs=1);
title "City/State Match Codes";
select 'Customer List' as Source,City, State
,dqmatch(CATX(', ',City, State)
,'City - State/Province - Postal Code') 'Match Code'
from dq.customers (obs=1)
union all
select 'Cat Lovers',City, State
,dqmatch(CATX(', ',City, State)
,'City - State/Province - Postal Code')
from dq.CatLovers (obs=1);
quit;
Results:
Address Match Codes
Source | Street Address | Match Code |
Customer List | 3005 Brookside Drive | $$$$$$$$$$$$$$K005$$$$$$$$$$MY&34&~ $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$ |
Cat Lovers | 3005 BROOKSIDE DR | $$$$$$$$$$$$$$K005$$$$$$$$$$MY&34&~ $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$ |
City/State Match Codes
Source | City | State | Match Code |
Customer List | Guin | AL | FP$$$$$&W$$$$$$ |
Cat Lovers | GUIN | AL | FP$$$$$&W$$$$$$ |
Well, that should make this easy! I won’t have to fiddle with the case or spelling of the text values - I can use the match codes and leave the data as-is. This would be easier if every row already included a match code, though. PROC DQMATCH can accomplish that pretty quickly:
proc dqmatch data=dq.customers out=work.customers(keep=CustomerID
Name Street1 City State Zip MatchID)
matchcode=matchID locale='ENUSA';
criteria var=Street1 matchdef='Address' ;
criteria var=City matchdef='City' ;
criteria var=State matchdef='City - State/Province - Postal Code';
run;
proc dqmatch data=dq.catlovers out=work.catlovers(keep=Name Address
City State Zip MatchID)
matchcode=matchID locale='ENUSA';
criteria var=Address matchdef='Address';
criteria var=City matchdef='City';
criteria var=State matchdef='City - State/Province - Postal Code';
run;
Finally, a join on the MatchID column should yield my coupon mailing list:
proc FedSQL;
select cl.Name,cl.Address,cl.City,cl.State,cl.Zip
from work.catlovers as cl
left join work.customers as c
on c.MatchID=cl.MatchID
where CustomerID is null;
quit;
Results:
Name | Address | City | State | Zip Code |
KRIS RENASCIMENTO | 2841 LAKELAND PARK DR | GROTON | CT | 06340 |
SAM CHARLES | 3750 MULBERRY AVE | GROTON | CT | 06340 |
DICK INDICADO | 3764 KERRY WAY | GROTON | CT | 06340 |
JIM FRANCÊS | 4981 DAVIS AVE | GROTON | CT | 06340 |
ANNA LIMÃO | 4573 MORGAN ST | GROTON | CT | 06340 |
TRIBBLE | 4474 RINEHART RD | GROTON | CT | 06340 |
Well, it looks like Tribble the cat is getting a coupon! At least she’s not a current customer...
What other fuzzy matching techniques have you tried in SAS? How do they compare to using SAS Data Quality match codes?
Until next time, may the SAS be with you!
Mark
PS: You can grab the code and data ZIP file for this blog series here. I'll update the file with each new release.
Part 3 – Fuzzy Matching with Match Codes - This One
Part 4 – Extracting Data from Text Files
Great walkthrough of fuzzy matching Mark! I also loved that TIGER MAN made the list. 😻
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.