saunvida
Fluorite | Level 6

Hello,

I have been researching various operations that can be performed on textual data in SAS Data Quality and am particularly interested in the Matching operation. While going through the DataFlux documentation, I have gained a fair understanding of concepts like match definitions, schemes, and chop tables.

 

However, I am curious about the fuzzy matching techniques used to generate matchcodes. Specifically, I would like to understand:

What fuzzy matching techniques or algorithms are used in SAS Data Quality for matchcode generation?
Are these techniques based on phonetic algorithms (e.g., Soundex) or string similarity measures (e.g., Levenshtein distance, Jaro-Winkler)?

 

I understand the what of match definitions and matchcodes, but now I am keen to dive deeper into the how. Any guidance or references would be greatly appreciated!

Thank you in advance!

2 REPLIES
Rama_V
Obsidian | Level 7

Hi 

I hope you are still interested in this topic.

The QKB (Quality Knowledge Base) is used to generate the fuzzy codes. It is a collection of definitions, schemes, chop tables, phonetics libraries, regex libraries, vocabularies, and grammars. These files can be edited in the DataFlux Data Management Studio application. If the out-of-the-box rules don't meet your organisation's needs, you can add or edit files. Personally, I struggled with the vocabulary and grammar files.

 

Match code generation follows a series of steps. At each step, the rules from those files tidy the string: they remove noise, standardize and normalize the value, apply phonetic reduction, and finally build a matchcode layout. It is a lot more than just the Soundex function.

[Image: QKB Definition Steps]
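
To make the sequence of steps more concrete, here is a minimal Python sketch of a toy pipeline. It is not the SAS or DataFlux implementation: the function name `toy_matchcode`, the rule tables, and the `$` separator are all hypothetical, and a real QKB uses much richer, locale-specific rules.

```python
import re

# Hypothetical rule tables (assumptions for illustration only).
NOISE_WORDS = {"mr", "mrs", "dr", "inc", "ltd", "the"}
STANDARD_FORMS = {"bob": "robert", "bill": "william", "corp": "corporation"}
PHONETIC_RULES = [("ph", "f"), ("ck", "k"), ("z", "s")]


def toy_matchcode(value: str) -> str:
    """Toy match code: tokenize, remove noise, standardize, reduce, lay out."""
    # 1. Tokenize and normalize case and punctuation.
    tokens = re.findall(r"[a-z]+", value.lower())
    # 2. Remove noise words.
    tokens = [t for t in tokens if t not in NOISE_WORDS]
    # 3. Standardize known variants.
    tokens = [STANDARD_FORMS.get(t, t) for t in tokens]
    # 4. Phonetic reduction.
    reduced = []
    for t in tokens:
        for old, new in PHONETIC_RULES:
            t = t.replace(old, new)
        t = t[0] + re.sub(r"[aeiouy]", "", t[1:])  # drop non-leading vowels
        t = re.sub(r"(.)\1+", r"\1", t)            # collapse doubled letters
        reduced.append(t)
    # 5. Layout: sort tokens so word order does not matter, then join.
    return "$".join(sorted(reduced)).upper()


print(toy_matchcode("Mr. Bob Philips"))
print(toy_matchcode("Robert Phillips"))
```

Both calls produce the same code even though the input strings differ, which is the effect the real definitions aim for with far more sophisticated rules.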

 

The sensitivity defines the number of characters used to create a fuzzy code. 

[Image: Sensitivity]
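
The effect of sensitivity can be sketched as truncation of already-reduced tokens. This is only an illustration: the function name `apply_sensitivity` and the mapping from a sensitivity value to a character count are assumptions, not the rule the QKB actually applies.

```python
def apply_sensitivity(reduced_tokens, sensitivity=85):
    """Keep more characters per token as sensitivity rises (hypothetical mapping)."""
    if not 50 <= sensitivity <= 95:
        raise ValueError("sensitivity is assumed to lie between 50 and 95")
    keep = 2 + (sensitivity - 50) // 10  # assumed: 50 -> 2 chars, 95 -> 6 chars
    return "$".join(token[:keep] for token in reduced_tokens)


# Low sensitivity: few characters kept, so near-misses share a code.
print(apply_sensitivity(["JHNSN"], 60), apply_sensitivity(["JHNSTN"], 60))
# High sensitivity: more characters kept, so the same values are told apart.
print(apply_sensitivity(["JHNSN"], 95), apply_sensitivity(["JHNSTN"], 95))
```

In this toy version, a lower sensitivity gives looser matching and a higher sensitivity gives stricter matching.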

Finally, the matchcode starts as an unencoded string, which is then converted to an encoded fuzzy string. The encoding logic is hidden, but the node generates it based on the characters.
[Image: Encoding]

These steps are followed for each value in the column to generate its match code. For a given data value, the match code does not change unless the QKB version changes or the rules are edited. I hope this gives some understanding of how match codes are generated.
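
Because a given value always yields the same match code, records can be matched by grouping on equal codes rather than by comparing every pair with a similarity measure. A minimal sketch of that idea follows. The helper `simple_code` is a hypothetical stand-in for a real match code, and `cluster_by_code` is an illustrative name, not a SAS function.

```python
from collections import defaultdict


def simple_code(value):
    """Hypothetical stand-in for a match code: keep letters, drop non-leading vowels."""
    letters = [c for c in value.upper() if c.isalpha()]
    return "".join(c for i, c in enumerate(letters) if i == 0 or c not in "AEIOUY")


def cluster_by_code(values, code_fn):
    """Group values whose codes are equal; return only groups with more than one member."""
    clusters = defaultdict(list)
    for value in values:
        clusters[code_fn(value)].append(value)
    return [group for group in clusters.values() if len(group) > 1]


print(cluster_by_code(["Jon Smith", "Jon Smyth", "Jane Doe"], simple_code))
```

The code is computed once per value, so the grouping scales roughly linearly with the number of records.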

If you are interested in learning about and editing those files, there are QKB courses on SAS Learning.
Wish you luck.

Rama

SASKiwi
PROC Star

It's also worth understanding that SAS Data Quality's QKBs are country-specific. If you are trying to match addresses in, say, India, then you need the Indian QKB, which is customised to deal with the unique Indian address formatting.
