Hello,
I have been researching various operations that can be performed on textual data in SAS Data Quality and am particularly interested in the Matching operation. While going through the DataFlux documentation, I have gained a fair understanding of concepts like match definitions, schemes, and chop tables.
However, I am curious about the fuzzy matching techniques used to generate matchcodes. Specifically, I would like to understand:
What fuzzy matching techniques or algorithms are used in SAS Data Quality for matchcode generation?
Are these techniques based on phonetic algorithms (e.g., Soundex) or string similarity measures (e.g., Levenshtein distance, Jaro-Winkler)?
I understand the what of match definitions and matchcodes, but now I am keen to dive deeper into the how. Any guidance or references would be greatly appreciated!
Thank you in advance!
Hi
I hope you are still interested in this topic.
The QKB (Quality Knowledge Base) is used to generate the fuzzy codes. It is a collection of files: definitions, schemes, chop tables, phonetic libraries, regex libraries, vocabularies, and grammars. These files can be edited in DataFlux Data Management Studio. If the out-of-the-box rules don't meet your organisation's needs, you can add or edit files. Personally, I struggled with the vocabulary and grammar files.
Match code generation follows a series of steps. Each step uses the rules from those files to tidy the string: removing noise, standardizing, normalizing, applying phonetic reduction, and finally building a match code layout. It is a lot more than just the Soundex function.
QKB Definition Steps
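To make the flow of steps a bit more concrete, here is a minimal Python sketch of the general idea. It is not the QKB's actual logic: the noise words, the standardization table, and the phonetic rules are made-up stand-ins for what the real QKB files contain, and `toy_match_key` is a hypothetical name.

```python
import re

# Hypothetical stand-ins for QKB content; the real files are far richer.
NOISE_WORDS = {"mr", "mrs", "dr", "inc", "ltd"}
STANDARD_FORMS = {"bob": "robert", "bill": "william"}
PHONETIC_RULES = [("ph", "f"), ("ck", "k"), ("z", "s")]


def toy_match_key(value: str) -> str:
    """Build a toy, unencoded match key from a free-text value."""
    # 1. Normalize: lowercase and replace anything that is not a letter or space.
    text = re.sub(r"[^a-z\s]", " ", value.lower())
    # 2. Remove noise words.
    tokens = [t for t in text.split() if t not in NOISE_WORDS]
    # 3. Standardize known variants to one form.
    tokens = [STANDARD_FORMS.get(t, t) for t in tokens]
    # 4. Phonetic reduction: apply the rules, drop non-leading vowels
    #    (y is treated as a vowel here), and collapse doubled letters.
    reduced = []
    for t in tokens:
        for src, dst in PHONETIC_RULES:
            t = t.replace(src, dst)
        t = t[0] + re.sub(r"[aeiouy]", "", t[1:])
        t = re.sub(r"(.)\1+", r"\1", t)
        reduced.append(t)
    # 5. Layout: join the reduced tokens into one upper-case key.
    return " ".join(reduced).upper()


print(toy_match_key("Mr. Bob Smith"))  # RBRT SMTH
print(toy_match_key("Robert Smyth"))   # RBRT SMTH
```

Two differently written values end up with the same key, so fuzzy matching becomes an exact comparison of keys.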
The sensitivity defines the number of characters used to create a fuzzy code.
Sensitivity
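As a rough illustration of how sensitivity can control the number of characters kept, here is a small Python sketch. The mapping from sensitivity to character count is purely an assumption for the example (the real mapping lives in the match definition), `apply_sensitivity` is a hypothetical name, and I am assuming the usual sensitivity range of 50 to 95.

```python
def apply_sensitivity(key: str, sensitivity: int) -> str:
    """Truncate each token of a toy key; higher sensitivity keeps more characters."""
    if not 50 <= sensitivity <= 95:
        raise ValueError("sensitivity is assumed to lie between 50 and 95")
    # Hypothetical mapping: 50 keeps 2 characters per token, 95 keeps 6.
    keep = 2 + (sensitivity - 50) // 10
    return " ".join(token[:keep] for token in key.split())


# At low sensitivity the two keys collapse to the same value.
print(apply_sensitivity("JOHNSON", 50), apply_sensitivity("JOHNSTON", 50))  # JO JO
# At high sensitivity they stay distinct.
print(apply_sensitivity("JOHNSON", 95), apply_sensitivity("JOHNSTON", 95))  # JOHNSO JOHNST
```

The trade-off is the usual one: a lower sensitivity finds more matches but also more false matches.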
Finally, the match code is an unencoded string that is converted to its encoded fuzzy string. The encoding logic is hidden, but the node generates it based on the characters.
Encoding
These steps are followed for each value in the column to generate its match code. For a given data value, the match code does not change unless the QKB version changes or its rules are edited. I hope this gives some understanding of how match codes are generated.
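Because the same value always produces the same code, matching comes down to grouping records that share a code. A minimal Python sketch of that idea follows; `toy_key` is a hypothetical, heavily simplified key function and not the SAS algorithm.

```python
import re
from collections import defaultdict


def toy_key(value: str) -> str:
    """Hypothetical key: first letter plus later consonants, doubled letters collapsed."""
    letters = re.sub(r"[^a-z]", "", value.lower())
    if not letters:
        return ""
    key = letters[0] + re.sub(r"[aeiouy]", "", letters[1:])
    return re.sub(r"(.)\1+", r"\1", key).upper()


def cluster_by_key(values):
    """Group values whose keys are identical."""
    clusters = defaultdict(list)
    for value in values:
        clusters[toy_key(value)].append(value)
    return dict(clusters)


clusters = cluster_by_key(["Smith", "Smyth", "SMITH.", "Jones"])
print(clusters)  # {'SMTH': ['Smith', 'Smyth', 'SMITH.'], 'JNS': ['Jones']}
```

The key function is deterministic, so rerunning it on the same data gives the same clusters as long as the rules stay unchanged.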
If you are interested in learning about and editing those files, there are QKB courses on SAS Learning.
Wish you luck.
Rama
It's also worth understanding that SAS Data Quality's QKBs are country-specific. If you are trying to match addresses in, say, India, then you need the Indian QKB, which is customised to deal with the unique Indian address formatting.