How to perform a fuzzy prxmatch or fuzzy index search

jjknknl · Posted 02-12-2020 04:19 PM

I have a variable that contains free text entered by users, and I need to know which entries contain a particular text string, allowing for slight misspellings (for example, allowing the total number of insertions, deletions, or replacements to be less than N). The COMPLEV function only seems to compare two whole strings, and the prxmatch and index functions don't seem to allow fuzzy matching like this (i.e., I would have to specify every pattern I was willing to accept). What is the easiest way for me to accomplish this?

For example, say I have the following dataset s1:

data s1;
length text $500;
input text &;
id = _n_;
datalines;
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium
;
run;

And say I want to search the "text" field to see which rows contain the string "edipiscing", allowing for slight spelling differences: for example, at most 1 character insertion, deletion, or replacement.

I could use prxmatch like this:

proc sql;
select *
from s1
where prxmatch('/edipiscing/i', text)>0
;
quit;

But it would not find a match in the first row, because that row contains "adipiscing", which differs from the search string by one character replacement (the first letter). I could do:

proc sql;
select *
from s1
where prxmatch('/[a-z]dipiscing/i', text)>0
;
quit;

But I don't want to have to specify all possible patterns. Is there a SAS function that searches for the presence of a text string allowing for fuzzy matches?

brantk · Posted 02-12-2020 05:10 PM

Hi jjknknl,

This document may help if you are using SAS functions: https://www.sas.com/content/dam/SAS/en_ca/User%20Group%20Presentations/TASS/fogarasi_fuzzy_matching....

If you have SAS Data Quality, you can refer to this document. See PROC DQMATCH and the DQMATCH function: https://go.documentation.sas.com/?cdcId=dqcdc&cdcVersion=3.4&docsetId=dqclref&docsetTarget=titlepage...
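If only Base SAS functions are available, one possible workaround is to slide a window across the text and compare each window to the search string with COMPLEV. The sketch below is not taken from either linked document. It assumes the s1 dataset from the question. The macro variables pattern and maxedits, the output dataset fuzzy_hits, and the variable match_pos are hypothetical names chosen for illustration.

```sas
/* Hypothetical parameters: search string and maximum allowed edits */
%let pattern  = edipiscing;
%let maxedits = 1;

data fuzzy_hits (drop=pat plen tlen start wlen dist);
  set s1;
  length pat $200;
  pat  = lowcase("&pattern");   /* compare in lower case on both sides */
  plen = length(pat);
  tlen = length(text);
  match_pos = 0;                /* start of the first approximate match, 0 if none */

  /* Try every start position until a match is found */
  do start = 1 to tlen while (match_pos = 0);
    /* A match within &maxedits edits can be up to &maxedits characters
       shorter or longer than the search string, so try each length */
    do wlen = max(1, plen - &maxedits) to plen + &maxedits while (match_pos = 0);
      if start + wlen - 1 <= tlen then do;
        /* COMPLEV returns the Levenshtein edit distance between the two strings */
        dist = complev(lowcase(substr(text, start, wlen)), pat);
        if dist <= &maxedits then match_pos = start;
      end;
    end;
  end;

  if match_pos > 0;             /* keep only rows with an approximate match */
run;
```

With the sample data, this should keep only the first row, since "adipiscing" is one replacement away from "edipiscing". The scan calls COMPLEV up to (2 * maxedits + 1) times per character position, so it may be slow on large tables or long text; splitting the text into words with SCAN and comparing word by word is a cheaper alternative when the search string is a single word.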
