Solved
Contributor
Posts: 27

coding fuzzy expression

hi all.

i would like to recode a list of mistyiped string expression in the correct expression.

for istance to recode mistyped names of cities (now yrk, news yurk, new yorkk) into another variable with the correct name (new york).

of course i know the use of "if-then" or proc sql "%" statements, but i would like to use fuzzy logic statements, if they are possible in SAS.

thanks a lot

Accepted Solutions
Solution
‎09-13-2013 07:55 AM
Super Contributor
Posts: 312

Re: coding fuzzy expression

Hi Progster,

The following is an example of what Vince is talking about in respect to COMPGED.  I used it ages ago as a POC for something at work and have adapted it to the scenario you discussed.  I originally obtained the code from the paper http://www.nesug.org/proceedings/nesug07/ap/ap23.pdf if you are interested in learning more.

DATA STATELIST;

STATE = 'NEW YORK';

OUTPUT;

STATE = 'FLORIDA';

OUTPUT;

STATE = 'TEXAS';

OUTPUT;

STATE = 'OHIO';

OUTPUT;

RUN;

DATA STRINGS;

LENGTH STRING \$8;

STRING = 'NEWYORK';

OUTPUT;

STRING = 'FLIRIDA';

OUTPUT;

STRING = 'TEXASS';

OUTPUT;

STRING = 'OHIOS';

OUTPUT;

STRING = 'NEW YORK';

OUTPUT;

STRING = 'FLORIDA';

OUTPUT;

STRING = 'TEXAS';

OUTPUT;

STRING = 'OHIO';

OUTPUT;

RUN;

%LET MAXSCORE=150;

PROC SQL;

CREATE TABLE MATCHES_SQL AS

SELECT A.STATE, B.STRING AS JOKE2,

COMPGED(A.STATE,B.STRING,&MAXSCORE,'IL' ) AS GEDSCORE

FROM STATELIST A, STRINGS B

WHERE CALCULATED GEDSCORE < &MAXSCORE

ORDER BY CALCULATED GEDSCORE;

QUIT;

All Replies
Super Contributor
Posts: 339

Re: coding fuzzy expression

There are multiple approaches to this depending on how big your set is and whether you know all possible errors (or almost) or not.

If you are aware or have a list of all possible errors, you could either hard code cases with perl regular expression logic (avoids using tons of embeded ifs) but if you have a wide range of city names and an even wider list of possible errors, this won't be of much help.

As far as I am aware, there are no "fuzzy-matching" function per say in SAS. However, there are tools to build your own. Namely, look for proc COMPLEV and COMPGED (levenstein and generalized edit distances).

If you know of a .JAR (java) implemented fuzzy-matching algorithm that's availible online, you can consider downloading it and using java objects to run your city names through the java fuzzy-matching algorithm. It is slightly more complicated than regular sas syntax programatically but well worth the learning.

I believe someone pointed out that there was a new operator or function for "sounds like" in a different thread but I believe this was for sas 9.4 and I sadly don't remember the syntax to point to you what to look into. Similarly (and it may or may not use the same sound encoding algorithm), there exists the SOUNDEX function which is english biased so unless you have international cities or that most entries were done by foreign residents/immigrants with only little english knowledge, it could be a useful tool.

I had built a small proc fcmp routine that took an array _temporary_ (as a dictionary of words) as an input as well as the words that I was trying to fuzzy match that would output a comma delimited list of all words in the dictionary with the least COMPLEV distance - just pointing out a tool and approach which you could implement your fuzzy match as a function that uses basically all but the java objects.

That's all I can think of

Good luck!

Vince

Solution
‎09-13-2013 07:55 AM
Super Contributor
Posts: 312

Re: coding fuzzy expression

Hi Progster,

The following is an example of what Vince is talking about in respect to COMPGED.  I used it ages ago as a POC for something at work and have adapted it to the scenario you discussed.  I originally obtained the code from the paper http://www.nesug.org/proceedings/nesug07/ap/ap23.pdf if you are interested in learning more.

DATA STATELIST;

STATE = 'NEW YORK';

OUTPUT;

STATE = 'FLORIDA';

OUTPUT;

STATE = 'TEXAS';

OUTPUT;

STATE = 'OHIO';

OUTPUT;

RUN;

DATA STRINGS;

LENGTH STRING \$8;

STRING = 'NEWYORK';

OUTPUT;

STRING = 'FLIRIDA';

OUTPUT;

STRING = 'TEXASS';

OUTPUT;

STRING = 'OHIOS';

OUTPUT;

STRING = 'NEW YORK';

OUTPUT;

STRING = 'FLORIDA';

OUTPUT;

STRING = 'TEXAS';

OUTPUT;

STRING = 'OHIO';

OUTPUT;

RUN;

%LET MAXSCORE=150;

PROC SQL;

CREATE TABLE MATCHES_SQL AS

SELECT A.STATE, B.STRING AS JOKE2,

COMPGED(A.STATE,B.STRING,&MAXSCORE,'IL' ) AS GEDSCORE

FROM STATELIST A, STRINGS B

WHERE CALCULATED GEDSCORE < &MAXSCORE

ORDER BY CALCULATED GEDSCORE;

QUIT;

🔒 This topic is solved and locked.