DATA Step, Macro, Functions and more

Fuzzy Matching

Reply
Contributor
Posts: 32

Fuzzy Matching

Hi,

 

Hopefully somebody can help me. I have 2 data tables. Table one has a singular column that contains a list of approx 10,000 business names such as "AstraZeneca Group", "Zurich Insurance Group", "Virgin Money PLC", etc. Table 2 contains a column that holds free text. From the column in table 2, I want to run a fuzzy match against the data stored in table 1. Therefore, if there are occurrences of "AstraZeneca Group" in table 2 in the free text column then create a new column "AstraZeneca" with a value of 1. Is there a slick and efficient way of doing this that can screen the free text column (in table 2) against a large list of variables? I am using Base SAS.

 

Thanks for your support.

 

Regards

Chris

 

Frequent Contributor
Posts: 93

Re: Fuzzy Matching

Here is a start.  I would try changing all to lower case and also consider soundex for misspellings.

 

data have;
infile cards delimiter=',' missover;
length ref $ 1000;
input ref $;
cards;
AstraZeneca Group,
MyGroup Rules
;
run;

data want;
infile cards delimiter=',' missover;  
length ref $ 1000;
input ref $;
cards;
I love the AstraZeneca Group!!!!!  It is so awesome!,
Oh yeah I totally love the AstraZeneca Group they rule!,
Have you ever tried MyGroup Rules?  It is sweeeeeeeet!
;

PROC SQL;
CREATE TABLE JOINS AS 
SELECT HAVE.REF AS REF1, WANT.REF AS REF2 FROM HAVE
LEFT JOIN WANT ON TRIM(WANT.REF) CONTAINS TRIM(HAVE.REF);
RUN;
PROC Star
Posts: 7,360

Re: Fuzzy Matching

There are two parts to your question. First, parsing nouns and noun phrases from text. I think you'll have to go outside of SAS for that part, namely to use one of the Natural Language Processing (NLP) packages. Take a look at: http://stackoverflow.com/questions/10974532/extracting-noun-phrases-from-a-text-file-using-stanford-...

 

As for the second part, I have found the compged function to be the most powerful for such tasks. You can find one example of how you can incorporate it in your code at: http://www.sascommunity.org/wiki/Expert_Panel_Solution_MWSUG_2013-Tabachneck

 

Art, CEO, AnalystFinder.com

 

Ask a Question
Discussion stats
  • 2 replies
  • 159 views
  • 0 likes
  • 3 in conversation