Find match in 2 different column

cynthya · Posted 11-13-2013 11:32 AM

Hi,

I would like to add a new column that will compare my 2 first columns and say if Yes or No the value are the same.

I have some problem with those values, because they are the same, but with some little difference. I would like those values to say Yes in my new column.

M. Jaques Richard	Jaques Richard
DominiqueBlais	Dominique Blais
J. R Paquin	J R Paquin
Bertrand Raymond	Raymond Bertrand

Difference in row 1: M.

Difference in row 2: space

Difference in row 3: doth

Difference in row 4: inverse name

Thank you for your help and time.

art297 · Posted 11-13-2013 02:29 PM

You would need to decide the criteria for both calculating closeness and deciding if two names are close enough to be a match, but the following might give you a place to at least start:

data have;

informat name1 name2 $50.;

input name1 & name2 &;

cards;

M. Jaques Richard Jaques Richard

DominiqueBlais Dominique Blais

J. R Paquin J R Paquin

Bertrand Raymond Raymond Bertrand

;

data want;

set have;

if _n_ = 1 then call compcost('insert=',10,

'punctuation=',0,'DEL=',10,'r=', 12,'swap=',10);

reverse=catt(scan(name2,2),scan(name2,1));

ged=min(compged(name1,name2),compged(name1,reverse));

soundex=max(soundex(name1)=soundex(name2),

soundex(name1)=soundex(reverse));

if soundex or ged le 20 then match="yes";

else match="no";

run;

Vince28_Statcan · Posted 11-13-2013 02:51 PM

There are many different approaches to string matching and knowledge of your data is what will dictate the best combination of tools to use for the job at hand. I had to do similar name matching at work but under a different composition of names.

The end result was an iterative process that involved family name matching first and given name matching afterwards which is probably not applicable given your data seems to be a single string blob but most concepts can be extended.

Picture of an iterative attempt to match

1. exact match

2. right truncation or left trunction

3. Soundex (helps catch rare names that could've been typed or written by 2 different individuals say over a phone interview instead of self-response)

4. Levenstein or Generalized Edits Distance (GED is more complete but requires knowledge of your data before hand to be put to full use - Arthur examplified it with compged but you can play with GED parameters)

5. Creation of an alternate name list (like Robert = Bob = Bobby = Robby - I don't know if such a table was ever published by any of the north american NSOs that you could use directly)

6. Word for word search (aka if all words of left string can be found using findw or find functions in the right string or vice versa, you could accept the match - This would handle "Vincent Martin" matching "Martin Vincent")

7. Preprocess your string and do any/all of the above again (or preprocess from the get go - that is, strip dots, commas, apostrophes, special characters, convert everything to capital letters, compress the strings into a single word or set insert/deletion of blank spaces to a cost of 0 in comp GED if that's a frequent issue) - Possibly add additionnal rules specific to your data to remove short words like M. Mme. Mrs. from your strings or kill single letter words all together like "Vincent J. Martin" would get preprocessed to "VINCENTMARTIN")

A good practice is to add a class variable that indicates what particular method (exact/soundex/trunc/fuzz(ged or lev)/alt name/ etc.) so as to help debug false match when doing manual review of your algorithm at a given iteration. For instance, if you use Lev or GED, you use a treshold of "difference between the two string". Maybe a fixed threshold is fine, maybe you'd want it to adjust dynamically with the length of the shortest of the two string? Aka avoid Company ABC matching with company GBG because 2 swaps falls within your levenstein threshold.

Vince28_Statcan · Posted 11-14-2013 07:44 AM

I will also add,

There is a handful of string proximity algorithms existing. Jaro-Winkler which, to my knowledge is not implemented in SAS (at least not as a function but there might be open source macros or FCMP routines on the web) is the most broadly used one for name matching when doing probabilistic record linkage. It is not necessarily the most suitable to your data though. For example, we have a paper questionnaire capture system that uses a third party off-the-shelf software along with a handful of other field-specific rules to improve accuracy and one of the biggest caveats of scanning paper and using intelligent recognition are very specific frequent character transitions (think M<->N, L<->i (case dependant) etc.). In a case like that, there's hardly any gain in using JR instead or GED or Levenstein. However, there is a substantial gain in creating fewer than 26 (27 if you do a all non alpha characters like dots, commas, hyphens, spaces etc) character classes and then comparing words from an exact class-matching approach.

mfab · Posted 11-14-2013 09:18 AM

In addition to:

4. Levenstein or Generalized Edits Distance (GED is more complete but requires knowledge of your data before hand to be put to full use - Arthur examplified it with compged but you can play with GED parameters)

there are already built-in functions that you can easily use in your SAS-code:

COMPLEV: SAS(R) 9.2 Language Reference: Dictionary, Fourth Edition

COMPGED: SAS(R) 9.2 Language Reference: Dictionary, Fourth Edition

That should provide a reasonable starting point and you can work your way from there 🙂