Hello takheeljeff,
I think that "data _null_" gave you a good advice.
Here is a little example that shows you how to implement the matching:
data T01_input;
length name $20;
infile cards delimiter=',';
input id name;
cards;
1,spiderman
2,superman
3,wonder woman
4,wonder woman
5,felix the cat
6,felix le chat
7,mickey mouse
8,samsam
;
run;
proc sql;
create table T02_matches as
select A.id as id1, A.name as name1, B.id as id2, B.name as name2,spedis(A.name,B.name) as distance
from T01_input A, T01_input B
where spedis(A.name,B.name) le 50 and A.id ne B.id;
quit;
It will give the following result:
id1=1 name1=spiderman id2=2 name2=superman distance=27
id1=2 name1=superman id2=1 name2=spiderman distance=25
id1=3 name1=wonder woman id2=4 name2=wonder woman distance=0
id1=4 name1=wonder woman id2=3 name2=wonder woman distance=0
id1=5 name1=felix the cat id2=6 name2=felix le chat distance=19
id1=6 name1=felix le chat id2=5 name2=felix the cat distance=19
Pay attention to the following remarks:
- in order to make the comparisons, we are using a "cartesian product". Take a look at the where clause: it will evaluate all possible pairs of records and will consider as valid matches those that have a spedis distance less than 50. Watch out: if your sample is big, this can imply a lot of comparisons. For example, for a sample containing 1000 records, you will make 1000*1000 comparisons, ie, 1Million comparisons. Tha'ts a lot. In order to reduce the number of comparisons, it is common to "add something" to the where clause. For example, you could decide to restrict comparisons for records whose addresses have the same zip codes. This is called "blocking".
- it is up to you to put a threshold on the distance (I took 50 ... but you could use something else). For very large datasets, it makes sense to do "some learning", ie, you take a subset of comparisons and you decide whether each of these comparisons are correct or false matches; based on that experience, you set the threshold. You could even have two threshold: 1 for "I'm sure", 1 for "To be checked" and 1 for "incorrect match". There are algorithms for the learning part ... but that's another story.
- watch out for false matches. You'll get some. For example: spiderman is not the same as superman.
You could base your matching on words instead of strings.
I hope this will be helpful to you,
Yoba
Message was edited by: yoba