fuzzy data merge or join

Reply
Frequent Contributor
Posts: 110

fuzzy data merge or join

SAS gurus,

            I have 2 data sets that i need to merge. The merging variable is character variable with length of 100. One variable name (name1) in dataset one is similar dataset two , but not exact name (like name2). please check the attachment to see the data. Any suggestions is appreciated.
         I have searched for options online and wasn't clear of what functions yield best results for my data. Please drop a line or two explaining what you are doing and what is the rationale behind it. 

 

thanks in advance.

Occasional Contributor
Posts: 13

Re: fuzzy data merge or join

Hi,

Look at the use of phonetic encoding algorithms or edit distance style measures which calculate the cost of turning string 1 into string 2.  In SAS these are the soundex function and complev which calculates edit distance.  If you have SAS Data Management you can also use the DQMATCH procedure to create a matchcode in DataFlux which performs a similar phonetically encoded string.  It's typically a good idea to clean your data beforehand as well.  Do things like;

 - make everything the same case

 - Remove punctuation if it is not important

 - Compress spaces  etc. if not important.  Look at the compress function for this.

 - Remove repeating terms and letters using compbl if applicable.

 - Standardise common terms such as company prefixes etc.

 

Once that is done, then the method is to create a new variable using soundex and merge by that new variable then use complev to calculate the similarity.  The lower the score the closer they match.

 

HTH.

Cameron

Regards,
Cameron | Selerity
PROC Star
Posts: 167

Re: fuzzy data merge or join

That's not so easy. You basically need to do an outer join, on the condition that one or more words from one name is in the other name, or vice versa. Here is a possible solution that uses a datastep to create the outer product (by reading every observation from the second dataset with POINT=) and compare:

data one;
input name1 $40. /amount1;
cards;
wwwamazoncom
100.5
toysrus
50.25
OLIVE GARDEN
61.85
walMart
86.24
;run;

data two;
input name2 $40. /amount2;
cards;
US AMAZON AR USA
25.68
online toysrus newjersey us
126.98
ORDER olivegarden Washington DC
29.99
us wwwwalmartcom toys texas
75.86
;run;

data want;
  set one;
  score=0;
  length common $60;
  do _N_=1 to nobs;
    set two nobs=nobs point=_N_;
    common=' ';
    score=0;
    do i=1 to countw(name1);
      if length(scan(name1,i))<3 then continue;
      if find(name2,scan(name1,i),'i') then do;
        score=score+1;
        if not findw(common,scan(name1,i),' ','i') then
          call catx(' ',common,scan(name1,i));
        end;
      end;
  do i=1 to countw(name2);
      if length(scan(name2,i))<3 then continue;
      if find(name1,scan(name2,i),'i') then do;
        score=score+1;
        if not findw(common,scan(name2,i),' ','i') then
          call catx(' ',common,scan(name2,i));
        end;
      end;
    if score>0 then output;
    end;
run;

Note that one pair of values were joined just on the word "toys", you may want to increase the length value in the line with "continue".

Frequent Contributor
Posts: 110

Re: fuzzy data merge or join

Thank you s_lassen and foobarbaz for replying. The solution given by S-lassen only applies to this sample data and we all know real world data is messy. But, I got valuable information from both of you and Thanks again. 

Ask a Question
Discussion stats
  • 3 replies
  • 101 views
  • 0 likes
  • 3 in conversation