01-30-2018 06:47 AM
I have 2 data sets that i need to merge. The merging variable is character variable with length of 100. One variable name (name1) in dataset one is similar dataset two , but not exact name (like name2). please check the attachment to see the data. Any suggestions is appreciated.
I have searched for options online and wasn't clear of what functions yield best results for my data. Please drop a line or two explaining what you are doing and what is the rationale behind it.
thanks in advance.
01-30-2018 07:54 AM
Look at the use of phonetic encoding algorithms or edit distance style measures which calculate the cost of turning string 1 into string 2. In SAS these are the soundex function and complev which calculates edit distance. If you have SAS Data Management you can also use the DQMATCH procedure to create a matchcode in DataFlux which performs a similar phonetically encoded string. It's typically a good idea to clean your data beforehand as well. Do things like;
- make everything the same case
- Remove punctuation if it is not important
- Compress spaces etc. if not important. Look at the compress function for this.
- Remove repeating terms and letters using compbl if applicable.
- Standardise common terms such as company prefixes etc.
Once that is done, then the method is to create a new variable using soundex and merge by that new variable then use complev to calculate the similarity. The lower the score the closer they match.
01-30-2018 08:01 AM
That's not so easy. You basically need to do an outer join, on the condition that one or more words from one name is in the other name, or vice versa. Here is a possible solution that uses a datastep to create the outer product (by reading every observation from the second dataset with POINT=) and compare:
data one; input name1 $40. /amount1; cards; wwwamazoncom 100.5 toysrus 50.25 OLIVE GARDEN 61.85 walMart 86.24 ;run; data two; input name2 $40. /amount2; cards; US AMAZON AR USA 25.68 online toysrus newjersey us 126.98 ORDER olivegarden Washington DC 29.99 us wwwwalmartcom toys texas 75.86 ;run; data want; set one; score=0; length common $60; do _N_=1 to nobs; set two nobs=nobs point=_N_; common=' '; score=0; do i=1 to countw(name1); if length(scan(name1,i))<3 then continue; if find(name2,scan(name1,i),'i') then do; score=score+1; if not findw(common,scan(name1,i),' ','i') then call catx(' ',common,scan(name1,i)); end; end; do i=1 to countw(name2); if length(scan(name2,i))<3 then continue; if find(name1,scan(name2,i),'i') then do; score=score+1; if not findw(common,scan(name2,i),' ','i') then call catx(' ',common,scan(name2,i)); end; end; if score>0 then output; end; run;
Note that one pair of values were joined just on the word "toys", you may want to increase the length value in the line with "continue".
01-31-2018 07:36 PM
Thank you s_lassen and foobarbaz for replying. The solution given by S-lassen only applies to this sample data and we all know real world data is messy. But, I got valuable information from both of you and Thanks again.