128 W. Main Street, Noland, NW
128 West Main St., Noland, NW
These are the same address. I want to know whether SAS has any tool that can tell me they are the same. The addresses can be a lot more complicated than the ones above, and it is hard to come up with all the rules and put them into a program that matches them all.
I think that this can be achieved by formats. However, you still have to determine all rules to identify possible variations. For example, for street:
I'm not sure that SAS has that many built in tools (though it does have the functions that allow one to build them). However, there is lots of commercial software for doing "address normalization." I am familiar with them for US mailings (I've used this company, but there are others, http://www.melissadata.com/ ).
If you first ran all of your addresses through a program like the ones they sell, the matching process would be much simpler.
Can PROC GEOCODE do this kind of address standardization/correction/normalization? I don't have SAS 9.2 TS2M3, so I can't test it out.
According to some internet posts, the Google Maps API can do this task. SAS has a product called SAS Google Map Generator. I wonder whether this generator gives users access to Google Maps' address normalization function.
I have done this before. The basic procedure is, first of all, to separate the parts of the address into their own fields (i.e. street_address, city, state, zip). The easiest way to do this is with regular expressions.
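As a rough illustration of that first step, here is a minimal sketch in Python (not SAS; the same pattern could be adapted to the SAS PRX functions). It assumes every address looks like "street, city, state" with an optional ZIP after the state, which real data will often violate. The names ADDRESS_RE and split_address are made up for this example.

```python
import re

# Assumed layout: "<street>, <city>, <ST>" with an optional ZIP after the state.
# Real address data is messier; this pattern is only a starting point.
ADDRESS_RE = re.compile(
    r"^\s*(?P<street_address>[^,]+?)\s*,\s*"   # everything before the first comma
    r"(?P<city>[^,]+?)\s*,\s*"                 # text between the two commas
    r"(?P<state>[A-Za-z]{2})"                  # two-letter state code
    r"(?:\s+(?P<zip>\d{5}(?:-\d{4})?))?\s*$"   # optional 5-digit or ZIP+4 code
)

def split_address(raw):
    """Return a dict of address fields, or None if the layout is not recognized."""
    m = ADDRESS_RE.match(raw)
    return m.groupdict() if m else None

parts = split_address("128 W. Main Street, Noland, NW")
print(parts)
```

Addresses that return None are the ones that need a second pattern or a manual look.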
Secondly, you will need to standardize the wording in the addresses, e.g. replace every abbreviation with the full word. In your example above, if you come across "W" or "W." or "Wst" etc. in your street_address field, change them all to "West". The easiest way to get a list of common abbreviations is to 'tokenize' the entire address so that you get a frequency count of the words used in the dataset. Common abbreviations like Rd, Ln, St etc. will bubble to the top. You then build the mapping by hand, using whatever technique you like best.
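A minimal sketch of the tokenize-and-count idea, again in Python rather than SAS. The ABBREVIATIONS table is a hypothetical, hand-built mapping for illustration only; in practice it would come from inspecting the frequency counts of your own data.

```python
import re
from collections import Counter

def tokenize(address):
    # Lowercase, then keep runs of letters/digits; punctuation such as "." is dropped
    return re.findall(r"[a-z0-9]+", address.lower())

def token_frequencies(addresses):
    # Count how often each token occurs across the whole dataset
    counts = Counter()
    for address in addresses:
        counts.update(tokenize(address))
    return counts

# Hypothetical mapping built by hand after looking at the most frequent tokens.
# Caution: "st" can also mean "Saint", so a real mapping needs more context.
ABBREVIATIONS = {"w": "west", "wst": "west", "st": "street", "rd": "road", "ln": "lane"}

def standardize(address):
    # Replace each known abbreviation with its full word, leave other tokens alone
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokenize(address))

sample = ["128 W. Main Street", "128 West Main St.", "40 Oak Rd"]
freq = token_frequencies(sample)
print(freq.most_common(5))
print(standardize(sample[0]) == standardize(sample[1]))
```

With this mapping, the two street lines from the original question standardize to the same string.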
Lastly, you can use the SAS soundex() function to identify addresses that are the same but may contain typos or misspellings. For example, Main Street, Main Streat and Maine Street would all be considered the same using the soundex() function. When you have a match on, say, soundex(street_address) + zip + name, you can be reasonably certain that it is the same address even when it contains misspellings and/or typos.
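To show how such a match key behaves, here is a small Python sketch. It uses a simplified four-character American Soundex applied word by word, which is an assumption of this example and may not give exactly the same codes as the SAS soundex() function. The match_key helper and the ZIP value 12345 are hypothetical; a name field could be appended to the key in the same way.

```python
import re

# Letter-to-digit table for a simplified American Soundex
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Simplified 4-character Soundex; not guaranteed to match SAS soundex()."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != prev:
            out.append(code)
        if c not in "HW":          # H and W do not separate repeated codes
            prev = code
    return ("".join(out) + "000")[:4]

def match_key(street_address, zip_code):
    # Keep house numbers as-is, Soundex-encode each word, then append the ZIP
    words = re.findall(r"[A-Za-z]+|\d+", street_address)
    encoded = [w if w.isdigit() else soundex(w) for w in words]
    return " ".join(encoded) + "|" + zip_code

key_a = match_key("128 Main Street", "12345")   # 12345 is a made-up ZIP
key_b = match_key("128 Maine Streat", "12345")
print(key_a, key_b, key_a == key_b)
```

Because Soundex is deliberately loose, it also merges streets that are genuinely different but sound alike, so keeping the house number and ZIP in the key matters.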