10-10-2013 09:29 PM
I am working with a dataset which contains addresses of businesses. There are many non-exact duplicates because the address is written in different ways (e.g. "marsden street" vs "marsden st"). Is there a way to standardise these addresses and find the duplicates?
10-10-2013 09:43 PM
Actually it is hard to find them directly. You have to clean the data first, for example with the TRANWRD function:
/* match whole words only, case-insensitively; a bare tranwrd(address,'St','Street')
   would also mangle the 'St' inside 'Street' and names like 'Stanley' */
address = prxchange('s/\b(street|st)\b\.?/Street/i', -1, address);
You can do all of these replacements in one DATA step. Alternatively, break the address into parts. You will also find "street" written as "avenue", "road", "rd", and so on. I went through this for a dataset with 4 million records; sort by street name and suburb name, and then it becomes much easier.
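The cleaning and duplicate-hunting described above can be sketched in one pass. This is a sketch only: the dataset name `businesses` and the variables `address` and `suburb` are assumptions, and you would extend the PRXCHANGE list with whatever abbreviations your own data contains.

```sas
/* Sketch -- dataset and variable names (businesses, address, suburb) are assumed */
data clean;
    set businesses;
    /* whole-word, case-insensitive replacements, so the 'St' inside
       'Street' or a name like 'Stanley' is left alone */
    address = prxchange('s/\b(street|st)\b\.?/Street/i', -1, address);
    address = prxchange('s/\b(road|rd)\b\.?/Road/i',     -1, address);
    address = prxchange('s/\b(avenue|ave)\b\.?/Avenue/i', -1, address);
    address = compbl(strip(address));  /* trim and collapse repeated blanks */
run;

proc sort data=clean;
    by suburb address;
run;

/* keep only the groups that occur more than once after standardising */
data dups;
    set clean;
    by suburb address;
    if not (first.address and last.address);
run;
```

The FIRST./LAST. test keeps every row whose standardised suburb/address combination appears more than once, which is exactly the list of suspected duplicates to review.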
10-11-2013 10:59 AM
This is a long shot, but if you have a GIS product, either SAS or another, that does geocoding from addresses, I would give that a shot. Because your situation is pretty typical, many geocoding applications know how to handle variations of the common address components. It could also give you a list of the addresses it could not geocode, which would identify the really creative spellings.
10-11-2013 09:20 PM
Standardizing addresses is a typical task for DataFlux.
You can start coding this yourself as Mit proposes, but it will be a lot of work, and the result will never be as good as what comes almost out of the box with DataFlux.
10-13-2013 06:39 PM
I agree with ballardw. I used MapInfo long ago. The problem is that it will only match to a correct address; if the addresses are misspelled, there is no way around cleaning the data. So I followed these steps:
1. Sort and summarise by street and suburb.
2. Match with MapInfo.
3. Find the addresses that did not match, then clean them.
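Step 3 above can be done with a match-merge and IN= flags. The dataset names here are assumptions: `clean` for the standardised addresses and `geocoded` for the addresses MapInfo matched, both keyed by suburb and address.

```sas
/* Sketch -- dataset names (clean, geocoded) and BY variables are assumed */
proc sort data=clean;    by suburb address; run;
proc sort data=geocoded; by suburb address; run;

/* keep the addresses present in the cleaned data but absent from the geocoded output */
data unmatched;
    merge clean(in=a) geocoded(in=b);
    by suburb address;
    if a and not b;
run;
```

The `unmatched` dataset then goes back through the cleaning step, and you repeat until the unmatched list stops shrinking.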