Hi there
I am working with a dataset which contains addresses of businesses. There are many non-exact duplicates because of the address being written in different ways (eg marsden street, marsden st). Is there a way to combine these addresses and find duplicates.
Thanks
Actually it is hard to find them. You have to clean the data using tranwrd function
example
address =tranwrd(address,'st.', 'Street');
address =tranwrd(address,'St.', 'Street');
address =tranwrd(address,'st', 'Street');
address =tranwrd(address,'St', 'Street');
address =tranwrd(address,'street', 'Street');
you can use all of them in one data step. Alternatively you break the address into parts. You will also find street is written as avenue or road or rd... I had gone through that for a dataset with 4 million records; Sort them by street name and suburb name, and then it would be easier.
Thanks Mit
This is a long shot but if you have a GIS product, either SAS or other that does geocoding from addresses, I would give that a shot. Because your situation is pretty typical many geocoding applications know how to handle variations of many address components. It could also provide you a list of those not codeable which could identify the really creative spellings.
To standardize addresses is a typical task where you would use DataFlux.
You can start and code this by yourself as Mit proposes but it will be a lot of work and the result will never be as good as what comes almost "out-of-the-box" with DataFlux.
I agree with ballardw. I have used Mapinfo long ego. But the prolem is that this will match to the correct address If the addresses are mis-spelled then there is no other way than cleaning the data. So I followed the following steps:
1. sort and summarise by the street and suburb.
2.Match with Mapinfo
3. Find out the addresses not matched and then clean them
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.