topic Re: Finding and merging duplicates in a dataset in SAS Procedures

Finding and merging duplicates in a dataset

KingJ — Fri, 11 Oct 2013 01:29:44 GMT

Hi there

I am working with a dataset which contains addresses of businesses. There are many non-exact duplicates because of the address being written in different ways (eg marsden street, marsden st). Is there a way to combine these addresses and find duplicates.

Thanks

Re: Finding and merging duplicates in a dataset

Mit — Fri, 11 Oct 2013 01:43:38 GMT

Actually it is hard to find them. You have to clean the data using tranwrd function

example

address =tranwrd(address,'st.', 'Street');

address =tranwrd(address,'St.', 'Street');

address =tranwrd(address,'st', 'Street');

address =tranwrd(address,'St', 'Street');

address =tranwrd(address,'street', 'Street');

you can use all of them in one data step. Alternatively you break the address into parts. You will also find street is written as avenue or road or rd... I had gone through that for a dataset with 4 million records; Sort them by street name and suburb name, and then it would be easier.

Re: Finding and merging duplicates in a dataset

KingJ — Fri, 11 Oct 2013 02:35:21 GMT

Thanks Mit

Re: Finding and merging duplicates in a dataset

ballardw — Fri, 11 Oct 2013 14:59:09 GMT

This is a long shot but if you have a GIS product, either SAS or other that does geocoding from addresses, I would give that a shot. Because your situation is pretty typical many geocoding applications know how to handle variations of many address components. It could also provide you a list of those not codeable which could identify the really creative spellings.

Re: Finding and merging duplicates in a dataset

Patrick — Sat, 12 Oct 2013 01:20:27 GMT

To standardize addresses is a typical task where you would use DataFlux.

You can start and code this by yourself as Mit proposes but it will be a lot of work and the result will never be as good as what comes almost "out-of-the-box" with DataFlux.

Re: Finding and merging duplicates in a dataset

Mit — Sun, 13 Oct 2013 22:39:47 GMT

I agree with ballardw. I have used Mapinfo long ego. But the prolem is that this will match to the correct address If the addresses are mis-spelled then there is no other way than cleaning the data. So I followed the following steps:

1. sort and summarise by the street and suburb.

2.Match with Mapinfo

3. Find out the addresses not matched and then clean them