- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
I am working with a data set that contains addresses and the goal is to do a left join with a second data set and see which addresses are not found in both, but before I can do the left join, I need to clean up the addresses in my database and I'm looking for help on how to do that.
Some addresses have the street number, street name, street type. Some contain a suite number, some have N/S/E/W, some have St instead of street, Cres, Blvd, etc.
Anyone have any idea how I can clean this up so all addresses have the same format?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
A lot of previous articles and approaches listed here:
https://www.lexjansen.com/search/searchresults.php?q=clean%20up%20addresses
Or search on here.
However, if you have access to SAS Data Quality Studio - that's your best bet for sure. Pass both data sets through there to get cleaned and then you should be able to merge them much easier. Having experimented with this exact use case, Data Quality Studio had the best results, hands down.
@celbel wrote:
Hello,
I am working with a data set that contains addresses and the goal is to do a left join with a second data set and see which addresses are not found in both, but before I can do the left join, I need to clean up the addresses in my database and I'm looking for help on how to do that.
Some addresses have the street number, street name, street type. Some contain a suite number, some have N/S/E/W, some have St instead of street, Cres, Blvd, etc.
Anyone have any idea how I can clean this up so all addresses have the same format?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I second @Reeza 's response. I'm currently using SAS Data Quality for this very purpose. Cleaning up, correcting, standardising and matching addresses is a huge job. Any tool that automates these processes is definitely worth considering. Unfortunately it's a separate SAS product with its own licensing cost.
I suggest you check your data using exact matches on unchanged data to start with to get a better idea of how good your data quality is.