Help using Base SAS procedures

Address Matching

Reply
Frequent Contributor
Posts: 77

Address Matching

Here are two address:

128 W. Main Street, Noland, NW
128 West Main St., Noland, NW

There are the same address. I want to know if SAS has any tool to tell me they are the same. The addresses can be a lot more complicated than the above. It is hard to come up the rules and put into a program to match them all.
Super Contributor
Super Contributor
Posts: 365

Re: Address Matching

Hello MarcTC,

I think that this can be achieved by formats. However, you still have to determine all rules to identify possible variations. For example, for street:
[pre]
proc format;
value $st
"ST."="Street"
"ST"="Street"
"Street"="Street"
;
run;
[/pre]
Sincerely,
SPR
SAS Super FREQ
Posts: 8,868

Re: Address Matching

Hi:
I know that the DataFlux tool, available as part of the Data Integration Studio can be used to "clean up" data like you show.

For more information, refer to:
http://www.dataflux.com/home.aspx?lang=en-us
http://www.sas.com/data-quality/df-integration-server/index.html
http://www.sas.com/technologies/dw/index.html

I'm not sure whether the DataFlux Studio is still a standalone product or not. You may want to check with your Sales Rep or with Tech Support on this.

cynthia
Trusted Advisor
Posts: 2,116

Re: Address Matching

I'm not sure that SAS has that many built in tools (though it does have the functions that allow one to build them). However, there is lots of commercial software for doing "address normalization." I am familiar with them for US mailings (I've used this company, but there are others, http://www.melissadata.com/ ).

If you first ran all of your addresses through a program like they sell, then the matching process would be much simpler.
Frequent Contributor
Posts: 77

Re: Address Matching

Can Proc Geocode do such address standardization/correction/normalization? I don't have SAS 9.2. TS2M3, so can't test it out.

According to some internet posts, Google MAP API can do this task. SAS has a product called SAS Google Map Generator. I wonder if this generator allows users to access Google MAP's address normalization function.
Respected Advisor
Posts: 4,173

Re: Address Matching

Hi

I believe with SAS it's either the data quality server (data flux) or you have to develop a set of Regular Expressions (which will be painfull).

Just did a quick google search.
There are sites with RegEx patterns which might help you, i.e:
http://regexlib.com/DisplayPatterns.aspx?categoryId=7&cattabindex=6

My thinking is:
You could use PRXCHANGE() to transform similar patterns to one standard string - and then compare these standard strings.


HTH
Patrick

Message was edited by: Patrick
Frequent Contributor
Posts: 77

Re: Address Matching

Just check online articles. It seems PROC GEOCODE's street level address geocoding only produce X/Y coordinateness and no normalized address. Can I use X/Y coordinates to match the data?
SAS Employee
Posts: 10

Re: Address Matching

Doc and other resources for SAS Data Quality Server are on its Software Product page:

http://support.sas.com/software/products/dataqual/

SAS Data Quality server is sold as part of the SAS Data Quality Solution, and it is also available through the software offerings, SAS Data Integration Server and SAS Enterprise Data Integration Server:
http://support.sas.com/documentation/onlinedoc/dis/
http://support.sas.com/documentation/onlinedoc/entdis/

These offerings include DataFlux products that the SAS Data Quality Server interacts with.
N/A
Posts: 1

Re: Address Matching

I have done this process before and the basic procedure was to first of all separate out the parts of the address into their own fields (ie. street_address, city, state, zip). The easiest way to do the former is with regular expressions.

Secondly you will need to standardize the wording in the addresses, eg. substitute all abbreviations with the full name. In your example above, if you came across "W" or "W." or "Wst" etc in your street_address field change them all to "West". The easiest way to get a list of common abbreviations is to 'tokenize' the entire address so that you get a frequency count of the words used in the dataset. Common abbreviations like Rd, Ln, St etc will bubble to the top. You then manually make a mapping using whatever technique you like best.

Lastly you can use the SAS soundex() function to identify addresses that are the same but may contain typos or misspellings. Ie. Main Street, Main Streat, Maine Street would all be considered the same using the soundex() function. When you have a match on say the soundex(street_address) + zip + name you can be reasonably certain that it is the same address even when they have misspellings and/or typos.

Hope this helps.

Cheers
Rob
Ask a Question
Discussion stats
  • 8 replies
  • 1318 views
  • 0 likes
  • 7 in conversation