BookmarkSubscribeRSS Feed
MarcTC
Obsidian | Level 7
Here are two address:

128 W. Main Street, Noland, NW
128 West Main St., Noland, NW

There are the same address. I want to know if SAS has any tool to tell me they are the same. The addresses can be a lot more complicated than the above. It is hard to come up the rules and put into a program to match them all.
8 REPLIES 8
SPR
Quartz | Level 8 SPR
Quartz | Level 8
Hello MarcTC,

I think that this can be achieved by formats. However, you still have to determine all rules to identify possible variations. For example, for street:
[pre]
proc format;
value $st
"ST."="Street"
"ST"="Street"
"Street"="Street"
;
run;
[/pre]
Sincerely,
SPR
Cynthia_sas
SAS Super FREQ
Hi:
I know that the DataFlux tool, available as part of the Data Integration Studio can be used to "clean up" data like you show.

For more information, refer to:
http://www.dataflux.com/home.aspx?lang=en-us
http://www.sas.com/data-quality/df-integration-server/index.html
http://www.sas.com/technologies/dw/index.html

I'm not sure whether the DataFlux Studio is still a standalone product or not. You may want to check with your Sales Rep or with Tech Support on this.

cynthia
Doc_Duke
Rhodochrosite | Level 12
I'm not sure that SAS has that many built in tools (though it does have the functions that allow one to build them). However, there is lots of commercial software for doing "address normalization." I am familiar with them for US mailings (I've used this company, but there are others, http://www.melissadata.com/ ).

If you first ran all of your addresses through a program like they sell, then the matching process would be much simpler.
MarcTC
Obsidian | Level 7
Can Proc Geocode do such address standardization/correction/normalization? I don't have SAS 9.2. TS2M3, so can't test it out.

According to some internet posts, Google MAP API can do this task. SAS has a product called SAS Google Map Generator. I wonder if this generator allows users to access Google MAP's address normalization function.
Patrick
Opal | Level 21
Hi

I believe with SAS it's either the data quality server (data flux) or you have to develop a set of Regular Expressions (which will be painfull).

Just did a quick google search.
There are sites with RegEx patterns which might help you, i.e:
http://regexlib.com/DisplayPatterns.aspx?categoryId=7&cattabindex=6

My thinking is:
You could use PRXCHANGE() to transform similar patterns to one standard string - and then compare these standard strings.


HTH
Patrick

Message was edited by: Patrick
MarcTC
Obsidian | Level 7
Just check online articles. It seems PROC GEOCODE's street level address geocoding only produce X/Y coordinateness and no normalized address. Can I use X/Y coordinates to match the data?
SusanJ516_sas
SAS Employee
Doc and other resources for SAS Data Quality Server are on its Software Product page:

http://support.sas.com/software/products/dataqual/

SAS Data Quality server is sold as part of the SAS Data Quality Solution, and it is also available through the software offerings, SAS Data Integration Server and SAS Enterprise Data Integration Server:
http://support.sas.com/documentation/onlinedoc/dis/
http://support.sas.com/documentation/onlinedoc/entdis/

These offerings include DataFlux products that the SAS Data Quality Server interacts with.
r_bomb
Calcite | Level 5
I have done this process before and the basic procedure was to first of all separate out the parts of the address into their own fields (ie. street_address, city, state, zip). The easiest way to do the former is with regular expressions.

Secondly you will need to standardize the wording in the addresses, eg. substitute all abbreviations with the full name. In your example above, if you came across "W" or "W." or "Wst" etc in your street_address field change them all to "West". The easiest way to get a list of common abbreviations is to 'tokenize' the entire address so that you get a frequency count of the words used in the dataset. Common abbreviations like Rd, Ln, St etc will bubble to the top. You then manually make a mapping using whatever technique you like best.

Lastly you can use the SAS soundex() function to identify addresses that are the same but may contain typos or misspellings. Ie. Main Street, Main Streat, Maine Street would all be considered the same using the soundex() function. When you have a match on say the soundex(street_address) + zip + name you can be reasonably certain that it is the same address even when they have misspellings and/or typos.

Hope this helps.

Cheers
Rob

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 3860 views
  • 0 likes
  • 7 in conversation