- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Posted 12-09-2010 03:08 PM
(4852 views)
Here are two address:
128 W. Main Street, Noland, NW
128 West Main St., Noland, NW
There are the same address. I want to know if SAS has any tool to tell me they are the same. The addresses can be a lot more complicated than the above. It is hard to come up the rules and put into a program to match them all.
128 W. Main Street, Noland, NW
128 West Main St., Noland, NW
There are the same address. I want to know if SAS has any tool to tell me they are the same. The addresses can be a lot more complicated than the above. It is hard to come up the rules and put into a program to match them all.
8 REPLIES 8
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello MarcTC,
I think that this can be achieved by formats. However, you still have to determine all rules to identify possible variations. For example, for street:
[pre]
proc format;
value $st
"ST."="Street"
"ST"="Street"
"Street"="Street"
;
run;
[/pre]
Sincerely,
SPR
I think that this can be achieved by formats. However, you still have to determine all rules to identify possible variations. For example, for street:
[pre]
proc format;
value $st
"ST."="Street"
"ST"="Street"
"Street"="Street"
run;
[/pre]
Sincerely,
SPR
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi:
I know that the DataFlux tool, available as part of the Data Integration Studio can be used to "clean up" data like you show.
For more information, refer to:
http://www.dataflux.com/home.aspx?lang=en-us
http://www.sas.com/data-quality/df-integration-server/index.html
http://www.sas.com/technologies/dw/index.html
I'm not sure whether the DataFlux Studio is still a standalone product or not. You may want to check with your Sales Rep or with Tech Support on this.
cynthia
I know that the DataFlux tool, available as part of the Data Integration Studio can be used to "clean up" data like you show.
For more information, refer to:
http://www.dataflux.com/home.aspx?lang=en-us
http://www.sas.com/data-quality/df-integration-server/index.html
http://www.sas.com/technologies/dw/index.html
I'm not sure whether the DataFlux Studio is still a standalone product or not. You may want to check with your Sales Rep or with Tech Support on this.
cynthia
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I'm not sure that SAS has that many built in tools (though it does have the functions that allow one to build them). However, there is lots of commercial software for doing "address normalization." I am familiar with them for US mailings (I've used this company, but there are others, http://www.melissadata.com/ ).
If you first ran all of your addresses through a program like they sell, then the matching process would be much simpler.
If you first ran all of your addresses through a program like they sell, then the matching process would be much simpler.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Can Proc Geocode do such address standardization/correction/normalization? I don't have SAS 9.2. TS2M3, so can't test it out.
According to some internet posts, Google MAP API can do this task. SAS has a product called SAS Google Map Generator. I wonder if this generator allows users to access Google MAP's address normalization function.
According to some internet posts, Google MAP API can do this task. SAS has a product called SAS Google Map Generator. I wonder if this generator allows users to access Google MAP's address normalization function.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi
I believe with SAS it's either the data quality server (data flux) or you have to develop a set of Regular Expressions (which will be painfull).
Just did a quick google search.
There are sites with RegEx patterns which might help you, i.e:
http://regexlib.com/DisplayPatterns.aspx?categoryId=7&cattabindex=6
My thinking is:
You could use PRXCHANGE() to transform similar patterns to one standard string - and then compare these standard strings.
HTH
Patrick
Message was edited by: Patrick
I believe with SAS it's either the data quality server (data flux) or you have to develop a set of Regular Expressions (which will be painfull).
Just did a quick google search.
There are sites with RegEx patterns which might help you, i.e:
http://regexlib.com/DisplayPatterns.aspx?categoryId=7&cattabindex=6
My thinking is:
You could use PRXCHANGE() to transform similar patterns to one standard string - and then compare these standard strings.
HTH
Patrick
Message was edited by: Patrick
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Just check online articles. It seems PROC GEOCODE's street level address geocoding only produce X/Y coordinateness and no normalized address. Can I use X/Y coordinates to match the data?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Doc and other resources for SAS Data Quality Server are on its Software Product page:
http://support.sas.com/software/products/dataqual/
SAS Data Quality server is sold as part of the SAS Data Quality Solution, and it is also available through the software offerings, SAS Data Integration Server and SAS Enterprise Data Integration Server:
http://support.sas.com/documentation/onlinedoc/dis/
http://support.sas.com/documentation/onlinedoc/entdis/
These offerings include DataFlux products that the SAS Data Quality Server interacts with.
http://support.sas.com/software/products/dataqual/
SAS Data Quality server is sold as part of the SAS Data Quality Solution, and it is also available through the software offerings, SAS Data Integration Server and SAS Enterprise Data Integration Server:
http://support.sas.com/documentation/onlinedoc/dis/
http://support.sas.com/documentation/onlinedoc/entdis/
These offerings include DataFlux products that the SAS Data Quality Server interacts with.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have done this process before and the basic procedure was to first of all separate out the parts of the address into their own fields (ie. street_address, city, state, zip). The easiest way to do the former is with regular expressions.
Secondly you will need to standardize the wording in the addresses, eg. substitute all abbreviations with the full name. In your example above, if you came across "W" or "W." or "Wst" etc in your street_address field change them all to "West". The easiest way to get a list of common abbreviations is to 'tokenize' the entire address so that you get a frequency count of the words used in the dataset. Common abbreviations like Rd, Ln, St etc will bubble to the top. You then manually make a mapping using whatever technique you like best.
Lastly you can use the SAS soundex() function to identify addresses that are the same but may contain typos or misspellings. Ie. Main Street, Main Streat, Maine Street would all be considered the same using the soundex() function. When you have a match on say the soundex(street_address) + zip + name you can be reasonably certain that it is the same address even when they have misspellings and/or typos.
Hope this helps.
Cheers
Rob
Secondly you will need to standardize the wording in the addresses, eg. substitute all abbreviations with the full name. In your example above, if you came across "W" or "W." or "Wst" etc in your street_address field change them all to "West". The easiest way to get a list of common abbreviations is to 'tokenize' the entire address so that you get a frequency count of the words used in the dataset. Common abbreviations like Rd, Ln, St etc will bubble to the top. You then manually make a mapping using whatever technique you like best.
Lastly you can use the SAS soundex() function to identify addresses that are the same but may contain typos or misspellings. Ie. Main Street, Main Streat, Maine Street would all be considered the same using the soundex() function. When you have a match on say the soundex(street_address) + zip + name you can be reasonably certain that it is the same address even when they have misspellings and/or typos.
Hope this helps.
Cheers
Rob