Solved: Re: Street Address Data Cleaning Using prxmatch

jrrwactuary · Posted 09-02-2021 01:11 PM

I am attempting to clean a variable called Member_Address where the goal is to take extract on the street name and number without including any suffixes such as apartment number. In the code chunk below, I created a sample of 4 addresses where I am running into some issues and an example of 1 address, "45 GOODE ST#4", that is being cleaned correctly.

DATA sample;
	length Member_Address $250;
	input Member_Address & $; *addresses are fictional representations of member data, but modeled after real issues;
	datalines;
45 GOODE ST#4
123 HUNTINGTON RD.UNT#4
45 LANCELOT DR. LOT 36
1314 NOVA RD RALEIGH, NC 76616
PO BOX 18239232 1516 FOURTH AVE
;

DATA clean_sample;
	format Member_Address Orig_Member_Address;
	set sample;
	if Member_Address = '' then delete;
	Orig_Member_Address = Member_Address;
	if find(Member_Address,"ACP") then delete;
	if prxmatch("/(APT|APARTMENT|LOT|NONE|#|UNIT|UNT|ATTN|PO BOX|POBOX|STE|SUITE|BUILDING|BLDG|`)/",Member_Address) > 0
		then Member_Address = substr(Member_Address,1,length(Member_Address)-(length(Member_Address)-prxmatch("/(APT|APARTMENT|LOT|NONE|#|UNIT|UNT|ATTN|PO BOX|POBOX|STE|SUITE|BUILDING|BLDG|`)/",Member_Address))-1);
	else if prxmatch("/(([a-zA-Z])|(\.))[0-9]/",Member_Address) > 0 
		then Member_Address = substr(Member_Address,1,length(Member_Address)-(length(Member_Address)-prxmatch("/(([a-zA-Z])|(\.))[0-9]/",Member_Address))-1);
RUN;

The output showing how only the first address is cleaned correctly is:

The first "if" statement in the code removes blank addresses, the second "if" removes addresses that fall into the Address Confidentiality Program (APC), the third find patterns of known address suffixes and truncates the address at that point, and the last truncates addresses in the form "123 FAKE RD5" or "123 FAKE RD.5".

Going off the logic above, for example, the two addresses: "1003 GOODWIN RDAPT 6" and "65 AIRPORT RDPO BOX 1233432" would turn into: "1003 GOODWIN RD" and "65 AIRPORT RD". The issues I'm running into fall into the following scenarios where my pattern matching logic falls apart:

An apartment/unit/etc. suffix is present in the street name (e.g. street name "LANCELOT" has "LOT" in the name, but also a "LOT" suffix)
The full address is in the line including city, state, and zip code
The address begins with "PO BOX" or "POBOX"

I was hoping that someone might be able to point me in the right direction for what patterns I can add to make sure addresses like the ones shown above get cleaned, or if there is a better way to go about cleaning data like this.

ChrisNZ · Posted 09-02-2021 07:36 PM

First you need to decide how to handle the strings you find.

For a string such as "45 LANCELOT DRLOT 36" a human intervention might be necessary, unless you try to detect all combinations of (DR|ST|AVE)+(LOT|APT|UNIT) values.

For simpler cases such as "45 LANCELOT DR.LOT 36" use the \b word boundary anchor.

POS = prxmatch("/\b(APT|APARTMENT|LOT|NONE|#|UNIT|UNT|ATTN|PO BOX|POBOX|STE|SUITE|BUILDING|BLDG|`)\b/", MEMBER_ADDRESS) ;
if POS > 0 then MEMBER_ADDRESS = substr(MEMBER_ADDRESS, 1, POS);

Cleaning addresses is a very iterative process: Add some logic, see the effect, see what's missing, add more logic.

Having a reference table of addresses, and splitting the address into its components, helps a lot as it gives you a reference. You can see how much of your addresses match against the reference table.

High-Performance SAS Coding - Third Edition

View solution in original post

SASKiwi · Posted 09-02-2021 06:08 PM

How many addresses are you trying to clean? If it is a few thousand then a DIY approach is feasible. If we are talking millions then I'd suggest using software with built-in address-cleaning capabilities is a better approach.

SAS Data Quality provides this. It includes an address knowledge base for the country whose data you wish to work on. It also goes one step further and validates addresses to flag if they are genuine or not. This is very useful for mailing addresses. Obviously this option is expensive if you don't currently have the software and it is more suited to providing company-wide data quality improvement capabilities.

jrrwactuary · Posted 09-02-2021 06:48 PM

The code will be used to clean hundreds to thousands of addresses from a variety of client-specific files, so the different cases for wrong addresses could quickly balloon up.

I may have found a potential solution, however, courtesy of Roger DeAngelis (https://github.com/rogerjdeangelis/utl_US_address-standardization). If I understand it correctly, the code will split an address into its component pieces. What I am thinking of doing is using a similar pattern matching to my original post, but instead inserting white spaces where the address components are pushed together (i.e. "123 MAIN STAPT4" turns into "123 MAIN ST APT4"). From there I was going to use the code I linked to grab the address components and then push the results together, which I believe would be {"123","MAIN","ST","APT4"} --> catx(' ',"123","MAIN","ST") --> "123 MAIN ST".

jrrwactuary · Posted 09-02-2021 06:58 PM

Actually, scratch that. The code still falls apart when an apartment/unit/etc. suffix is present in the street name (e.g. street name "LANCELOT" has "LOT" in the name, but also a "LOT" suffix) as a white space would be inserted in the wrong places for an address such as "45 LANCELOT DR.LOT 36", which would turn into "45 LANCE LOT DR. LOT 36". I can still attempt to run the code I linked in the previous response and see how it handles the addresses.

ChrisNZ · Posted 09-02-2021 07:36 PM

First you need to decide how to handle the strings you find.

For a string such as "45 LANCELOT DRLOT 36" a human intervention might be necessary, unless you try to detect all combinations of (DR|ST|AVE)+(LOT|APT|UNIT) values.

For simpler cases such as "45 LANCELOT DR.LOT 36" use the \b word boundary anchor.

POS = prxmatch("/\b(APT|APARTMENT|LOT|NONE|#|UNIT|UNT|ATTN|PO BOX|POBOX|STE|SUITE|BUILDING|BLDG|`)\b/", MEMBER_ADDRESS) ;
if POS > 0 then MEMBER_ADDRESS = substr(MEMBER_ADDRESS, 1, POS);

Cleaning addresses is a very iterative process: Add some logic, see the effect, see what's missing, add more logic.

Having a reference table of addresses, and splitting the address into its components, helps a lot as it gives you a reference. You can see how much of your addresses match against the reference table.

High-Performance SAS Coding - Third Edition

jrrwactuary · Posted 09-14-2021 12:26 PM

Apologies as I took a break from this problem. But you were right, it ended up being just a very iterative process. I found some different pattern combinations that worked, but it was not quite the general cleaning I was hoping it to be.

SAS Innovate 2025: Call for Content

Classroom Training Available!