I am attempting to clean a variable called Member_Address where the goal is to take extract on the street name and number without including any suffixes such as apartment number. In the code chunk below, I created a sample of 4 addresses where I am running into some issues and an example of 1 address, "45 GOODE ST#4", that is being cleaned correctly. DATA sample;
length Member_Address $250;
input Member_Address & $; *addresses are fictional representations of member data, but modeled after real issues;
datalines; 45 GOODE ST#4
123 HUNTINGTON RD.UNT#4
45 LANCELOT DR. LOT 36
1314 NOVA RD RALEIGH, NC 76616
PO BOX 18239232 1516 FOURTH AVE
;
DATA clean_sample;
format Member_Address Orig_Member_Address;
set sample;
if Member_Address = '' then delete;
Orig_Member_Address = Member_Address;
if find(Member_Address,"ACP") then delete;
if prxmatch("/(APT|APARTMENT|LOT|NONE|#|UNIT|UNT|ATTN|PO BOX|POBOX|STE|SUITE|BUILDING|BLDG|`)/",Member_Address) > 0
then Member_Address = substr(Member_Address,1,length(Member_Address)-(length(Member_Address)-prxmatch("/(APT|APARTMENT|LOT|NONE|#|UNIT|UNT|ATTN|PO BOX|POBOX|STE|SUITE|BUILDING|BLDG|`)/",Member_Address))-1);
else if prxmatch("/(([a-zA-Z])|(\.))[0-9]/",Member_Address) > 0
then Member_Address = substr(Member_Address,1,length(Member_Address)-(length(Member_Address)-prxmatch("/(([a-zA-Z])|(\.))[0-9]/",Member_Address))-1);
RUN; The output showing how only the first address is cleaned correctly is: The first "if" statement in the code removes blank addresses, the second "if" removes addresses that fall into the Address Confidentiality Program (APC), the third find patterns of known address suffixes and truncates the address at that point, and the last truncates addresses in the form "123 FAKE RD5" or "123 FAKE RD.5". Going off the logic above, for example, the two addresses: "1003 GOODWIN RDAPT 6" and "65 AIRPORT RDPO BOX 1233432" would turn into: "1003 GOODWIN RD" and "65 AIRPORT RD". The issues I'm running into fall into the following scenarios where my pattern matching logic falls apart: An apartment/unit/etc. suffix is present in the street name (e.g. street name "LANCELOT" has "LOT" in the name, but also a "LOT" suffix) The full address is in the line including city, state, and zip code The address begins with "PO BOX" or "POBOX" I was hoping that someone might be able to point me in the right direction for what patterns I can add to make sure addresses like the ones shown above get cleaned, or if there is a better way to go about cleaning data like this.
... View more