BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jrrwactuary
Fluorite | Level 6

I am attempting to clean a variable called Member_Address where the goal is to take extract on the street name and number without including any suffixes such as apartment number. In the code chunk below, I created a sample of 4 addresses where I am running into some issues and an example of 1 address, "45 GOODE ST#4", that is being cleaned correctly.

DATA sample;
	length Member_Address $250;
	input Member_Address & $; *addresses are fictional representations of member data, but modeled after real issues;
	datalines;
45 GOODE ST#4 123 HUNTINGTON RD.UNT#4 45 LANCELOT DR. LOT 36 1314 NOVA RD RALEIGH, NC 76616 PO BOX 18239232 1516 FOURTH AVE ; DATA clean_sample; format Member_Address Orig_Member_Address; set sample; if Member_Address = '' then delete; Orig_Member_Address = Member_Address; if find(Member_Address,"ACP") then delete; if prxmatch("/(APT|APARTMENT|LOT|NONE|#|UNIT|UNT|ATTN|PO BOX|POBOX|STE|SUITE|BUILDING|BLDG|`)/",Member_Address) > 0 then Member_Address = substr(Member_Address,1,length(Member_Address)-(length(Member_Address)-prxmatch("/(APT|APARTMENT|LOT|NONE|#|UNIT|UNT|ATTN|PO BOX|POBOX|STE|SUITE|BUILDING|BLDG|`)/",Member_Address))-1); else if prxmatch("/(([a-zA-Z])|(\.))[0-9]/",Member_Address) > 0 then Member_Address = substr(Member_Address,1,length(Member_Address)-(length(Member_Address)-prxmatch("/(([a-zA-Z])|(\.))[0-9]/",Member_Address))-1); RUN;

The output showing how only the first address is cleaned correctly is:

jrrwactuary_1-1630602220416.png

The first "if" statement in the code removes  blank addresses, the second "if" removes addresses that fall into the Address Confidentiality Program (APC), the third find patterns of known address suffixes and truncates the address at that point, and the last truncates addresses in the form "123 FAKE RD5" or "123 FAKE RD.5".

 

Going off the logic above, for example, the two addresses: "1003 GOODWIN RDAPT 6" and "65 AIRPORT RDPO BOX 1233432" would turn into: "1003 GOODWIN RD" and "65 AIRPORT RD". The issues I'm running into fall into the following scenarios where my pattern matching logic falls apart:

  1. An apartment/unit/etc. suffix is present in the street name (e.g. street name "LANCELOT" has "LOT" in the name, but also a "LOT" suffix)
  2. The full address is in the line including city, state, and zip code
  3. The address begins with "PO BOX" or "POBOX"

I was hoping that someone might be able to point me in the right direction for what patterns I can add to make sure addresses like the ones shown above get cleaned, or if there is a better way to go about cleaning data like this.

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

First you need to decide how to handle the strings you find.

For a string such as  "45 LANCELOT DRLOT 36" a human intervention might be necessary, unless you try to detect all combinations of (DR|ST|AVE)+(LOT|APT|UNIT) values.

 

For simpler cases such as "45 LANCELOT DR.LOT 36" use the \b word boundary anchor.

POS = prxmatch("/\b(APT|APARTMENT|LOT|NONE|#|UNIT|UNT|ATTN|PO BOX|POBOX|STE|SUITE|BUILDING|BLDG|`)\b/", MEMBER_ADDRESS) ;
if POS > 0 then MEMBER_ADDRESS = substr(MEMBER_ADDRESS, 1, POS);

 

Cleaning addresses is a very iterative process: Add some logic, see the effect, see what's missing, add more logic.

Having a reference table of addresses, and splitting the address into its components, helps a lot as it gives you a reference. You can see how much of your addresses match against the reference table.

View solution in original post

5 REPLIES 5
SASKiwi
PROC Star

How many addresses are you trying to clean? If it is a few thousand then a DIY approach is feasible. If we are talking millions then I'd suggest using software with built-in address-cleaning capabilities is a better approach.

 

SAS Data Quality provides this. It includes an address knowledge base for the country whose data you wish to work on. It also goes one step further and validates addresses to flag if they are genuine or not. This is very useful for mailing addresses. Obviously this option is expensive if you don't currently have the software and it is more suited to providing company-wide data quality improvement capabilities.

 

jrrwactuary
Fluorite | Level 6

The code will be used to clean hundreds to thousands of addresses from a variety of client-specific files, so the different cases for wrong addresses could quickly balloon up.

 

I may have found a potential solution, however, courtesy of Roger DeAngelis (https://github.com/rogerjdeangelis/utl_US_address-standardization). If I understand it correctly, the code will split an address into its component pieces. What I am thinking of doing is using a similar pattern matching to my original post, but instead inserting white spaces where the address components are pushed together (i.e. "123 MAIN STAPT4" turns into "123 MAIN ST APT4"). From there I was going to use the code I linked to grab the address components and then push the results together, which I believe would be {"123","MAIN","ST","APT4"} --> catx(' ',"123","MAIN","ST") --> "123 MAIN ST".

jrrwactuary
Fluorite | Level 6
Actually, scratch that. The code still falls apart when an apartment/unit/etc. suffix is present in the street name (e.g. street name "LANCELOT" has "LOT" in the name, but also a "LOT" suffix) as a white space would be inserted in the wrong places for an address such as "45 LANCELOT DR.LOT 36", which would turn into "45 LANCE LOT DR. LOT 36". I can still attempt to run the code I linked in the previous response and see how it handles the addresses.
ChrisNZ
Tourmaline | Level 20

First you need to decide how to handle the strings you find.

For a string such as  "45 LANCELOT DRLOT 36" a human intervention might be necessary, unless you try to detect all combinations of (DR|ST|AVE)+(LOT|APT|UNIT) values.

 

For simpler cases such as "45 LANCELOT DR.LOT 36" use the \b word boundary anchor.

POS = prxmatch("/\b(APT|APARTMENT|LOT|NONE|#|UNIT|UNT|ATTN|PO BOX|POBOX|STE|SUITE|BUILDING|BLDG|`)\b/", MEMBER_ADDRESS) ;
if POS > 0 then MEMBER_ADDRESS = substr(MEMBER_ADDRESS, 1, POS);

 

Cleaning addresses is a very iterative process: Add some logic, see the effect, see what's missing, add more logic.

Having a reference table of addresses, and splitting the address into its components, helps a lot as it gives you a reference. You can see how much of your addresses match against the reference table.

jrrwactuary
Fluorite | Level 6

Apologies as I took a break from this problem. But you were right, it ended up being just a very iterative process. I found some different pattern combinations that worked, but it was not quite the general cleaning I was hoping it to be.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 3918 views
  • 0 likes
  • 3 in conversation