Hi SAS Community,
Like others before me, I am faced with cleaning addresses without the likes of SAS Data Quality. I've already applied prxparse to my dataset as far as I could, but I've run into a few additional situations I'd like to clean and I'm unsure how to go about it (if at all). Fortunately for my case (I suppose?), I have historical addresses over several years of data. I'd like to apply two more steps to my data, but I'm unsure how to do this or whether it would open me up to additional problems that I'm just not seeing:
1. I want to harmonize my data across years; for example, I have the same address listed differently (example below)
2. If records share the same street number from year to year, I can assume they are the same address. For the cases I've encountered so far, that assumption holds (example below)
Data NewData;
INFILE DATALINES DSD;
input id $ name $ addy ~$25. city $ state $ zip $ cleaned_addy ~$25. year $;
DATALINES;
01,ABCSTORE,123 MAIN HIGHWAY 75,MOBILE,AL,36619,123 MAIN HIGHWAY,2011
01,ABCSTORE,123 MAIN HIGHWAY 75,MOBILE,AL,36619,123 MAIN HIGHWAY,2012
01,ABCSTORE,123 MAIN STREET,MOBILE,AL,36619,123 MAIN STREET,2013
01,ABCSTORE,123 MAIN ST,MOBILE,AL,36619,123 MAIN STREET,2014
01,ABCSTORE,123 MAIN,MOBILE,AL,36619,123 MAIN,2015
;
run;
In the desired output below, since the records start with the same street number for the same id, they are treated as the same address, and the address is harmonized across years to account for differences in how it is written.
Data WANT;
INFILE DATALINES DSD;
input id $ name $ addy ~$25. city $ state $ zip $ cleaned_addy ~$25. year $;
DATALINES;
01,ABCSTORE,123 MAIN HIGHWAY 75,MOBILE,AL,36619,123 MAIN STREET,2011
01,ABCSTORE,123 MAIN HIGHWAY 75,MOBILE,AL,36619,123 MAIN STREET,2012
01,ABCSTORE,123 MAIN STREET,MOBILE,AL,36619,123 MAIN STREET,2013
01,ABCSTORE,123 MAIN ST,MOBILE,AL,36619,123 MAIN STREET,2014
01,ABCSTORE,123 MAIN,MOBILE,AL,36619,123 MAIN STREET,2015
;
run;
Any guidance on steps I can apply to achieve this or potential issues would be much appreciated!
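For step 2, a minimal sketch of building the grouping key (assuming the street number is always the leading token of the address; street_no and keyed are hypothetical names):
data keyed;
set NewData;
length street_no $10;
/* leading token of the address, e.g. "123" */
street_no=scan(addy,1,' ');
run;

proc sort data=keyed;
by id street_no year;
run;
Records that share id and street_no would then be candidates for the same address.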
For your simple example, code like below should do.
Data NewData;
INFILE DATALINES DSD;
input id $ name $ addy ~$25. city $ state $ zip $ addy_want ~$25. year $;
DATALINES;
01,ABCSTORE,123 MAIN HIGHWAY 75,MOBILE,AL,36619,123 MAIN HIGHWAY,2011
01,ABCSTORE,123 MAIN HIGHWAY 75,MOBILE,AL,36619,123 MAIN HIGHWAY,2012
01,ABCSTORE,123 MAIN STREET,MOBILE,AL,36619,123 MAIN STREET,2013
01,ABCSTORE,123 MAIN ST,MOBILE,AL,36619,123 MAIN STREET,2014
01,ABCSTORE,123 MAIN,MOBILE,AL,36619,123 MAIN,2015
;
data standardize;
set NewData;
/* upper-case and collapse multiple blanks */
addy_stdz=compbl(upcase(addy));
/* drop trailing digits, e.g. "123 MAIN HIGHWAY 75" -> "123 MAIN HIGHWAY" */
addy_stdz=prxchange('s/^(.*?)\d*$/$1/i',1,strip(addy_stdz));
/* expand trailing abbreviations: ST -> STREET, AVE -> AVENUE */
addy_stdz=prxchange('s/^(.*?)\.? *st$/$1 STREET/i',1,strip(addy_stdz));
addy_stdz=prxchange('s/^(.*?)\.? *ave$/$1 AVENUE/i',1,strip(addy_stdz));
run;
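This standardizes each row, but the 2015 row would still come out as 123 MAIN. To also harmonize across years, one option is to pick a single standardized form per id and copy it to every year. A minimal sketch, assuming the most frequent standardized form per id is the one to keep (which form should win, e.g. MAIN STREET vs. MAIN HIGHWAY, is ultimately a business rule you have to decide; addy_harmonized is a made-up name):
proc freq data=standardize noprint;
tables id*addy_stdz / out=counts(drop=percent);
run;

proc sort data=counts;
by id descending count;
run;

data pick;
set counts;
by id;
/* most frequent standardized form per id; ties are broken arbitrarily */
if first.id;
rename addy_stdz=addy_harmonized;
keep id addy_stdz;
run;

proc sort data=standardize;
by id;
run;

data harmonized;
merge standardize pick;
by id;
run;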
Address cleansing quickly gets very involved, and you need to find the balance between the effort you can spend and the quality you need.
If this is a one-off with a not-too-high data volume, then some programmatic standardization followed by manual checks and amendments is probably quickest.
If it's a one-off with a bigger data volume, then consider using a specialized 3rd-party provider for address cleansing (besides rule-based approaches, such providers also use address databases against which they can verify addresses).
If it's something you will need to do regularly for high data volumes, then it may be worth considering the purchase of specialized software (like SAS Data Quality/DataFlux) or contracting with a 3rd-party provider.
1) extract constant parts into variables
E.g., if find(string, "Mobile", "i") then city = "Mobile";
2) group by those constant variables*
*ideally take those constant parts out of the strings, e.g., using TRANWRD, and just keep them in variables
3) create variable “length” with length of each string
4) sort by group and descending length, and choose the longest string in each group (the first record in each BY group) as your "standard"
5) compare the strings in each group to its "standard", e.g., using the COMPGED function*
*There are other functions that can compare two strings in SAS for fuzzy matching, such as COMPLEV.
6) the function will return distance scores
7) addresses within reasonable distance of the standard are the same
What is a reasonable distance? You will have to find out experimentally, e.g., take a pair with a score of 100 and see if it looks like the same address.
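A rough sketch of these steps on data like the example above, assuming id is the grouping variable and an arbitrary COMPGED cutoff of 100 that you would have to tune:
data lengths;
set NewData;
len=length(addy);
run;

proc sort data=lengths;
by id descending len;
run;

data scored;
set lengths;
by id;
length standard $25;
retain standard;
/* longest address in the group becomes the "standard" */
if first.id then standard=addy;
/* generalized edit distance to the standard */
score=compged(addy,standard);
/* cutoff is a guess; tune it on your data */
same_address=(score<=100);
run;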
What rule do you apply to decide what is the final cleaned address version? How do you know 123 MAIN HIGHWAY is the same address as 123 MAIN STREET?
FYI, I use SAS Data Quality for cleaning and matching addresses. In SAS DQ this is called address standardization. In my experience, you really need a separate reference source of addresses to compare against when tidying up addresses. Guessing a correct address just within your own data might work for minor misspellings but I don't think it will work accurately for streets with multiple names.