SAS Programming

A_Swoosh · Posted 08-11-2022 07:54 PM

Hi SAS Community,

Like others before me, I am faced with cleaning addresses without the likes of SAS Data Quality. I've already applied prxparse to my dataset to the extent I could and encountered a few additional situations that I'd like to clean but unsure how to go about this (if at all). Fortunate for my case (I suppose?), I have historical addresses or several years of data. I wanted to see if I could apply 2 more steps to my data but unsure about how to go about this or if this would open me up to additional problems that I'm just not seeing:

1. I want to harmonize my data across years; for example, I have the same address listed differently (example below)

They share the same city,state,zip code
They appear to share the same address too but stated differently

2. If they share the same street number from year to year, I can assume it's the same address. From those I've encountered, it's the case (example below)

Data NewData;
INFILE DATALINES DSD;
input id $  name $ addy ~$25. city $ state $ zip $ cleaned_addy ~$25. year $;
DATALINES;
01,ABCSTORE,123 MAIN HIGHWAY 75,MOBILE,AL,36619,123 MAIN HIGHWAY,2011 
01,ABCSTORE,123 MAIN HIGHWAY 75,MOBILE,AL,36619,123 MAIN HIGHWAY,2012 
01,ABCSTORE,123 MAIN STREET,MOBILE,AL,36619,123 MAIN STREET,2013 
01,ABCSTORE,123 MAIN ST,MOBILE,AL,36619,123 MAIN STREET,2014 
01,ABCSTORE,123 MAIN,MOBILE,AL,36619,123 MAIN,2015 
;
run;

In the example below, since they start with the same street number for the same id, they are the same. Similarly, the address is harmonized across year accounting for differences in how the address is stated.

Data WANT;
INFILE DATALINES DSD;
input id $  name $ addy ~$25. city $ state $ zip $ cleaned_addy ~$25. year $;
DATALINES;
01,ABCSTORE,123 MAIN HIGHWAY 75,MOBILE,AL,36619,123 MAIN STREET,2011 
01,ABCSTORE,123 MAIN HIGHWAY 75,MOBILE,AL,36619,123 MAIN STREET,2012 
01,ABCSTORE,123 MAIN STREET,MOBILE,AL,36619,123 MAIN STREET,2013 
01,ABCSTORE,123 MAIN ST,MOBILE,AL,36619,123 MAIN STREET,2014 
01,ABCSTORE,123 MAIN,MOBILE,AL,36619,123 MAIN STREET,2015 
;
run;

Any guidance on steps I can apply to achieve this or potential issues would be much appreciated!

Patrick · Posted 08-13-2022 07:43 PM

Just for your simple example code like below should do.

Data NewData;
  INFILE DATALINES DSD;
  input id $  name $ addy ~$25. city $ state $ zip $ addy_want ~$25. year $;
  DATALINES;
01,ABCSTORE,123 MAIN HIGHWAY 75,MOBILE,AL,36619,123 MAIN HIGHWAY,2011 
01,ABCSTORE,123 MAIN HIGHWAY 75,MOBILE,AL,36619,123 MAIN HIGHWAY,2012 
01,ABCSTORE,123 MAIN STREET,MOBILE,AL,36619,123 MAIN STREET,2013 
01,ABCSTORE,123 MAIN ST,MOBILE,AL,36619,123 MAIN STREET,2014 
01,ABCSTORE,123 MAIN,MOBILE,AL,36619,123 MAIN,2015 
;

data standardize;
  set NewData;
  addy_stdz=compbl(upcase(addy));
  addy_stdz=prxchange('s/^(.*?)\d*$/$1/i',1,strip(addy_stdz));
  addy_stdz=prxchange('s/^(.*?)\.? *st$/$1 STREET/i',1,strip(addy_stdz));
  addy_stdz=prxchange('s/^(.*?)\.? *ave$/$1 AVENUE/i',1,strip(addy_stdz));
run;

Address cleansing gets quickly very involved and you need to find the balance between effort and quality you need.

If this is a one-off with a not to high data volume then eventually some programmatical standardization and then some manual checks with further amendments are quickest.

If it's a one-off with a bigger data volume then consider using some specialized 3rd party provider for address cleansing (besides of rule based approaches such providers also use address databases against which they can verify addresses).

If it's something you will need to do regularly for high data volumes then it's may-be worth to consider the purchase of specialized software (like SAS DataQuality/DataFlux) or to contract with a 3rd party provider.

View solution in original post

pink_poodle · Posted 08-11-2022 10:54 PM

1) extract constant parts into variables
E.g., if find (string, “Mobile”) then city = Mobile;
2) group by those constant variables*
*ideally take those constant parts out of the strings, e.g., using COMPRESS, just keep them in variables
3) create variable “length” with length of each string
4) sort by length descending and chose the longest string in each group (first.length) as your “standard”
5) compare strings in each group to its “standard” e.g., using COMPGED function*
There are other functions that can compare two strings in SAS, COMPLEV, etc, for fuzzy matching
6) the function will return distance scores
7) addresses within reasonable distance of the standard are the same
What is reasonable distance? You will have to find out experimentally - e.g., take one with score of 100 and see if it seems the same.

JosvanderVelden · Posted 08-12-2022 05:19 AM

Have you seen these papers:
https://analytics.ncsu.edu/sesug/2008/CC-028.pdf
https://www.lexjansen.com/scsug/2008/app/Pramod_Sambidi.pdf

SASKiwi · Posted 08-12-2022 07:46 PM

What rule do you apply to decide what is the final cleaned address version? How do you know 123 MAIN HIGHWAY is the same address as 123 MAIN STREET?

FYI, I use SAS Data Quality for cleaning and matching addresses. In SAS DQ this is called address standardization. In my experience, you really need a separate reference source of addresses to compare against when tidying up addresses. Guessing a correct address just within your own data might work for minor misspellings but I don't think it will work accurately for streets with multiple names.

A_Swoosh · Posted 08-13-2022 01:42 AM

Yea, I've heard of SAS Data Quality for cleaning addresses but unfortunately it's not an option for me so I'm left to utilize pxparse regex to standardize my addresses.

So, to get from 123 MAIN HIGHWAY to 123 MAIN STREET, I make an assumption that if the street number is the same, apply 1 address to all (preferably the latest year) since I have multiple years of data for the same store. It's some small chance that a store moves to another address that is the same street number but highly unlikely.

Patrick · Posted 08-13-2022 07:43 PM

Just for your simple example code like below should do.

Data NewData;
  INFILE DATALINES DSD;
  input id $  name $ addy ~$25. city $ state $ zip $ addy_want ~$25. year $;
  DATALINES;
01,ABCSTORE,123 MAIN HIGHWAY 75,MOBILE,AL,36619,123 MAIN HIGHWAY,2011 
01,ABCSTORE,123 MAIN HIGHWAY 75,MOBILE,AL,36619,123 MAIN HIGHWAY,2012 
01,ABCSTORE,123 MAIN STREET,MOBILE,AL,36619,123 MAIN STREET,2013 
01,ABCSTORE,123 MAIN ST,MOBILE,AL,36619,123 MAIN STREET,2014 
01,ABCSTORE,123 MAIN,MOBILE,AL,36619,123 MAIN,2015 
;

data standardize;
  set NewData;
  addy_stdz=compbl(upcase(addy));
  addy_stdz=prxchange('s/^(.*?)\d*$/$1/i',1,strip(addy_stdz));
  addy_stdz=prxchange('s/^(.*?)\.? *st$/$1 STREET/i',1,strip(addy_stdz));
  addy_stdz=prxchange('s/^(.*?)\.? *ave$/$1 AVENUE/i',1,strip(addy_stdz));
run;

Address cleansing gets quickly very involved and you need to find the balance between effort and quality you need.

If this is a one-off with a not to high data volume then eventually some programmatical standardization and then some manual checks with further amendments are quickest.

If it's a one-off with a bigger data volume then consider using some specialized 3rd party provider for address cleansing (besides of rule based approaches such providers also use address databases against which they can verify addresses).

If it's something you will need to do regularly for high data volumes then it's may-be worth to consider the purchase of specialized software (like SAS DataQuality/DataFlux) or to contract with a 3rd party provider.

SAS Programming

Harmonizing Address

Re: Harmonizing Address

Re: Harmonizing Address

Re: Harmonizing Address

Re: Harmonizing Address

Re: Harmonizing Address

Re: Harmonizing Address

Follow Us

What is...

SAS Programming

Harmonizing Address

Re: Harmonizing Address

Re: Harmonizing Address

Re: Harmonizing Address

Re: Harmonizing Address

Re: Harmonizing Address

Re: Harmonizing Address

Special offer for SAS Communities members

SAS Training: Just a Click Away

Follow Us

What is...