Solved: How can I speed up address parsing that is taking too long?

Mike_B · Posted 07-22-2020 11:25 AM

I have a data set of 3k observations with a "city, state, zip" free-text variable named Residence which I need to break out into three separate columns. Because many of the addresses are garbage, I've resorted to joining it to sashelp.zipcode on a fuzzy match and then using regular expressions to identify city, state and zip matches. The join creates a data set of 9,866,133 records. Parsing the set takes days. What makes things worse is my set of 3k records is just a small subset of the set I will eventually need to parse monthly. Does anyone have any tips on how to speed things up? I'm using 64-bit base SAS 9.4M6, no fancy data-mining packages. I've tried running the program on my machine versus server SAS, and it takes a million years either way.

data Residences;
length Residence $50;
do Residence=
"MIAMI GARDENS/FL/33055",
"SARASOTA FL/ZIP)",
"TITUSVILLE, FL)",
"ADJUNTAS PRTE/ZIP)",
"OPA LOCKA/TARPON SPRINGS",
"SAINT PETERSBURG/FL/33710",
"DELAND,FL 32720)",
"DORAL, FL 33172)",
"MIAMI FL 33168P)",
"ADEL, GA 31620)",
"OSTRICA, LA 70041)",
"KINGSLAND, GA )",
"MILTON FL 32570)",
"LARGO,FL 32773 )",
"FORT PIERCE/FL/34945",
"+IAMI,FL",
"HOMELESS (850)",
"SABASTIAN, FLZIP)",
"PLANT CITY/TAMPA AREA",
"BUSHNELL, FZIP)"
;
output;
end;
run;

proc sql;
create table ParseThese as
select distinct city
	       ,statecode
	       ,zip
               ,rp.Residence
from sashelp.zipcode right join Residences as rp
on find(compress(Residence),statecode,'it')
or find(compress(Residence),stname(statecode),'it')
or find(compress(Residence),compress(city),'it')
or find(compress(Residence),cats(put(zip,z5.)),'it')
order by Residence;
quit;

*parse address;
data Parsed;
set ParseThese;
if anyalpha(Residence) then do;
   CityPattern=cats("/\b",upcase(city),"\b/");
   StatePattern=cats("/\b",upcase(statecode),"\b|\b",stname(statecode),"\b/");
   CityParser=prxparse(CityPattern);
   StateParser=prxparse(StatePattern);
   CityMatch=ifn(missing(Residence)=missing(city)=0,prxmatch(CityParser,compbl(Residence)),0);
   StateMatch=ifn(missing(Residence)=missing(statecode)=0,prxmatch(StateParser,compbl(Residence)),0);
end;
if anydigit(Residence) then do;
   ZipPattern=cats("/\b",put(zip,z5.),"\b/");
   ZipParser=prxparse(ZipPattern);
   ZipMatch=ifn(missing(Residence)=missing(zip)=0,prxmatch(ZipParser,compbl(Residence)),0);
end;
if missing(CityMatch) then CityMatch=0;
if missing(StateMatch) then StateMatch=0;
if missing(ZipMatch) then ZipMatch=0;
run;

ballardw · Posted 08-01-2020 10:04 PM

@Mike_B wrote:

@ballardw, thank you for the help. I used your suggestion of FINDW for state (I think I'm using the optional arguments correctly to treat anything that isn't a letter as a word boundary?). I couldn't select only on zip code though, because the zip code sometimes doesn't match the city and state, and the data step that follows what I included here ranks records based on best match over all three variables. Also, thanks for the tip about EXCLUDE in PROC SQL. I've never seen that before and have since used it in another program.

You may want to learn about the functions ZIPCITY, which will return the name of a city and two letter state code, ZIPNAME, which returns upper case state or territory name, ZIPNAMEL, mixed case state or territory and ZIPSTATE, two character state code instead of a whole bunch of hashing.

Just because the ZIP doesn't match the city you are assuming the Zip is wrong? My experience with poor data entry is that about 30% of the time the City is wrong if the state matches the Zip. There is also an obnoxious behavior in some rural areas that the nearest post office is in a different state than the residence.

Poor data is just that, poor. And think of tasks like this as "job security". That's the only way I get through some of the junk I have. I have had "addresses" that included the pithy "See the mother", and "watch out for dogs" as part of the street address that I was expected to geocode for mapping.

View solution in original post

ballardw · Posted 07-22-2020 12:07 PM

If I were doing something along those lines I would fuzzy match on zipcode only. Identify the records without zipcode and then match them on one or more of the remaining bits starting with the city and state (both) in some form. Then those that didn't match go into the garbage hopper which would need more work.

That should reduce the number of records a whole bunch.

proc sql;
   create table ZIPMATCH as
   select distinct city
   	       ,statecode
   	       ,zip
                  ,rp.Residence
   from sashelp.zipcode right join Residences as rp
   on find(compress(Residence),cats(put(zip,z5.)),'it')
   where not missing (zip)
   order by Residence
   ;
quit;

Then create a set from Residence that removes any of the records in ZIPMATCH. Look at the SQL Except such as

proc sql;
   create table notzip as
   select Residence from  Residences
   except
   select Residence from Zipmatch
   ;
quit;

Looking for matches of City and State, or statecode I strongly suggest using FINDW instead of find as you'll get the AL statecode in the City value of Alhambra and lots of similar.

I would even be tempted to only use the state code first.

Similar, remove the city and state matches.

Then a city match only, at this point I would add a text box to manually mark a preferred "match" as you may get multiple. (VIEWTABLE lib.dsname openmode=edit; ) in the command box or similar.

Mike_B · Posted 08-01-2020 12:37 PM

@ballardw, thank you for the help. I used your suggestion of FINDW for state (I think I'm using the optional arguments correctly to treat anything that isn't a letter as a word boundary?). I couldn't select only on zip code though, because the zip code sometimes doesn't match the city and state, and the data step that follows what I included here ranks records based on best match over all three variables. Also, thanks for the tip about EXCLUDE in PROC SQL. I've never seen that before and have since used it in another program.

I ended up using a hash, which I've never done before, but (after many versions) it seems to be working.

data Residences;
length Residence $50;
do Residence=
"MIAMI GARDENS/FL/33055",
"SARASOTA FL/ZIP)",
"TITUSVILLE, FL)",
"ADJUNTAS PRTE/ZIP)",
"OPA LOCKA/TARPON SPRINGS",
"SAINT PETERSBURG/FL/33710",
"DELAND,FL 32720)",
"DORAL, FL 33172)",
"MIAMI FL 33168P)",
"ADEL, GA 31620)",
"OSTRICA, LA 70041)",
"KINGSLAND, GA )",
"MILTON FL 32570)",
"LARGO,FL 32773 )",
"FORT PIERCE/FL/34945",
"+IAMI,FL",
"HOMELESS (850)",
"SABASTIAN, FLZIP)",
"PLANT CITY/TAMPA AREA",
"BUSHNELL, FZIP)"
;
output;
end;
run;

proc sql;
create table ParseThese as 
select distinct find(compress(Residence),compress(city),'i') as CityFound
,case when find(compress(Residence),compress(stname(statecode)),'i') then 1
      when findw(Residence,statecode,'z','aki') then 2
      else 0
end as StateFound
,find(compress(Residence),cats(put(zip,z5.))) as ZipFound
,rp.Residence
,city
,statecode
,zip
from sashelp.zipcode,Residences as rp
where findW(Residence,statecode,'z','aki')
      or find(compress(Residence),compress(stname(statecode)),'i')
      or find(compress(Residence),compress(city),'i')
      or find(compress(Residence),cats(put(zip,z5.)))
order by Residence,city,statecode,zip,CityFound,StateFound,ZipFound
;
quit;

data Parsed;
length Residence $50 CITY $35 STATECODE $2 ZIP 8 CityFound StateFound ZipFound 8 CityPattern StatePattern ZipPattern $250 CityParser StateParser ZipParser 8 CityMatch StateMatch ZipMatch 8;

if _n_ = 1 then do;
declare hash myhash(dataset:'ParseThese', multidata: 'y', ordered: 'y');
rc = myhash.DEFINEKEY('Residence','City','StateCode','ZIP');
rc = myhash.DEFINEDATA('Residence','CityFound','StateFound','ZipFound','City','StateCode','ZIP');
rc = myhash.DEFINEDONE();
CALL MISSING(Residence,CityFound,StateFound,ZipFound,City,StateCode,ZIP,
CityPattern,CityParser,CityMatch,StatePattern,StateParser,StateMatch,ZipPattern,ZipParser,ZipMatch);
DECLARE hiter myiter('myhash');
end;

rc = myiter.first();
do while (rc eq 0);
if CityFound then do;
   CityPattern=cats("/\b",upcase(city),"\b/");
   CityParser=prxparse(CityPattern);
   CityMatch=ifn(missing(Residence)=missing(city)=0,prxmatch(CityParser,compbl(Residence)),0);
   call prxfree(CityParser);
end;
else call missing(CityPattern,CityParser,CityMatch);
if StateFound then do;
   StatePattern=cats("/\b",upcase(statecode),"\b|\b",stname(statecode),"\b/");
   StateParser=prxparse(StatePattern);
   StateMatch=ifn(missing(Residence)=missing(statecode)=0,prxmatch(StateParser,compbl(Residence)),0);
   call prxfree(StateParser);
end;
else call missing(StatePattern,StateParser,StateMatch);
if ZipFound then do;
   ZipPattern=cats("/\b",put(zip,z5.),"\b/");
   ZipParser=prxparse(ZipPattern);
   ZipMatch=ifn(missing(Residence)=missing(zip)=0,prxmatch(ZipParser,compbl(Residence)),0);
   call prxfree(ZipParser);
end;
else call missing(ZipPattern,ZipParser,ZipMatch);
if sum(CityMatch,StateMatch,ZipMatch) then output;
rc = myiter.next();
end;
run;

ballardw · Posted 08-01-2020 10:04 PM

@Mike_B wrote:

@ballardw, thank you for the help. I used your suggestion of FINDW for state (I think I'm using the optional arguments correctly to treat anything that isn't a letter as a word boundary?). I couldn't select only on zip code though, because the zip code sometimes doesn't match the city and state, and the data step that follows what I included here ranks records based on best match over all three variables. Also, thanks for the tip about EXCLUDE in PROC SQL. I've never seen that before and have since used it in another program.

You may want to learn about the functions ZIPCITY, which will return the name of a city and two letter state code, ZIPNAME, which returns upper case state or territory name, ZIPNAMEL, mixed case state or territory and ZIPSTATE, two character state code instead of a whole bunch of hashing.

Just because the ZIP doesn't match the city you are assuming the Zip is wrong? My experience with poor data entry is that about 30% of the time the City is wrong if the state matches the Zip. There is also an obnoxious behavior in some rural areas that the nearest post office is in a different state than the residence.

Poor data is just that, poor. And think of tasks like this as "job security". That's the only way I get through some of the junk I have. I have had "addresses" that included the pithy "See the mother", and "watch out for dogs" as part of the street address that I was expected to geocode for mapping.

Mike_B · Posted 07-09-2021 12:31 PM

Thanks for the suggestions. I just got back to this project and wrapped it up and ended up using zipcity and zipstate as you suggested.

Patrick · Posted 08-01-2020 08:13 PM

If you run below code...

proc setinit;
run;

...do you see in the SAS log anywhere?

DataFlux

DataFlux is the SAS DQ module which would allow you to tokenize and standardize address strings which would make your job much easier.

If you haven't got this SAS module and given your sample data then also consider to use the services of a specialized external provider to clean-up and standardize your address data.

Mike_B · Posted 07-09-2021 11:58 AM

Thank you for the suggestion. Unfortunately, we do not have DataFlux. I just got back to this project a couple weeks ago and was able to complete it by processing the set 1 million records at a time in a do-loop.

How can I speed up address parsing that is taking too long?

Re: How can I speed up address parsing that is taking too long with regular expressions?

Re: How can I speed up address parsing that is taking too long?

Re: How can I speed up address parsing that is taking too long with regular expressions?

Re: How can I speed up address parsing that is taking too long with regular expressions?

Re: How can I speed up address parsing that is taking too long with regular expressions?

Re: How can I speed up address parsing that is taking too long?

Re: How can I speed up address parsing that is taking too long?

Classroom Training Available!