BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Mike_B
Obsidian | Level 7

I have a data set of 3k observations with a "city, state, zip" free-text variable named Residence which I need to break out into three separate columns. Because many of the addresses are garbage, I've resorted to joining it to sashelp.zipcode on a fuzzy match and then using regular expressions to identify city, state and zip matches. The join creates a data set of 9,866,133 records. Parsing the set takes days. What makes things worse is my set of 3k records is just a small subset of the set I will eventually need to parse monthly. Does anyone have any tips on how to speed things up? I'm using 64-bit base SAS 9.4M6, no fancy data-mining packages. I've tried running the program on my machine versus server SAS, and it takes a million years either way.

 

data Residences;
length Residence $50;
do Residence=
"MIAMI GARDENS/FL/33055",
"SARASOTA FL/ZIP)",
"TITUSVILLE, FL)",
"ADJUNTAS PRTE/ZIP)",
"OPA LOCKA/TARPON SPRINGS",
"SAINT PETERSBURG/FL/33710",
"DELAND,FL 32720)",
"DORAL, FL 33172)",
"MIAMI FL 33168P)",
"ADEL, GA 31620)",
"OSTRICA, LA 70041)",
"KINGSLAND, GA )",
"MILTON FL 32570)",
"LARGO,FL 32773 )",
"FORT PIERCE/FL/34945",
"+IAMI,FL",
"HOMELESS (850)",
"SABASTIAN, FLZIP)",
"PLANT CITY/TAMPA AREA",
"BUSHNELL, FZIP)"
;
output;
end;
run;

proc sql;
create table ParseThese as
select distinct city
	       ,statecode
	       ,zip
               ,rp.Residence
from sashelp.zipcode right join Residences as rp
on find(compress(Residence),statecode,'it')
or find(compress(Residence),stname(statecode),'it')
or find(compress(Residence),compress(city),'it')
or find(compress(Residence),cats(put(zip,z5.)),'it')
order by Residence;
quit;

*parse address;
data Parsed;
set ParseThese;
if anyalpha(Residence) then do;
   CityPattern=cats("/\b",upcase(city),"\b/");
   StatePattern=cats("/\b",upcase(statecode),"\b|\b",stname(statecode),"\b/");
   CityParser=prxparse(CityPattern);
   StateParser=prxparse(StatePattern);
   CityMatch=ifn(missing(Residence)=missing(city)=0,prxmatch(CityParser,compbl(Residence)),0);
   StateMatch=ifn(missing(Residence)=missing(statecode)=0,prxmatch(StateParser,compbl(Residence)),0);
end;
if anydigit(Residence) then do;
   ZipPattern=cats("/\b",put(zip,z5.),"\b/");
   ZipParser=prxparse(ZipPattern);
   ZipMatch=ifn(missing(Residence)=missing(zip)=0,prxmatch(ZipParser,compbl(Residence)),0);
end;
if missing(CityMatch) then CityMatch=0;
if missing(StateMatch) then StateMatch=0;
if missing(ZipMatch) then ZipMatch=0;
run;
1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

@Mike_B wrote:

@ballardw, thank you for the help. I used your suggestion of FINDW for state (I think I'm using the optional arguments correctly to treat anything that isn't a letter as a word boundary?). I couldn't select only on zip code though, because the zip code sometimes doesn't match the city and state, and the data step that follows what I included here ranks records based on best match over all three variables. Also, thanks for the tip about EXCLUDE in PROC SQL. I've never seen that before and have since used it in another program.

 

 


You may want to learn about the functions ZIPCITY, which will return the name of a city and two letter state code, ZIPNAME, which returns upper case state or territory name, ZIPNAMEL, mixed case state or territory and ZIPSTATE, two character state code instead of a whole bunch of hashing.

 

Just because the ZIP doesn't match the city you are assuming the Zip is wrong? My experience with poor data entry is that about 30% of the time the City is wrong if the state matches the Zip. There is also an obnoxious behavior in some rural areas that the nearest post office is in a different state than the residence.

 

Poor data is just that, poor. And think of tasks like this as "job security". That's the only way I get through some of the junk I have. I have had "addresses" that included the pithy "See the mother", and "watch out for dogs" as part of the street address that I was expected to geocode for mapping.

 

View solution in original post

6 REPLIES 6
ballardw
Super User

If I were doing something along those lines I would fuzzy match on zipcode only. Identify the records without zipcode and then match them on one or more of the remaining bits starting with the city and state (both) in some form. Then those that didn't match go into the garbage hopper which would need more work.

That should reduce the number of records a whole bunch.

 

proc sql;
   create table ZIPMATCH as
   select distinct city
   	       ,statecode
   	       ,zip
                  ,rp.Residence
   from sashelp.zipcode right join Residences as rp
   on find(compress(Residence),cats(put(zip,z5.)),'it')
   where not missing (zip)
   order by Residence
   ;
quit;

Then create a set from Residence that removes any of the records in ZIPMATCH. Look at the SQL Except such as

proc sql;
   create table notzip as
   select Residence from  Residences
   except
   select Residence from Zipmatch
   ;
quit;

Looking for matches of City and State, or statecode I strongly suggest using FINDW instead of find as you'll get the AL statecode in the City value of Alhambra and lots of similar.

I would even be tempted to only use the state code first.

 

Similar, remove the city and state matches.

 

Then a city match only, at this point I would add a text box to manually mark a preferred "match" as you may get multiple. (VIEWTABLE lib.dsname openmode=edit; ) in the command box or similar.

Mike_B
Obsidian | Level 7

@ballardw, thank you for the help. I used your suggestion of FINDW for state (I think I'm using the optional arguments correctly to treat anything that isn't a letter as a word boundary?). I couldn't select only on zip code though, because the zip code sometimes doesn't match the city and state, and the data step that follows what I included here ranks records based on best match over all three variables. Also, thanks for the tip about EXCLUDE in PROC SQL. I've never seen that before and have since used it in another program.

 

I ended up using a hash, which I've never done before, but (after many versions) it seems to be working.

 

data Residences;
length Residence $50;
do Residence=
"MIAMI GARDENS/FL/33055",
"SARASOTA FL/ZIP)",
"TITUSVILLE, FL)",
"ADJUNTAS PRTE/ZIP)",
"OPA LOCKA/TARPON SPRINGS",
"SAINT PETERSBURG/FL/33710",
"DELAND,FL 32720)",
"DORAL, FL 33172)",
"MIAMI FL 33168P)",
"ADEL, GA 31620)",
"OSTRICA, LA 70041)",
"KINGSLAND, GA )",
"MILTON FL 32570)",
"LARGO,FL 32773 )",
"FORT PIERCE/FL/34945",
"+IAMI,FL",
"HOMELESS (850)",
"SABASTIAN, FLZIP)",
"PLANT CITY/TAMPA AREA",
"BUSHNELL, FZIP)"
;
output;
end;
run;

proc sql;
create table ParseThese as 
select distinct find(compress(Residence),compress(city),'i') as CityFound
,case when find(compress(Residence),compress(stname(statecode)),'i') then 1
      when findw(Residence,statecode,'z','aki') then 2
      else 0
end as StateFound
,find(compress(Residence),cats(put(zip,z5.))) as ZipFound
,rp.Residence
,city
,statecode
,zip
from sashelp.zipcode,Residences as rp
where findW(Residence,statecode,'z','aki')
      or find(compress(Residence),compress(stname(statecode)),'i')
      or find(compress(Residence),compress(city),'i')
      or find(compress(Residence),cats(put(zip,z5.)))
order by Residence,city,statecode,zip,CityFound,StateFound,ZipFound
;
quit;

data Parsed;
length Residence $50 CITY $35 STATECODE $2 ZIP 8 CityFound StateFound ZipFound 8 CityPattern StatePattern ZipPattern $250 CityParser StateParser ZipParser 8 CityMatch StateMatch ZipMatch 8;

if _n_ = 1 then do;
declare hash myhash(dataset:'ParseThese', multidata: 'y', ordered: 'y');
rc = myhash.DEFINEKEY('Residence','City','StateCode','ZIP');
rc = myhash.DEFINEDATA('Residence','CityFound','StateFound','ZipFound','City','StateCode','ZIP');
rc = myhash.DEFINEDONE();
CALL MISSING(Residence,CityFound,StateFound,ZipFound,City,StateCode,ZIP,
CityPattern,CityParser,CityMatch,StatePattern,StateParser,StateMatch,ZipPattern,ZipParser,ZipMatch);
DECLARE hiter myiter('myhash');
end;

rc = myiter.first();
do while (rc eq 0);
if CityFound then do;
   CityPattern=cats("/\b",upcase(city),"\b/");
   CityParser=prxparse(CityPattern);
   CityMatch=ifn(missing(Residence)=missing(city)=0,prxmatch(CityParser,compbl(Residence)),0);
   call prxfree(CityParser);
end;
else call missing(CityPattern,CityParser,CityMatch);
if StateFound then do;
   StatePattern=cats("/\b",upcase(statecode),"\b|\b",stname(statecode),"\b/");
   StateParser=prxparse(StatePattern);
   StateMatch=ifn(missing(Residence)=missing(statecode)=0,prxmatch(StateParser,compbl(Residence)),0);
   call prxfree(StateParser);
end;
else call missing(StatePattern,StateParser,StateMatch);
if ZipFound then do;
   ZipPattern=cats("/\b",put(zip,z5.),"\b/");
   ZipParser=prxparse(ZipPattern);
   ZipMatch=ifn(missing(Residence)=missing(zip)=0,prxmatch(ZipParser,compbl(Residence)),0);
   call prxfree(ZipParser);
end;
else call missing(ZipPattern,ZipParser,ZipMatch);
if sum(CityMatch,StateMatch,ZipMatch) then output;
rc = myiter.next();
end;
run;
ballardw
Super User

@Mike_B wrote:

@ballardw, thank you for the help. I used your suggestion of FINDW for state (I think I'm using the optional arguments correctly to treat anything that isn't a letter as a word boundary?). I couldn't select only on zip code though, because the zip code sometimes doesn't match the city and state, and the data step that follows what I included here ranks records based on best match over all three variables. Also, thanks for the tip about EXCLUDE in PROC SQL. I've never seen that before and have since used it in another program.

 

 


You may want to learn about the functions ZIPCITY, which will return the name of a city and two letter state code, ZIPNAME, which returns upper case state or territory name, ZIPNAMEL, mixed case state or territory and ZIPSTATE, two character state code instead of a whole bunch of hashing.

 

Just because the ZIP doesn't match the city you are assuming the Zip is wrong? My experience with poor data entry is that about 30% of the time the City is wrong if the state matches the Zip. There is also an obnoxious behavior in some rural areas that the nearest post office is in a different state than the residence.

 

Poor data is just that, poor. And think of tasks like this as "job security". That's the only way I get through some of the junk I have. I have had "addresses" that included the pithy "See the mother", and "watch out for dogs" as part of the street address that I was expected to geocode for mapping.

 

Mike_B
Obsidian | Level 7

Thanks for the suggestions. I just got back to this project and wrapped it up and ended up using zipcity and zipstate as you suggested.

Patrick
Opal | Level 21

If you run below code...

proc setinit;
run;

...do you see in the SAS log anywhere?

DataFlux

DataFlux is the SAS DQ module which would allow you to tokenize and standardize address strings which would make your job much easier.

 

If you haven't got this SAS module and given your sample data then also consider to use the services of a specialized external provider to clean-up and standardize your address data.

Mike_B
Obsidian | Level 7

Thank you for the suggestion. Unfortunately, we do not have DataFlux. I just got back to this project a couple weeks ago and was able to complete it by processing the set 1 million records at a time in a do-loop.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 786 views
  • 3 likes
  • 3 in conversation