I have a data set of 3k observations with a "city, state, zip" free-text variable named Residence which I need to break out into three separate columns. Because many of the addresses are garbage, I've resorted to joining it to sashelp.zipcode on a fuzzy match and then using regular expressions to identify city, state and zip matches. The join creates a data set of 9,866,133 records. Parsing the set takes days. What makes things worse is my set of 3k records is just a small subset of the set I will eventually need to parse monthly. Does anyone have any tips on how to speed things up? I'm using 64-bit base SAS 9.4M6, no fancy data-mining packages. I've tried running the program on my machine versus server SAS, and it takes a million years either way.
data Residences; length Residence $50; do Residence= "MIAMI GARDENS/FL/33055", "SARASOTA FL/ZIP)", "TITUSVILLE, FL)", "ADJUNTAS PRTE/ZIP)", "OPA LOCKA/TARPON SPRINGS", "SAINT PETERSBURG/FL/33710", "DELAND,FL 32720)", "DORAL, FL 33172)", "MIAMI FL 33168P)", "ADEL, GA 31620)", "OSTRICA, LA 70041)", "KINGSLAND, GA )", "MILTON FL 32570)", "LARGO,FL 32773 )", "FORT PIERCE/FL/34945", "+IAMI,FL", "HOMELESS (850)", "SABASTIAN, FLZIP)", "PLANT CITY/TAMPA AREA", "BUSHNELL, FZIP)" ; output; end; run; proc sql; create table ParseThese as select distinct city ,statecode ,zip ,rp.Residence from sashelp.zipcode right join Residences as rp on find(compress(Residence),statecode,'it') or find(compress(Residence),stname(statecode),'it') or find(compress(Residence),compress(city),'it') or find(compress(Residence),cats(put(zip,z5.)),'it') order by Residence; quit; *parse address; data Parsed; set ParseThese; if anyalpha(Residence) then do; CityPattern=cats("/\b",upcase(city),"\b/"); StatePattern=cats("/\b",upcase(statecode),"\b|\b",stname(statecode),"\b/"); CityParser=prxparse(CityPattern); StateParser=prxparse(StatePattern); CityMatch=ifn(missing(Residence)=missing(city)=0,prxmatch(CityParser,compbl(Residence)),0); StateMatch=ifn(missing(Residence)=missing(statecode)=0,prxmatch(StateParser,compbl(Residence)),0); end; if anydigit(Residence) then do; ZipPattern=cats("/\b",put(zip,z5.),"\b/"); ZipParser=prxparse(ZipPattern); ZipMatch=ifn(missing(Residence)=missing(zip)=0,prxmatch(ZipParser,compbl(Residence)),0); end; if missing(CityMatch) then CityMatch=0; if missing(StateMatch) then StateMatch=0; if missing(ZipMatch) then ZipMatch=0; run;
@Mike_B wrote:
@ballardw, thank you for the help. I used your suggestion of FINDW for state (I think I'm using the optional arguments correctly to treat anything that isn't a letter as a word boundary?). I couldn't select only on zip code though, because the zip code sometimes doesn't match the city and state, and the data step that follows what I included here ranks records based on best match over all three variables. Also, thanks for the tip about EXCLUDE in PROC SQL. I've never seen that before and have since used it in another program.
You may want to learn about the functions ZIPCITY, which will return the name of a city and two letter state code, ZIPNAME, which returns upper case state or territory name, ZIPNAMEL, mixed case state or territory and ZIPSTATE, two character state code instead of a whole bunch of hashing.
Just because the ZIP doesn't match the city you are assuming the Zip is wrong? My experience with poor data entry is that about 30% of the time the City is wrong if the state matches the Zip. There is also an obnoxious behavior in some rural areas that the nearest post office is in a different state than the residence.
Poor data is just that, poor. And think of tasks like this as "job security". That's the only way I get through some of the junk I have. I have had "addresses" that included the pithy "See the mother", and "watch out for dogs" as part of the street address that I was expected to geocode for mapping.
If I were doing something along those lines I would fuzzy match on zipcode only. Identify the records without zipcode and then match them on one or more of the remaining bits starting with the city and state (both) in some form. Then those that didn't match go into the garbage hopper which would need more work.
That should reduce the number of records a whole bunch.
proc sql; create table ZIPMATCH as select distinct city ,statecode ,zip ,rp.Residence from sashelp.zipcode right join Residences as rp on find(compress(Residence),cats(put(zip,z5.)),'it') where not missing (zip) order by Residence ; quit;
Then create a set from Residence that removes any of the records in ZIPMATCH. Look at the SQL Except such as
proc sql; create table notzip as select Residence from Residences except select Residence from Zipmatch ; quit;
Looking for matches of City and State, or statecode I strongly suggest using FINDW instead of find as you'll get the AL statecode in the City value of Alhambra and lots of similar.
I would even be tempted to only use the state code first.
Similar, remove the city and state matches.
Then a city match only, at this point I would add a text box to manually mark a preferred "match" as you may get multiple. (VIEWTABLE lib.dsname openmode=edit; ) in the command box or similar.
@ballardw, thank you for the help. I used your suggestion of FINDW for state (I think I'm using the optional arguments correctly to treat anything that isn't a letter as a word boundary?). I couldn't select only on zip code though, because the zip code sometimes doesn't match the city and state, and the data step that follows what I included here ranks records based on best match over all three variables. Also, thanks for the tip about EXCLUDE in PROC SQL. I've never seen that before and have since used it in another program.
I ended up using a hash, which I've never done before, but (after many versions) it seems to be working.
data Residences; length Residence $50; do Residence= "MIAMI GARDENS/FL/33055", "SARASOTA FL/ZIP)", "TITUSVILLE, FL)", "ADJUNTAS PRTE/ZIP)", "OPA LOCKA/TARPON SPRINGS", "SAINT PETERSBURG/FL/33710", "DELAND,FL 32720)", "DORAL, FL 33172)", "MIAMI FL 33168P)", "ADEL, GA 31620)", "OSTRICA, LA 70041)", "KINGSLAND, GA )", "MILTON FL 32570)", "LARGO,FL 32773 )", "FORT PIERCE/FL/34945", "+IAMI,FL", "HOMELESS (850)", "SABASTIAN, FLZIP)", "PLANT CITY/TAMPA AREA", "BUSHNELL, FZIP)" ; output; end; run; proc sql; create table ParseThese as select distinct find(compress(Residence),compress(city),'i') as CityFound ,case when find(compress(Residence),compress(stname(statecode)),'i') then 1 when findw(Residence,statecode,'z','aki') then 2 else 0 end as StateFound ,find(compress(Residence),cats(put(zip,z5.))) as ZipFound ,rp.Residence ,city ,statecode ,zip from sashelp.zipcode,Residences as rp where findW(Residence,statecode,'z','aki') or find(compress(Residence),compress(stname(statecode)),'i') or find(compress(Residence),compress(city),'i') or find(compress(Residence),cats(put(zip,z5.))) order by Residence,city,statecode,zip,CityFound,StateFound,ZipFound ; quit; data Parsed; length Residence $50 CITY $35 STATECODE $2 ZIP 8 CityFound StateFound ZipFound 8 CityPattern StatePattern ZipPattern $250 CityParser StateParser ZipParser 8 CityMatch StateMatch ZipMatch 8; if _n_ = 1 then do; declare hash myhash(dataset:'ParseThese', multidata: 'y', ordered: 'y'); rc = myhash.DEFINEKEY('Residence','City','StateCode','ZIP'); rc = myhash.DEFINEDATA('Residence','CityFound','StateFound','ZipFound','City','StateCode','ZIP'); rc = myhash.DEFINEDONE(); CALL MISSING(Residence,CityFound,StateFound,ZipFound,City,StateCode,ZIP, CityPattern,CityParser,CityMatch,StatePattern,StateParser,StateMatch,ZipPattern,ZipParser,ZipMatch); DECLARE hiter myiter('myhash'); end; rc = myiter.first(); do while (rc eq 0); if CityFound then do; CityPattern=cats("/\b",upcase(city),"\b/"); CityParser=prxparse(CityPattern); CityMatch=ifn(missing(Residence)=missing(city)=0,prxmatch(CityParser,compbl(Residence)),0); call prxfree(CityParser); end; else call missing(CityPattern,CityParser,CityMatch); if StateFound then do; StatePattern=cats("/\b",upcase(statecode),"\b|\b",stname(statecode),"\b/"); StateParser=prxparse(StatePattern); StateMatch=ifn(missing(Residence)=missing(statecode)=0,prxmatch(StateParser,compbl(Residence)),0); call prxfree(StateParser); end; else call missing(StatePattern,StateParser,StateMatch); if ZipFound then do; ZipPattern=cats("/\b",put(zip,z5.),"\b/"); ZipParser=prxparse(ZipPattern); ZipMatch=ifn(missing(Residence)=missing(zip)=0,prxmatch(ZipParser,compbl(Residence)),0); call prxfree(ZipParser); end; else call missing(ZipPattern,ZipParser,ZipMatch); if sum(CityMatch,StateMatch,ZipMatch) then output; rc = myiter.next(); end; run;
@Mike_B wrote:
@ballardw, thank you for the help. I used your suggestion of FINDW for state (I think I'm using the optional arguments correctly to treat anything that isn't a letter as a word boundary?). I couldn't select only on zip code though, because the zip code sometimes doesn't match the city and state, and the data step that follows what I included here ranks records based on best match over all three variables. Also, thanks for the tip about EXCLUDE in PROC SQL. I've never seen that before and have since used it in another program.
You may want to learn about the functions ZIPCITY, which will return the name of a city and two letter state code, ZIPNAME, which returns upper case state or territory name, ZIPNAMEL, mixed case state or territory and ZIPSTATE, two character state code instead of a whole bunch of hashing.
Just because the ZIP doesn't match the city you are assuming the Zip is wrong? My experience with poor data entry is that about 30% of the time the City is wrong if the state matches the Zip. There is also an obnoxious behavior in some rural areas that the nearest post office is in a different state than the residence.
Poor data is just that, poor. And think of tasks like this as "job security". That's the only way I get through some of the junk I have. I have had "addresses" that included the pithy "See the mother", and "watch out for dogs" as part of the street address that I was expected to geocode for mapping.
Thanks for the suggestions. I just got back to this project and wrapped it up and ended up using zipcity and zipstate as you suggested.
If you run below code...
proc setinit;
run;
...do you see in the SAS log anywhere?
DataFlux
DataFlux is the SAS DQ module which would allow you to tokenize and standardize address strings which would make your job much easier.
If you haven't got this SAS module and given your sample data then also consider to use the services of a specialized external provider to clean-up and standardize your address data.
Thank you for the suggestion. Unfortunately, we do not have DataFlux. I just got back to this project a couple weeks ago and was able to complete it by processing the set 1 million records at a time in a do-loop.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.