Hi all! I am in the process in completing physical address cleanup. One of the issues I want to clean is removing the city name in the address line. I have created mock data, via datalines, to illustrate example data in a table called "Address_Mock". I have a table called "ZipCodes_Clean" which contains all U.S. states and territories zip codes with all the possible cities attributed to those zip codes in columns named City1 to City130. The following is an example from the table:
Zipcode State Country WorldRegion City1 City2 City3 City4 City5 City6 City7 City8 City9 City10
33172 FL US NA MIAMI DORAL DORAL BRANCH SWEETWATER WEST MIAMI
I am using two SAS functions (FINDW and TRANWRD) in a conditional IF-THEN statement to locate if the city is in the address line, create new column containing the address line with the city replaced by blank space. However, the final result has been blank result in the new column.
Below is the mock code of how I am currently trying to resolve the issue:
data Address_Mock;
infile datalines dsd;
input Name: $8. Address_Line: $26. ZipCode ;
datalines;
MICHELLE,"10597 NW 7 TER MIAMI",33172
NICOLE,"10597 NW 7 TER DORAL",33172
MARIA,"10597 NW 7 TER WEST MIAMI",33172
ANTONIO,"10597 NW 7 TER SWEETWATER",33172
;
run;
proc sql;
create table Address_Join as
select t1.*, t2.*
from Address_Mock as t1 left join ZipCodes_Clean as t2 on t1.ZipCode=t2.ZipCode;
quit;
data Address_Loop;
set Address_Join;
array cols {*} city1-city130 ;
do i=1 to 130;
if FINDW(Address_Line,cols{i}) then New_Address_Line=TRANWRD(Address_Line,cols{i},"");
end;
run;
I'll be very appreciative if anyone sees the error in the code or I'm open to trying alternative code. Thanks in advance!
I would propose a few changes:
/* City names must be ordered by decreasing length (i.e. WEST MIAMI before MIAMI) */
data zip;
infile datalines truncover;
input Zipcode (State Country WorldRegion City1 City2 City3 City4 City5 City6 City7 City8 City9 City10) (&:$20.);
datalines;
33172 FL US NA WEST MIAMI MIAMI DORAL BRANCH DORAL SWEETWATER
;
data Addr;
infile datalines dsd;
input Name: $8. Line: $26. ZipCode ;
datalines;
MICHELLE,"10597 NW 7 TER MIAMI",33172
NICOLE,"10597 NW 7 TER DORAL",33172
MARIA,"10597 NW 7 TER WEST MIAMI",33172
ANTONIO,"10597 NW 7 TER SWEETWATER",33172
;
proc sql;
create table A as
select *
from Addr as t1 left join Zip as t2 on t1.ZipCode=t2.ZipCode;
quit;
/* Remove city names at the end of address lines */
data B;
set A;
array cols {*} city1-city130 ;
do i=1 to dim(cols) while(not missing(cols{i}));
pos = length(line)-length(cols{i})+1;
if pos > 2 then
if notalpha(substr(line,pos-1,1)) then
if substr(line, pos) = cols{i} then do;
line = substr(line,1,pos-2);
leave;
end;
end;
drop pos city: i;
run;
You might speed things up a bit with
do i = 1 to 130 while(not missing(cols{i}));
I would propose a few changes:
/* City names must be ordered by decreasing length (i.e. WEST MIAMI before MIAMI) */
data zip;
infile datalines truncover;
input Zipcode (State Country WorldRegion City1 City2 City3 City4 City5 City6 City7 City8 City9 City10) (&:$20.);
datalines;
33172 FL US NA WEST MIAMI MIAMI DORAL BRANCH DORAL SWEETWATER
;
data Addr;
infile datalines dsd;
input Name: $8. Line: $26. ZipCode ;
datalines;
MICHELLE,"10597 NW 7 TER MIAMI",33172
NICOLE,"10597 NW 7 TER DORAL",33172
MARIA,"10597 NW 7 TER WEST MIAMI",33172
ANTONIO,"10597 NW 7 TER SWEETWATER",33172
;
proc sql;
create table A as
select *
from Addr as t1 left join Zip as t2 on t1.ZipCode=t2.ZipCode;
quit;
/* Remove city names at the end of address lines */
data B;
set A;
array cols {*} city1-city130 ;
do i=1 to dim(cols) while(not missing(cols{i}));
pos = length(line)-length(cols{i})+1;
if pos > 2 then
if notalpha(substr(line,pos-1,1)) then
if substr(line, pos) = cols{i} then do;
line = substr(line,1,pos-2);
leave;
end;
end;
drop pos city: i;
run;
Great modifications to the code! I was able to get the desired outcome. Thanks for noting to organize the city names by descending character length; it is definitely required in order to use the positional approach to remove the city names.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.