Hi Forum,
I'm cleaning about 100,000 addresses and the address variable contains the house number, street direction, street, street type, and unit numbers. I only need the house number, street direction, street, and street type with one space and without periods. I do not need any 1/2, 1/4, or 3/4 with the housing number.
Here's an example of the type of addresses I have:
12345 1/2 SAS ST APT C
111 COMPUTER GREEN LN APT1
2222 N. N PROCEDURE RD #2
2255 N N PROCEDURE RD # 3
333 W W. MACRO DR UNIT4
44444 1/4 SEMICOLON AVE STE 5 STE 5
5555 FREQ BLVD 57
There are leading and trailing blanks and multiple spaces .
This is the output I want:
12345 SAS ST
111 COMPUTER GREEN LN
2222 N PROCEDURE RD
2255 N PROCEDURE RD
333 W MACRO DR
44444 SEMICOLON AVE
5555 FREQ BLVD
How do I do this? Thanks you very much!
You may want to look at a SAS product called dataflux. Short of that, you would have to identify and correct for all exceptions. e.g., just given your example addresses, here is one set of code that would correct them:
data want (keep=address);
informat address $80.;
input @;
_infile_=translate(_infile_," ",".");
_infile_=compbl(_infile_);
input address &;
number=scan(address,1);
call scan(address,2,position,length);
rest=substr(address,position);
if find(rest,"N N ") then rest=substr(rest,find(rest,"N N ")+2);
else if find(rest,"E E ") then rest=substr(rest,find(rest,"E E ")+2);
else if find(rest,"S S ") then rest=substr(rest,find(rest,"S S ")+2);
else if find(rest,"W W ") then rest=substr(rest,find(rest,"W W ")+2);
if find(rest,"1/2") then rest=substr(rest,find(rest,"1/2")+4);
else if find(rest,"1/3") then rest=substr(rest,find(rest,"1/3")+4);
else if find(rest,"1/4") then rest=substr(rest,find(rest,"1/4")+4);
else if find(rest,"2/3") then rest=substr(rest,find(rest,"2/3")+4);
else if find(rest,"3/4") then rest=substr(rest,find(rest,"3/4")+4);
if find(rest," ST ") then rest=substr(rest,1,find(rest," ST ")+2);
else if find(rest," RD ") then rest=substr(rest,1,find(rest," RD ")+2);
else if find(rest," LN ") then rest=substr(rest,1,find(rest," LN ")+2);
else if find(rest," DR ") then rest=substr(rest,1,find(rest," DR ")+2);
else if find(rest," AVE ") then rest=substr(rest,1,find(rest," AVE ")+3);
else if find(rest," BLVD ") then rest=substr(rest,1,find(rest," BLVD ")+4);
address=catx(" ",number,rest);
cards;
12345 1/2 SAS ST APT C
111 COMPUTER GREEN LN APT1
2222 N. N PROCEDURE RD #2
2255 N N PROCEDURE RD # 3
333 W W. MACRO DR UNIT4
44444 1/4 SEMICOLON AVE STE 5 STE 5
5555 FREQ BLVD 57
;
run;
You may want to look at a SAS product called dataflux. Short of that, you would have to identify and correct for all exceptions. e.g., just given your example addresses, here is one set of code that would correct them:
data want (keep=address);
informat address $80.;
input @;
_infile_=translate(_infile_," ",".");
_infile_=compbl(_infile_);
input address &;
number=scan(address,1);
call scan(address,2,position,length);
rest=substr(address,position);
if find(rest,"N N ") then rest=substr(rest,find(rest,"N N ")+2);
else if find(rest,"E E ") then rest=substr(rest,find(rest,"E E ")+2);
else if find(rest,"S S ") then rest=substr(rest,find(rest,"S S ")+2);
else if find(rest,"W W ") then rest=substr(rest,find(rest,"W W ")+2);
if find(rest,"1/2") then rest=substr(rest,find(rest,"1/2")+4);
else if find(rest,"1/3") then rest=substr(rest,find(rest,"1/3")+4);
else if find(rest,"1/4") then rest=substr(rest,find(rest,"1/4")+4);
else if find(rest,"2/3") then rest=substr(rest,find(rest,"2/3")+4);
else if find(rest,"3/4") then rest=substr(rest,find(rest,"3/4")+4);
if find(rest," ST ") then rest=substr(rest,1,find(rest," ST ")+2);
else if find(rest," RD ") then rest=substr(rest,1,find(rest," RD ")+2);
else if find(rest," LN ") then rest=substr(rest,1,find(rest," LN ")+2);
else if find(rest," DR ") then rest=substr(rest,1,find(rest," DR ")+2);
else if find(rest," AVE ") then rest=substr(rest,1,find(rest," AVE ")+3);
else if find(rest," BLVD ") then rest=substr(rest,1,find(rest," BLVD ")+4);
address=catx(" ",number,rest);
cards;
12345 1/2 SAS ST APT C
111 COMPUTER GREEN LN APT1
2222 N. N PROCEDURE RD #2
2255 N N PROCEDURE RD # 3
333 W W. MACRO DR UNIT4
44444 1/4 SEMICOLON AVE STE 5 STE 5
5555 FREQ BLVD 57
;
run;
Thank you! You have been very helpful. This gives me a good starting point. I will look into Dataflux.
Also, how can I separate the street and the street type if they run together like this:
2255 N N PROCEDURERD # 3
The following code will work on your data for now, however, if it is more than ONE incidence, then more complicated syntax will be needed:
data have;
infile cards truncover;
input add $char100.;
cards;
12345 1/2 START ST APT C
111 DRUMMER LN APT1
2222 N. N CIRCULAR RD #2
2255 N N ALLNIGHT RD # 3
333 W W. MACRODRIVEN DR UNIT4
44444 1/4 DROPLIGHT AVE STE 5 STE 5
5555 STANDARDFREQ BLVD 57
2255 N N PROCEDURERD # 3
;
data want;
set have;
retain _r _r1;
_r=prxparse("/ +((st)|(ln)|(rd)|(dr)|(ave)|(cir)|(blvd)) +/i");
_r1=prxparse("/((st)|(ln)|(rd)|(dr)|(ave)|(cir)|(blvd)) +/i");
_t1=prxchange("s/ +\d\/\d +/ /", -1, add);
_t2=prxchange("s/(( +e|w|s|n)( +|\. +)){2}/\1/i", -1, _t1);
call prxsubstr(_r,_t2,_start,_len);
if _len>0 then address=compbl(compress(left(substr(_t2,1,_start+_len-1)),'.'));
else do;
_t3=prxchange("s/(((st)|(ln)|(rd)|(dr)|(ave)|(cir)|(blvd)) +)/ \1/i", -1,_t2);
call prxsubstr(_r,_t3,_start,_len);
address=compbl(compress(left(substr(_t3,1,_start+_len-1)),'.'));
end;
drop _:;
run;
proc print;run;
Haikuo
Thanks, Art. I am trying. Here is an amendment to deal with multiple occurrence of 'RD, ST' etc, the code will only fix the last one:
data have;
infile cards truncover;
input add $char100.;
cards;
12345 1/2 START ST APT C
111 DRUMMER LN APT1
2222 N. N CIRCULAR RD #2
2255 N N ALLNIGHT RD # 3
333 W W. MACRODRIVEN DR UNIT4
44444 1/4 DROPLIGHT AVE STE 5 STE 5
5555 STANDARDFREQ BLVD 57
2255 N N PROCEDURERD # 3
2255 N N INSANEST PROCEDURERD # 3
;
data want;
set have;
retain _r _r1;
_r=prxparse("/ +((st)|(ln)|(rd)|(dr)|(ave)|(cir)|(blvd)) +/i");
_r1=prxparse("/((st)|(ln)|(rd)|(dr)|(ave)|(cir)|(blvd)) +/i");
_t1=prxchange("s/ +\d\/\d +/ /", -1, add);
_t2=prxchange("s/(( +e|w|s|n)( +|\. +)){2}/\1/i", -1, _t1);
call prxsubstr(_r,_t2,_start,_len);
if _len>0 then address=compbl(compress(left(substr(_t2,1,_start+_len-1)),'.'));
else do;
_st=1;
_stop=lengthn(_t2);
call prxnext(_r1,_st,_stop,_t2,_p,_l);
do while (_p > 0);
_pend=_p;
_lend=_l;
call prxnext(_r1,_st,_stop,_t2,_p,_l);
end;
substr(_t2,_pend)=' '||substr(_t2,_pend,_lend);
address=compbl(compress(left(_t2),'.'));
end;
drop _:;
run;
proc print;run;
Haikuo
If you don't mind some PRX function, you may save some typing, the following code works on your as-is data:
data have;
infile cards truncover;
input add $char100.;
cards;
12345 1/2 SAS ST APT C
111 COMPUTER GREEN LN APT1
2222 N. N PROCEDURE RD #2
2255 N N PROCEDURE RD # 3
333 W W. MACRO DR UNIT4
44444 1/4 SEMICOLON AVE STE 5 STE 5
5555 FREQ BLVD 57
;
data want;
set have;
retain _r;
_r=prxparse("/ +(st)|(ln)|(rd)|(dr)|(ave)|(cir)|(blvd) +/i");
_t1=prxchange("s/ +\d\/\d +/ /", -1, add);
_t2=prxchange("s/(( +e|w|s|n)( +|\. +)){2}/\1/i", -1, _t1);
call prxsubstr(_r,_t2,_start,_len);
address=compbl(compress(left(substr(_t2,1,_start+_len-1)),'.'));
drop _:;
run;
proc print;run;
Haikuo
The expressions would require some tightening. e.g., the following would totally fail:
data have;
infile cards truncover;
input add $char100.;
cards;
12345 1/2 START ST APT C
111 DRUMMER LN APT1
2222 N. N CIRCULAR RD #2
2255 N N ALLNIGHT RD # 3
333 W W. MACRODRIVEN DR UNIT4
44444 1/4 DROPLIGHT AVE STE 5 STE 5
5555 STANDARDFREQ BLVD 57
;
data want;
set have;
retain _r;
_r=prxparse("/ +(st)|(ln)|(rd)|(dr)|(ave)|(cir)|(blvd) +/i");
_t1=prxchange("s/ +\d\/\d +/ /", -1, add);
_t2=prxchange("s/(( +e|w|s|n)( +|\. +)){2}/\1/i", -1, _t1);
call prxsubstr(_r,_t2,_start,_len);
address=compbl(compress(left(substr(_t2,1,_start+_len-1)),'.'));
drop _:;
run;
Good catch, Art. I don't really know why it acts that way, have to wait for a regular expression expert, for now, I had to put it as literal:
data have;
infile cards truncover;
input add $char100.;
cards;
12345 1/2 START ST APT C
111 DRUMMER LN APT1
2222 N. N CIRCULAR RD #2
2255 N N ALLNIGHT RD # 3
333 W W. MACRODRIVEN DR UNIT4
44444 1/4 DROPLIGHT AVE STE 5 STE 5
5555 STANDARDFREQ BLVD 57
;
data want;
set have;
retain _r;
_r=prxparse("/( st )|( ln )|( rd )|( dr )|( ave )|( cir )|( blvd )/i");
_t1=prxchange("s/ +\d\/\d +/ /", -1, add);
_t2=prxchange("s/(( +e|w|s|n)( +|\. +)){2}/\1/i", -1, _t1);
call prxsubstr(_r,_t2,_start,_len);
address=compbl(compress(left(substr(_t2,1,_start+_len-1)),'.'));
drop _:;
run;
proc print;run;
Haikuo
I haven't tested the other conditions but, for the one I mentioned, I think you only have to refine the boundary definitions. e.g.:
data want;
set have;
retain _r;
_r=prxparse("/ +(st\b)|(ln\b)|(rd\b)|(dr\b)|(ave\b)|(cir\b)|(blvd\b) +/i");
_t1=prxchange("s/ +\d\/\d +/ /", -1, add);
_t2=prxchange("s/(( +e|w|s|n)( +|\. +)){2}/\1/i", -1, _t1);
call prxsubstr(_r,_t2,_start,_len);
address=compbl(compress(left(substr(_t2,1,_start+_len-1)),'.'));
drop _:;
run;
That is good. I know \b, while never have used it. Learn some. Thanks, Art. You are as sharp as when you are twenties.
Haikuo
: I presume you meant when I was in my twenties. If so, interesting pick of time frame as I wasn't very sharp in my early twenties, in fact flunked out of community college during that time period.
Fortunately, I followed that up with enlisting in the Air Force for four years and, subsequently, used the GI Bill to go back to school and earn a BS, MS and PhD (picking up SAS along the way).
Hi Art,
Out of curiosity, I heard about GI Bill in America. Is there also a GI Bill in Canada or you mean the GI Bill in America?
Thank you!
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
