BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
axeloop
Calcite | Level 5

Hi Forum,

I'm cleaning about 100,000 addresses and the address variable contains the house number, street direction, street, street type, and unit numbers.  I only need the house number, street direction, street, and street type with one space and without periods. I do not need any 1/2, 1/4, or 3/4 with the housing number.

Here's an example of the type of addresses I have:

   12345 1/2 SAS ST APT C

111 COMPUTER   GREEN LN APT1

2222  N. N PROCEDURE RD #2

  2255 N N PROCEDURE RD # 3

333 W   W. MACRO DR UNIT4

  44444 1/4 SEMICOLON AVE STE 5 STE 5

5555 FREQ BLVD 57

There are leading and trailing blanks and multiple spaces .

This is the output I want:

12345 SAS ST

111 COMPUTER GREEN LN

2222 N PROCEDURE RD

2255 N PROCEDURE RD

333 W MACRO DR

44444 SEMICOLON AVE

5555 FREQ BLVD

How do I do this? Thanks you very much!

1 ACCEPTED SOLUTION

Accepted Solutions
art297
Opal | Level 21

You may want to look at a SAS product called dataflux.  Short of that, you would have to identify and correct for all exceptions.  e.g., just given your example addresses, here is one set of code that would correct them:

data want (keep=address);

  informat address $80.;

  input @;

  _infile_=translate(_infile_," ",".");

  _infile_=compbl(_infile_);

  input address &;

  number=scan(address,1);

  call scan(address,2,position,length);

  rest=substr(address,position);

  if find(rest,"N N ") then rest=substr(rest,find(rest,"N N ")+2);

  else if find(rest,"E E ") then rest=substr(rest,find(rest,"E E ")+2);

  else if find(rest,"S S ") then rest=substr(rest,find(rest,"S S ")+2);

  else if find(rest,"W W ") then rest=substr(rest,find(rest,"W W ")+2);

  if find(rest,"1/2") then rest=substr(rest,find(rest,"1/2")+4);

  else if find(rest,"1/3") then rest=substr(rest,find(rest,"1/3")+4);

  else if find(rest,"1/4") then rest=substr(rest,find(rest,"1/4")+4);

  else if find(rest,"2/3") then rest=substr(rest,find(rest,"2/3")+4);

  else if find(rest,"3/4") then rest=substr(rest,find(rest,"3/4")+4);

  if find(rest," ST ") then rest=substr(rest,1,find(rest," ST ")+2);

  else if find(rest," RD ") then rest=substr(rest,1,find(rest," RD ")+2);

  else if find(rest," LN ") then rest=substr(rest,1,find(rest," LN ")+2);

  else if find(rest," DR ") then rest=substr(rest,1,find(rest," DR ")+2);

  else if find(rest," AVE ") then rest=substr(rest,1,find(rest," AVE ")+3);

  else if find(rest," BLVD ") then rest=substr(rest,1,find(rest," BLVD ")+4);

  address=catx(" ",number,rest);

  cards;

12345 1/2 SAS ST APT C

111 COMPUTER   GREEN LN APT1

2222  N. N PROCEDURE RD #2

  2255 N N PROCEDURE RD # 3

333 W   W. MACRO DR UNIT4

  44444 1/4 SEMICOLON AVE STE 5 STE 5

5555 FREQ BLVD 57

;

run;

View solution in original post

20 REPLIES 20
art297
Opal | Level 21

You may want to look at a SAS product called dataflux.  Short of that, you would have to identify and correct for all exceptions.  e.g., just given your example addresses, here is one set of code that would correct them:

data want (keep=address);

  informat address $80.;

  input @;

  _infile_=translate(_infile_," ",".");

  _infile_=compbl(_infile_);

  input address &;

  number=scan(address,1);

  call scan(address,2,position,length);

  rest=substr(address,position);

  if find(rest,"N N ") then rest=substr(rest,find(rest,"N N ")+2);

  else if find(rest,"E E ") then rest=substr(rest,find(rest,"E E ")+2);

  else if find(rest,"S S ") then rest=substr(rest,find(rest,"S S ")+2);

  else if find(rest,"W W ") then rest=substr(rest,find(rest,"W W ")+2);

  if find(rest,"1/2") then rest=substr(rest,find(rest,"1/2")+4);

  else if find(rest,"1/3") then rest=substr(rest,find(rest,"1/3")+4);

  else if find(rest,"1/4") then rest=substr(rest,find(rest,"1/4")+4);

  else if find(rest,"2/3") then rest=substr(rest,find(rest,"2/3")+4);

  else if find(rest,"3/4") then rest=substr(rest,find(rest,"3/4")+4);

  if find(rest," ST ") then rest=substr(rest,1,find(rest," ST ")+2);

  else if find(rest," RD ") then rest=substr(rest,1,find(rest," RD ")+2);

  else if find(rest," LN ") then rest=substr(rest,1,find(rest," LN ")+2);

  else if find(rest," DR ") then rest=substr(rest,1,find(rest," DR ")+2);

  else if find(rest," AVE ") then rest=substr(rest,1,find(rest," AVE ")+3);

  else if find(rest," BLVD ") then rest=substr(rest,1,find(rest," BLVD ")+4);

  address=catx(" ",number,rest);

  cards;

12345 1/2 SAS ST APT C

111 COMPUTER   GREEN LN APT1

2222  N. N PROCEDURE RD #2

  2255 N N PROCEDURE RD # 3

333 W   W. MACRO DR UNIT4

  44444 1/4 SEMICOLON AVE STE 5 STE 5

5555 FREQ BLVD 57

;

run;

axeloop
Calcite | Level 5

Thank you! You have been very helpful. This gives me a good starting point. I will look into Dataflux.

axeloop
Calcite | Level 5

Also, how can I separate the street and the street type if they run together like this:


2255 N N PROCEDURERD # 3

Haikuo
Onyx | Level 15

The following code will work on your data for now, however, if it is more than ONE incidence, then more complicated syntax will be needed:

data have;

infile cards truncover;

input add $char100.;

cards;

   12345 1/2 START ST APT C

111 DRUMMER LN APT1

2222  N. N CIRCULAR RD #2

  2255 N N ALLNIGHT RD # 3

333 W   W. MACRODRIVEN DR UNIT4

  44444 1/4 DROPLIGHT AVE STE 5 STE 5

5555 STANDARDFREQ BLVD 57

2255 N N PROCEDURERD # 3

;

data want;

   set have;

   retain _r _r1;

    _r=prxparse("/ +((st)|(ln)|(rd)|(dr)|(ave)|(cir)|(blvd)) +/i");

    _r1=prxparse("/((st)|(ln)|(rd)|(dr)|(ave)|(cir)|(blvd)) +/i");

   _t1=prxchange("s/ +\d\/\d +/ /", -1, add);

    _t2=prxchange("s/(( +e|w|s|n)( +|\. +)){2}/\1/i", -1, _t1);

    call prxsubstr(_r,_t2,_start,_len);

    if _len>0 then address=compbl(compress(left(substr(_t2,1,_start+_len-1)),'.'));

    else do;

      _t3=prxchange("s/(((st)|(ln)|(rd)|(dr)|(ave)|(cir)|(blvd)) +)/ \1/i", -1,_t2);

      call prxsubstr(_r,_t3,_start,_len);

      address=compbl(compress(left(substr(_t3,1,_start+_len-1)),'.'));

    end;

  drop _:;

  run;

  proc print;run;

Haikuo

art297
Opal | Level 21

: Well done!

Haikuo
Onyx | Level 15

Thanks, Art. I am trying. Here is an amendment to deal with multiple occurrence of 'RD, ST' etc, the code will only fix the last one:

data have;

infile cards truncover;

input add $char100.;

cards;

   12345 1/2 START ST APT C

111 DRUMMER LN APT1

2222  N. N CIRCULAR RD #2

  2255 N N ALLNIGHT RD # 3

333 W   W. MACRODRIVEN DR UNIT4

  44444 1/4 DROPLIGHT AVE STE 5 STE 5

5555 STANDARDFREQ BLVD 57

2255 N N PROCEDURERD # 3

2255 N N INSANEST PROCEDURERD # 3

;

data want;

   set have;

   retain _r _r1;

    _r=prxparse("/ +((st)|(ln)|(rd)|(dr)|(ave)|(cir)|(blvd)) +/i");

    _r1=prxparse("/((st)|(ln)|(rd)|(dr)|(ave)|(cir)|(blvd)) +/i");

   _t1=prxchange("s/ +\d\/\d +/ /", -1, add);

    _t2=prxchange("s/(( +e|w|s|n)( +|\. +)){2}/\1/i", -1, _t1);

    call prxsubstr(_r,_t2,_start,_len);

    if _len>0 then address=compbl(compress(left(substr(_t2,1,_start+_len-1)),'.'));

    else do;

    _st=1;

    _stop=lengthn(_t2);

    call prxnext(_r1,_st,_stop,_t2,_p,_l);

      do while (_p > 0);

        _pend=_p;

        _lend=_l;

           call prxnext(_r1,_st,_stop,_t2,_p,_l);

      end;

      substr(_t2,_pend)=' '||substr(_t2,_pend,_lend);

      address=compbl(compress(left(_t2),'.'));

    end;

  drop _:;

  run;

  proc print;run;

Haikuo

Haikuo
Onyx | Level 15

If you don't mind some PRX function, you may save some typing, the following code works on your as-is data:

data have;

infile cards truncover;

input add $char100.;

cards;

   12345 1/2 SAS ST APT C

111 COMPUTER   GREEN LN APT1

2222  N. N PROCEDURE RD #2

  2255 N N PROCEDURE RD # 3

333 W   W. MACRO DR UNIT4

  44444 1/4 SEMICOLON AVE STE 5 STE 5

5555 FREQ BLVD 57

;

data want;

   set have;

   retain _r;

    _r=prxparse("/ +(st)|(ln)|(rd)|(dr)|(ave)|(cir)|(blvd) +/i");

    _t1=prxchange("s/ +\d\/\d +/ /", -1, add);

    _t2=prxchange("s/(( +e|w|s|n)( +|\. +)){2}/\1/i", -1, _t1);

    call prxsubstr(_r,_t2,_start,_len);

    address=compbl(compress(left(substr(_t2,1,_start+_len-1)),'.'));

  drop _:;

  run;

proc print;run;

Haikuo

art297
Opal | Level 21

The expressions would require some tightening.  e.g., the following would totally fail:

data have;

infile cards truncover;

input add $char100.;

cards;

   12345 1/2 START ST APT C

111 DRUMMER LN APT1

2222  N. N CIRCULAR RD #2

  2255 N N ALLNIGHT RD # 3

333 W   W. MACRODRIVEN DR UNIT4

  44444 1/4 DROPLIGHT AVE STE 5 STE 5

5555 STANDARDFREQ BLVD 57

;

data want;

   set have;

   retain _r;

    _r=prxparse("/ +(st)|(ln)|(rd)|(dr)|(ave)|(cir)|(blvd) +/i");

    _t1=prxchange("s/ +\d\/\d +/ /", -1, add);

    _t2=prxchange("s/(( +e|w|s|n)( +|\. +)){2}/\1/i", -1, _t1);

    call prxsubstr(_r,_t2,_start,_len);

    address=compbl(compress(left(substr(_t2,1,_start+_len-1)),'.'));

  drop _:;

  run;

Haikuo
Onyx | Level 15

Good catch, Art. I don't really know why it acts that way, have to wait for a regular expression expert, for now, I had to put it as literal:

data have;

infile cards truncover;

input add $char100.;

cards;

   12345 1/2 START ST APT C

111 DRUMMER LN APT1

2222  N. N CIRCULAR RD #2

  2255 N N ALLNIGHT RD # 3

333 W   W. MACRODRIVEN DR UNIT4

  44444 1/4 DROPLIGHT AVE STE 5 STE 5

5555 STANDARDFREQ BLVD 57

;

data want;

   set have;

   retain _r;

    _r=prxparse("/( st )|( ln )|( rd )|( dr )|( ave )|( cir )|( blvd )/i");

   _t1=prxchange("s/ +\d\/\d +/ /", -1, add);

    _t2=prxchange("s/(( +e|w|s|n)( +|\. +)){2}/\1/i", -1, _t1);

    call prxsubstr(_r,_t2,_start,_len);

    address=compbl(compress(left(substr(_t2,1,_start+_len-1)),'.'));

  drop _:;

  run;

  proc print;run;

Haikuo

art297
Opal | Level 21

I haven't tested the other conditions but, for the one I mentioned, I think you only have to refine the boundary definitions.  e.g.:

data want;

   set have;

   retain _r;

    _r=prxparse("/ +(st\b)|(ln\b)|(rd\b)|(dr\b)|(ave\b)|(cir\b)|(blvd\b) +/i");

    _t1=prxchange("s/ +\d\/\d +/ /", -1, add);

    _t2=prxchange("s/(( +e|w|s|n)( +|\. +)){2}/\1/i", -1, _t1);

    call prxsubstr(_r,_t2,_start,_len);

    address=compbl(compress(left(substr(_t2,1,_start+_len-1)),'.'));

  drop _:;

  run;

Haikuo
Onyx | Level 15

That is good. I know \b, while never have used it. Learn some. Thanks, Art. You are as sharp as when you are twenties.

Haikuo

art297
Opal | Level 21

: I presume you meant when I was in my twenties.  If so, interesting pick of time frame as I wasn't very sharp in my early twenties, in fact flunked out of community college during that time period.

Fortunately, I followed that up with enlisting in the Air Force for four years and, subsequently, used the GI Bill to go back to school and earn a BS, MS and PhD (picking up SAS along the way).

Linlin
Lapis Lazuli | Level 10

Hi Art,

Out of curiosity, I heard about GI Bill in America. Is there also a GI Bill in Canada or you mean the GI Bill in America?

Thank you!

art297
Opal | Level 21

: America.  I was in the US Air Force.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 20 replies
  • 3019 views
  • 6 likes
  • 5 in conversation