DATA Step, Macro, Functions and more

Character Functions

Accepted Solution Solved
Reply
New Contributor
Posts: 3
Accepted Solution

Character Functions

Hi Forum,

I'm cleaning about 100,000 addresses and the address variable contains the house number, street direction, street, street type, and unit numbers.  I only need the house number, street direction, street, and street type with one space and without periods. I do not need any 1/2, 1/4, or 3/4 with the housing number.

Here's an example of the type of addresses I have:

   12345 1/2 SAS ST APT C

111 COMPUTER   GREEN LN APT1

2222  N. N PROCEDURE RD #2

  2255 N N PROCEDURE RD # 3

333 W   W. MACRO DR UNIT4

  44444 1/4 SEMICOLON AVE STE 5 STE 5

5555 FREQ BLVD 57

There are leading and trailing blanks and multiple spaces .

This is the output I want:

12345 SAS ST

111 COMPUTER GREEN LN

2222 N PROCEDURE RD

2255 N PROCEDURE RD

333 W MACRO DR

44444 SEMICOLON AVE

5555 FREQ BLVD

How do I do this? Thanks you very much!


Accepted Solutions
Solution
‎11-10-2012 01:45 PM
PROC Star
Posts: 7,356

Re: Character Functions

You may want to look at a SAS product called dataflux.  Short of that, you would have to identify and correct for all exceptions.  e.g., just given your example addresses, here is one set of code that would correct them:

data want (keep=address);

  informat address $80.;

  input @;

  _infile_=translate(_infile_," ",".");

  _infile_=compbl(_infile_);

  input address &;

  number=scan(address,1);

  call scan(address,2,position,length);

  rest=substr(address,position);

  if find(rest,"N N ") then rest=substr(rest,find(rest,"N N ")+2);

  else if find(rest,"E E ") then rest=substr(rest,find(rest,"E E ")+2);

  else if find(rest,"S S ") then rest=substr(rest,find(rest,"S S ")+2);

  else if find(rest,"W W ") then rest=substr(rest,find(rest,"W W ")+2);

  if find(rest,"1/2") then rest=substr(rest,find(rest,"1/2")+4);

  else if find(rest,"1/3") then rest=substr(rest,find(rest,"1/3")+4);

  else if find(rest,"1/4") then rest=substr(rest,find(rest,"1/4")+4);

  else if find(rest,"2/3") then rest=substr(rest,find(rest,"2/3")+4);

  else if find(rest,"3/4") then rest=substr(rest,find(rest,"3/4")+4);

  if find(rest," ST ") then rest=substr(rest,1,find(rest," ST ")+2);

  else if find(rest," RD ") then rest=substr(rest,1,find(rest," RD ")+2);

  else if find(rest," LN ") then rest=substr(rest,1,find(rest," LN ")+2);

  else if find(rest," DR ") then rest=substr(rest,1,find(rest," DR ")+2);

  else if find(rest," AVE ") then rest=substr(rest,1,find(rest," AVE ")+3);

  else if find(rest," BLVD ") then rest=substr(rest,1,find(rest," BLVD ")+4);

  address=catx(" ",number,rest);

  cards;

12345 1/2 SAS ST APT C

111 COMPUTER   GREEN LN APT1

2222  N. N PROCEDURE RD #2

  2255 N N PROCEDURE RD # 3

333 W   W. MACRO DR UNIT4

  44444 1/4 SEMICOLON AVE STE 5 STE 5

5555 FREQ BLVD 57

;

run;

View solution in original post


All Replies
Solution
‎11-10-2012 01:45 PM
PROC Star
Posts: 7,356

Re: Character Functions

You may want to look at a SAS product called dataflux.  Short of that, you would have to identify and correct for all exceptions.  e.g., just given your example addresses, here is one set of code that would correct them:

data want (keep=address);

  informat address $80.;

  input @;

  _infile_=translate(_infile_," ",".");

  _infile_=compbl(_infile_);

  input address &;

  number=scan(address,1);

  call scan(address,2,position,length);

  rest=substr(address,position);

  if find(rest,"N N ") then rest=substr(rest,find(rest,"N N ")+2);

  else if find(rest,"E E ") then rest=substr(rest,find(rest,"E E ")+2);

  else if find(rest,"S S ") then rest=substr(rest,find(rest,"S S ")+2);

  else if find(rest,"W W ") then rest=substr(rest,find(rest,"W W ")+2);

  if find(rest,"1/2") then rest=substr(rest,find(rest,"1/2")+4);

  else if find(rest,"1/3") then rest=substr(rest,find(rest,"1/3")+4);

  else if find(rest,"1/4") then rest=substr(rest,find(rest,"1/4")+4);

  else if find(rest,"2/3") then rest=substr(rest,find(rest,"2/3")+4);

  else if find(rest,"3/4") then rest=substr(rest,find(rest,"3/4")+4);

  if find(rest," ST ") then rest=substr(rest,1,find(rest," ST ")+2);

  else if find(rest," RD ") then rest=substr(rest,1,find(rest," RD ")+2);

  else if find(rest," LN ") then rest=substr(rest,1,find(rest," LN ")+2);

  else if find(rest," DR ") then rest=substr(rest,1,find(rest," DR ")+2);

  else if find(rest," AVE ") then rest=substr(rest,1,find(rest," AVE ")+3);

  else if find(rest," BLVD ") then rest=substr(rest,1,find(rest," BLVD ")+4);

  address=catx(" ",number,rest);

  cards;

12345 1/2 SAS ST APT C

111 COMPUTER   GREEN LN APT1

2222  N. N PROCEDURE RD #2

  2255 N N PROCEDURE RD # 3

333 W   W. MACRO DR UNIT4

  44444 1/4 SEMICOLON AVE STE 5 STE 5

5555 FREQ BLVD 57

;

run;

New Contributor
Posts: 3

Re: Character Functions

Thank you! You have been very helpful. This gives me a good starting point. I will look into Dataflux.

New Contributor
Posts: 3

Re: Character Functions

Also, how can I separate the street and the street type if they run together like this:


2255 N N PROCEDURERD # 3

Respected Advisor
Posts: 3,124

Re: Character Functions

The following code will work on your data for now, however, if it is more than ONE incidence, then more complicated syntax will be needed:

data have;

infile cards truncover;

input add $char100.;

cards;

   12345 1/2 START ST APT C

111 DRUMMER LN APT1

2222  N. N CIRCULAR RD #2

  2255 N N ALLNIGHT RD # 3

333 W   W. MACRODRIVEN DR UNIT4

  44444 1/4 DROPLIGHT AVE STE 5 STE 5

5555 STANDARDFREQ BLVD 57

2255 N N PROCEDURERD # 3

;

data want;

   set have;

   retain _r _r1;

    _r=prxparse("/ +((st)|(ln)|(rd)|(dr)|(ave)|(cir)|(blvd)) +/i");

    _r1=prxparse("/((st)|(ln)|(rd)|(dr)|(ave)|(cir)|(blvd)) +/i");

   _t1=prxchange("s/ +\d\/\d +/ /", -1, add);

    _t2=prxchange("s/(( +e|w|s|n)( +|\. +)){2}/\1/i", -1, _t1);

    call prxsubstr(_r,_t2,_start,_len);

    if _len>0 then address=compbl(compress(left(substr(_t2,1,_start+_len-1)),'.'));

    else do;

      _t3=prxchange("s/(((st)|(ln)|(rd)|(dr)|(ave)|(cir)|(blvd)) +)/ \1/i", -1,_t2);

      call prxsubstr(_r,_t3,_start,_len);

      address=compbl(compress(left(substr(_t3,1,_start+_len-1)),'.'));

    end;

  drop _:;

  run;

  proc print;run;

Haikuo

PROC Star
Posts: 7,356

Re: Character Functions

: Well done!

Respected Advisor
Posts: 3,124

Re: Character Functions

Thanks, Art. I am trying. Here is an amendment to deal with multiple occurrence of 'RD, ST' etc, the code will only fix the last one:

data have;

infile cards truncover;

input add $char100.;

cards;

   12345 1/2 START ST APT C

111 DRUMMER LN APT1

2222  N. N CIRCULAR RD #2

  2255 N N ALLNIGHT RD # 3

333 W   W. MACRODRIVEN DR UNIT4

  44444 1/4 DROPLIGHT AVE STE 5 STE 5

5555 STANDARDFREQ BLVD 57

2255 N N PROCEDURERD # 3

2255 N N INSANEST PROCEDURERD # 3

;

data want;

   set have;

   retain _r _r1;

    _r=prxparse("/ +((st)|(ln)|(rd)|(dr)|(ave)|(cir)|(blvd)) +/i");

    _r1=prxparse("/((st)|(ln)|(rd)|(dr)|(ave)|(cir)|(blvd)) +/i");

   _t1=prxchange("s/ +\d\/\d +/ /", -1, add);

    _t2=prxchange("s/(( +e|w|s|n)( +|\. +)){2}/\1/i", -1, _t1);

    call prxsubstr(_r,_t2,_start,_len);

    if _len>0 then address=compbl(compress(left(substr(_t2,1,_start+_len-1)),'.'));

    else do;

    _st=1;

    _stop=lengthn(_t2);

    call prxnext(_r1,_st,_stop,_t2,_p,_l);

      do while (_p > 0);

        _pend=_p;

        _lend=_l;

           call prxnext(_r1,_st,_stop,_t2,_p,_l);

      end;

      substr(_t2,_pend)=' '||substr(_t2,_pend,_lend);

      address=compbl(compress(left(_t2),'.'));

    end;

  drop _:;

  run;

  proc print;run;

Haikuo

Respected Advisor
Posts: 3,124

Re: Character Functions

If you don't mind some PRX function, you may save some typing, the following code works on your as-is data:

data have;

infile cards truncover;

input add $char100.;

cards;

   12345 1/2 SAS ST APT C

111 COMPUTER   GREEN LN APT1

2222  N. N PROCEDURE RD #2

  2255 N N PROCEDURE RD # 3

333 W   W. MACRO DR UNIT4

  44444 1/4 SEMICOLON AVE STE 5 STE 5

5555 FREQ BLVD 57

;

data want;

   set have;

   retain _r;

    _r=prxparse("/ +(st)|(ln)|(rd)|(dr)|(ave)|(cir)|(blvd) +/i");

    _t1=prxchange("s/ +\d\/\d +/ /", -1, add);

    _t2=prxchange("s/(( +e|w|s|n)( +|\. +)){2}/\1/i", -1, _t1);

    call prxsubstr(_r,_t2,_start,_len);

    address=compbl(compress(left(substr(_t2,1,_start+_len-1)),'.'));

  drop _:;

  run;

proc print;run;

Haikuo

PROC Star
Posts: 7,356

Re: Character Functions

The expressions would require some tightening.  e.g., the following would totally fail:

data have;

infile cards truncover;

input add $char100.;

cards;

   12345 1/2 START ST APT C

111 DRUMMER LN APT1

2222  N. N CIRCULAR RD #2

  2255 N N ALLNIGHT RD # 3

333 W   W. MACRODRIVEN DR UNIT4

  44444 1/4 DROPLIGHT AVE STE 5 STE 5

5555 STANDARDFREQ BLVD 57

;

data want;

   set have;

   retain _r;

    _r=prxparse("/ +(st)|(ln)|(rd)|(dr)|(ave)|(cir)|(blvd) +/i");

    _t1=prxchange("s/ +\d\/\d +/ /", -1, add);

    _t2=prxchange("s/(( +e|w|s|n)( +|\. +)){2}/\1/i", -1, _t1);

    call prxsubstr(_r,_t2,_start,_len);

    address=compbl(compress(left(substr(_t2,1,_start+_len-1)),'.'));

  drop _:;

  run;

Respected Advisor
Posts: 3,124

Re: Character Functions

Good catch, Art. I don't really know why it acts that way, have to wait for a regular expression expert, for now, I had to put it as literal:

data have;

infile cards truncover;

input add $char100.;

cards;

   12345 1/2 START ST APT C

111 DRUMMER LN APT1

2222  N. N CIRCULAR RD #2

  2255 N N ALLNIGHT RD # 3

333 W   W. MACRODRIVEN DR UNIT4

  44444 1/4 DROPLIGHT AVE STE 5 STE 5

5555 STANDARDFREQ BLVD 57

;

data want;

   set have;

   retain _r;

    _r=prxparse("/( st )|( ln )|( rd )|( dr )|( ave )|( cir )|( blvd )/i");

   _t1=prxchange("s/ +\d\/\d +/ /", -1, add);

    _t2=prxchange("s/(( +e|w|s|n)( +|\. +)){2}/\1/i", -1, _t1);

    call prxsubstr(_r,_t2,_start,_len);

    address=compbl(compress(left(substr(_t2,1,_start+_len-1)),'.'));

  drop _:;

  run;

  proc print;run;

Haikuo

PROC Star
Posts: 7,356

Re: Character Functions

I haven't tested the other conditions but, for the one I mentioned, I think you only have to refine the boundary definitions.  e.g.:

data want;

   set have;

   retain _r;

    _r=prxparse("/ +(st\b)|(ln\b)|(rd\b)|(dr\b)|(ave\b)|(cir\b)|(blvd\b) +/i");

    _t1=prxchange("s/ +\d\/\d +/ /", -1, add);

    _t2=prxchange("s/(( +e|w|s|n)( +|\. +)){2}/\1/i", -1, _t1);

    call prxsubstr(_r,_t2,_start,_len);

    address=compbl(compress(left(substr(_t2,1,_start+_len-1)),'.'));

  drop _:;

  run;

Respected Advisor
Posts: 3,124

Re: Character Functions

That is good. I know \b, while never have used it. Learn some. Thanks, Art. You are as sharp as when you are twenties.

Haikuo

PROC Star
Posts: 7,356

Re: Character Functions

: I presume you meant when I was in my twenties.  If so, interesting pick of time frame as I wasn't very sharp in my early twenties, in fact flunked out of community college during that time period.

Fortunately, I followed that up with enlisting in the Air Force for four years and, subsequently, used the GI Bill to go back to school and earn a BS, MS and PhD (picking up SAS along the way).

Super Contributor
Posts: 1,636

Re: Character Functions

Hi Art,

Out of curiosity, I heard about GI Bill in America. Is there also a GI Bill in Canada or you mean the GI Bill in America?

Thank you!

PROC Star
Posts: 7,356

Re: Character Functions

: America.  I was in the US Air Force.

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 20 replies
  • 481 views
  • 6 likes
  • 5 in conversation