remove string from within string

Reply
Valued Guide
Posts: 858

remove string from within string

I am trying to match addresses and want to remove anything that may look like unit or apt.  I am hoping to get the following from the first line, I'll probably be doing the same thing next and removing ave lane or anything else that looks like that:

100CENTERAVE1

data have;

infile cards dsd;

informat address $25.;

input address;

cards;

100 Center Ave Apt 1

200 Center Ave Unit 1

;

run;

Here is what I have so far but I need to remove the Apt and Unit too.

data want;

set have;

new_address = compress(upcase(address));

run;

Super User
Posts: 11,105

Re: remove string from within string

Good luck with this. At least you are getting a consistent case to search

When I was tasked with a similar task some things to help:

Use Indexw or Findw to find the indicator words. Use the position returned in a Substr call (which gets fun if you get:

Apt 123 111 Main St.

If your data is as bad as the stuff I dealt with you'll want to search for

APT APT.   (note there is a period there) SPACE (trailer parks) and SP or SPC as well as TRAILER TRLR and a number of permutations

UNIT ROOM possibly BUILDING or BLDG

You may also have NUMBER or NO as 100 Center Ave No 3, plus the # used for number, 100 Center Ave # 3

or just a single Letter: 100 Center Ave Q

a 1/2 may be indicative as well   100 1/2 Center Ave

data want;

set have;

new_address =upcase(address); /* Because word boundaries may be important this may not be the best approach, working with a modified version

of your original variable is good though.*/

pos = indexw(new_address, 'APT');

/* if the apartment indicator is always last then something like*/

if pos > 0 then new_address = substr(new_address,1, pos-2);

run;

This could be done in a loop such as

length word $ 10.;

do word = 'APT','UNIT','SPACE','TRAILER';

     pos = indexw(new_address, WORD);

     /* if the apartment indicator is always last then something like*/

     if pos > 0 then new_address = substr(new_address,1, pos-2);

end;

The case of #3 gets needs to be handled differently as you are searching for a character that may not be separated from the number so indexw or findw aren't the best. Index searching for '#' may work but be prepared for RR #4 where the number is the rural route and not an apartment, or HWY #55 (highway or avenue or ...)

When it comes to Ave (Avenue, Aven Av ad nauseum) you may also need to be careful with names like Avenue B, especially if you are removing Street as you may have addresses on B Street. Then a lone B in the street variable may lose too much information.

Super User
Posts: 5,353

Re: remove string from within string

Just a small piece of the puzzle ...  I would not compress the data as an initial step.  How will you differentiate APT and UNIT within CAPTURE AVE or UNITED WAY?

Ask a Question
Discussion stats
  • 2 replies
  • 241 views
  • 7 likes
  • 3 in conversation