Hi there,
I need to remove the apartment number (secondary unit) in some dirty data. The apartment number comes in many different flavors such as shown in the code below. I am trying to use PRXCHANGE to remove all matches, along with a negative lookahead such as (?!TH) to not match any street names such as '108TH'. The negative lookbehind can be expanded to (?!ST|ND|RD|TH) for firST, secoND, thiRD, and any i-th.
I need help concocting a regex for this. So far my regex is not working, I can get it to take all words with numbers out, but can't get the negative look ahead to work in conjunction.
Currently, I have "s/\b(([#A-Z]*)([0-9]+)([#A-Z]*))(?!ST|ND|RD|TH)\bs//"
s/ is required for PRXCHANGE. I have not gotten it to work otherwise.
\b is for the word boundary
([#A-Z]*) is to match any number of alpha characters plus #
([0-9]+) the word needs to have at least one numeric character
([#A-Z]*) the alpha characters can also appear afterwards
(?!ST|ND|RD|TH) is the negative look ahead
data dirty;
var='W EXAMPLE ROAD #707'; output;
var='N 108TH STREET'; output;
var='S MAIN #D44'; output;
var='SOUTH OAK ROAD 1C'; output;
var='EAST MAIN STREET APT 4 B'; output;
run;
data clean;
set dirty;
*Remove string matches with numeric optionally mixed with alpha (to remove apartment numbers such as: 3B, 3, B3, #3, 3#, #3B);
*^(?!TH) specifies not to match words ending with 'TH';
*s/ is required at both ends. I dont know why, something about replacement text.;
var2 = PRXCHANGE("s/\b(([#A-Z]*)([0-9]+)([#A-Z]*))(?!TH)\bs//",-1, var);
run;
The output should be:
'W EXAMPLE ROAD'
'N 108TH STREET'
'S MAIN'
'SOUTH OAK ROAD'
'EAST MAIN STREET APT'
Edit: to add expected output and change 'look behind' to 'look ahead'
what should be the output from the data you have given
Sorry, I should have included this in the original post. It should look like this:
'W EXAMPLE ROAD'
'N 108TH STREET'
'S MAIN'
'SOUTH OAK ROAD'
'EAST MAIN STREET APT'I already have code to remove second unit designators such as 'APT', 'APARTMENT', 'UNIT', ect.
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Still thinking about your presentation idea? The submission deadline has been extended to Friday, Nov. 14, at 11:59 p.m. ET.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.