Hello all,
I am trying to do some address matching and am attempting pre-clean up and want to remove (ST|ND|RD|TH) from any address line to have better matching where some add it some do not... of course I have to match in the case where the word starts with a number not a street name like Sand street.
have:
1104 E 148TH ST
1150 W 10TH ST
6836 34TH ST
4260 54TH ST
315 W 14TH ST
215 9TH ST
5601 47TH AVE
115 1ST AVE
1100 47TH ST
392 16TH ST
379 12TH ST
5550 N 1ST ST
4411 N 1ST ST
1325 E 4TH ST
1301 W 182ND ST
1230 WEST 177TH STREET
1023 14TH ST
750 W 10TH ST
321 16TH ST
1400 E 8TH ST
103 9TH STREET
125 EAST 6TH
418 W 8TH ST
131 E 28TH AVE
2565 E 58TH ST
631 12TH STREET
312 W 7TH ST
205 S 1ST ST
1900 18TH AVENUE
42145 30TH ST W
44514 20TH ST W
44701 32ND ST W
1350 3RD STREET
want
1104 E 148 ST
1150 W 10 ST
6836 34 ST
4260 54 ST
315 W 14 ST
215 9 ST
5601 47 AVE
115 1 AVE
1100 47 ST
392 16 ST
379 12 ST
5550 N 1 ST
4411 N 1 ST
1325 E 4 ST
1301 W 182 ST
1230 WEST 177 STREET
1023 14 ST
750 W 10 ST
321 16 ST
1400 E 8 ST
103 9 STREET
125 EAST 6
418 W 8 ST
131 E 28 AVE
2565 E 58 ST
631 12 STREET
312 W 7 ST
205 S 1 ST
1900 18 AVENUE
42145 30 ST W
44514 20 ST W
44701 32 ST W
1350 3 STREET
proc sql;
create table my_clean as
SELECT Id
,Name
,Country
,Address1
/* This code does not work */
,compbl(prxchange('s/\s\d{1,}(ST|ND|RD|TH)\s/\s\d{1,}()\s/', -1, upper(Address1))) as clean_Address1
/* This line works just fine to find the bad guys */
,prxmatch('/\s\d{1,}(ST|ND|RD|TH)\s/', upper(Address1)) as find_numsuffix1
,Address2
,compbl(upper(Address2)) as clean_Address2
/* This line works just fine to find the bad guys too */
,prxmatch('/\s\d{1,}(ST|ND|RD|TH)\s/', upper(Address2)) as find_numsuffix2
,City
,State
,Zip
FROM my_source
;QUIT;
example junk I am getting back:
5550 N\s\d{1,}()\sST
4411 N\s\d{1,}()\sST
1325 E\s\d{1,}()\sST
1301 W\s\d{1,}()\sST
1230 WEST\s\d{1,}()\sSTREET
1023\s\d{1,}()\sST
750 W\s\d{1,}()\sST
321\s\d{1,}()\sST
1400 E\s\d{1,}()\sST
103\s\d{1,}()\sSTREET
125 EAST\s\d{1,}()\s
418 W\s\d{1,}()\sST
131 E\s\d{1,}()\sAVE
2565 E\s\d{1,}()\sST
631\s\d{1,}()\sSTREET
312 W\s\d{1,}()\sST
205 S\s\d{1,}()\sST
1900\s\d{1,}()\sAVENUE
42145\s\d{1,}()\sST W
44514\s\d{1,}()\sST W
44701\s\d{1,}()\sST W
1350\s\d{1,}()\sSTREET
44662\s\d{1,}()\sST W
235 E\s\d{1,}()\sST
4400 E\s\d{1,}()\sST
325 E\s\d{1,}()\sST
1575 WEST\s\d{1,}()\sSTREET
in this case it is taking my replacement string as literal and It needs to bring back the part of the number and space it matched on leaving the
ST | ND | RD | TH behind. I am not sure I have seen an example like this before. if so I have forgotten. I hope there is an easer way, that does not require a SAS server module because we do not have the software or budget.
PS data step or other options are okay too. I just mainly write proc sql code... TIA -KJ
Hello @kjohnsonm,
@kjohnsonm wrote:
It needs to bring back the part of the number and space it matched on leaving the ST | ND | RD | TH behind.
One way to achieve this is to put those two parts in parentheses in the search pattern and then refer to these "capturing groups" by their numbers (i.e. 1 and 3) in the form $1$3 or \1\3.
's/(\s\d{1,})(ST|ND|RD|TH)(\s)/$1$3/'
Edit: Another option is to use look-ahead and look-behind behavior:
's/(?<=\d)(ST|ND|RD|TH)(?=\s)//'
That is, the first and third part of the pattern are not included in the final match so that the replacement pattern (i.e., for (ST|ND|RD|TH)) is just empty. Those look-ahead and look-behind assertions work only with fixed-width patterns, which is why I had to reduce the variable-width pattern \s\d{1,} to \d. For your sample data it works as well, but the first suggestion is closer to your original plan.
Hello @kjohnsonm,
@kjohnsonm wrote:
It needs to bring back the part of the number and space it matched on leaving the ST | ND | RD | TH behind.
One way to achieve this is to put those two parts in parentheses in the search pattern and then refer to these "capturing groups" by their numbers (i.e. 1 and 3) in the form $1$3 or \1\3.
's/(\s\d{1,})(ST|ND|RD|TH)(\s)/$1$3/'
Edit: Another option is to use look-ahead and look-behind behavior:
's/(?<=\d)(ST|ND|RD|TH)(?=\s)//'
That is, the first and third part of the pattern are not included in the final match so that the replacement pattern (i.e., for (ST|ND|RD|TH)) is just empty. Those look-ahead and look-behind assertions work only with fixed-width patterns, which is why I had to reduce the variable-width pattern \s\d{1,} to \d. For your sample data it works as well, but the first suggestion is closer to your original plan.
Or simply:
prxchange('s/\b(\d+)(st|nd|rd|th)\b/\1/i', 1, Address1)
\b matches any word boundary
\d+ matches one or more digits
\1 brings the match from the first set of parentheses (i.e. the digits)
the i suffix makes the match case insensitive
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.