hi,
I'm looking to build a query to capture records of addresses containing either PoBox (and variations of), or CP (French PoBox).
a) my approach is to use regex PERL, along the lines of
(prxmatch ("m/^CP|\bCP\b|\bC\bP\b|\dCP\b|\dC\bP|\dC\bP\b|\bCP\d|\bC\bP\d| CP |C.P.|^BOX|^B0X|\bBOX\b|\bB\bO\bX|\bB\bOX\b|\dBOX\b|BOX\b\d|BOX \d| BOX | BOX|BOX\d|\bB0X\b|\bB\b0\Bx|\bB\b0\bX\b|B0X|POBOX|\bPO\b|\bP \bO\b/oi",ACCT_ADDR_LINE_1_DESC)> 0 )
(note that there is some redundancy that I need to review for in that script). If someone has a different approach to suggest, please advise?
b) And If someone is familiar with PERL, what is suggested syntax to state if say P and O are separated by zero, one or two lengths (of any content, such as to capture :
PO
P.O
P O
P_O
P-O
P1.O
)
thank you in advance, i'll review suggestions/questions Aug 7th.
the answer is pretty near to @ChrisNZ. with sample you have provided this should work
data have;
informat val $20.;
input id val & $ @@;
datalines;
32237
PO BOX 133
32238
1046 RUE PRINCIPALE
32239
CP 2732
32240
PO BOX 515
32241
PO BOX 8565 STN MAIN
32242
PO BOX 2
32243
PO BOX 252
32244
PO BOX 447
32245
PO BOX 121 STN MAIN
32246
PO BOX 518
32247
PO BOX 82
32248
PO BOX 3185
32249
PO BOX 468
32250
PO BOX 26
32251
CP 1963
32252
PO BOX 475
32253
PO BOX 33
32254
PO BOX 254
32255
PO BOX 277
32256
PO BOX 703
32257
1833 CONCEPTION BAY HWY
32258
PO BOX 733
32259
PO BOX 852
32260
PO BOX 41
32261
30 RTE PRINCIPALE O
32262
PO BOX 283
32263
1064 RUE PRINCIPALE
32264
PO BOX 3125
;
data want;
set have;
if prxmatch("m/^P\s*O\s+BOX \d+/oi",val) > 0 then postal ="English";
else if prxmatch("m/^C\s*P\s+\d+/oi",val) > 0 then postal = "French";
else postal = "unknown";
run;
if you can give some test data of your input and the output you want in form a data Step. It will be an easy task for someone to help.
hi kiranv_, i've attached sample output using following code:
ods output sql_results=a1 (drop=prd_cd);
proc sql NUMBER;
select *
from lib.addresstable (keep=bill_addr_line1 prd_cd) where prd_cd='WANT'
AND
(prxmatch ("m/^CP|\bCP\b|\bC\bP\b|\dCP\b|\dC\bP|\dC\bP\b|\bCP\d|\bC\bP\d| CP |C.P.|^BOX|^B0X|\bBOX\b|\bB\bO\bX|\bB\bOX\b|\dBOX\b|BOX\b\d|BOX \d| BOX | BOX|BOX\d|\bB0X\b|\bB\b0\Bx|\bB\b0\bX\b|B0X|POBOX|\bPO\b|\bP \bO\b/oi",BILL_ADDR_LINE1)> 0 )
;quit;
data b; set a1 (obs=40000);run;
Problem with output:
-It has many false positives (row 15,199 COPPERTREE CRT; row 111,169 RUE PRINCIPALE) and
-It could potentially miss some addresses such as C.P*452 (or some syntactic variation I haven't considered.
For now, I'll proceed to tinker with the code.... Was just curious if someone ever ran a query over very large volume and arrived with an acceptable/low error rate (maybe using machine learning?).
the answer is pretty near to @ChrisNZ. with sample you have provided this should work
data have;
informat val $20.;
input id val & $ @@;
datalines;
32237
PO BOX 133
32238
1046 RUE PRINCIPALE
32239
CP 2732
32240
PO BOX 515
32241
PO BOX 8565 STN MAIN
32242
PO BOX 2
32243
PO BOX 252
32244
PO BOX 447
32245
PO BOX 121 STN MAIN
32246
PO BOX 518
32247
PO BOX 82
32248
PO BOX 3185
32249
PO BOX 468
32250
PO BOX 26
32251
CP 1963
32252
PO BOX 475
32253
PO BOX 33
32254
PO BOX 254
32255
PO BOX 277
32256
PO BOX 703
32257
1833 CONCEPTION BAY HWY
32258
PO BOX 733
32259
PO BOX 852
32260
PO BOX 41
32261
30 RTE PRINCIPALE O
32262
PO BOX 283
32263
1064 RUE PRINCIPALE
32264
PO BOX 3125
;
data want;
set have;
if prxmatch("m/^P\s*O\s+BOX \d+/oi",val) > 0 then postal ="English";
else if prxmatch("m/^C\s*P\s+\d+/oi",val) > 0 then postal = "French";
else postal = "unknown";
run;
It would be preferable to first identify the region and/or language of the address and then use the corresponding conventions to decipher the text.
syntax to state if P and O are separated by zero, one or two spaces
Maybe this \bP\s*O\b or \bP\s{0,2}O\b
of any content
Maybe this \bP[^A-Za-z]*?O\b ( ? makes a lazy pattern match, any "content" but letters)
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.