Search CHAR string for PO BOX variation (PERL)

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 128
Accepted Solution

Search CHAR string for PO BOX variation (PERL)

hi,

I'm looking to build a query to capture records of addresses containing either PoBox (and variations of), or CP (French PoBox).

 

a) my approach is to use regex PERL, along the lines of 

 

(prxmatch ("m/^CP|\bCP\b|\bC\bP\b|\dCP\b|\dC\bP|\dC\bP\b|\bCP\d|\bC\bP\d| CP |C.P.|^BOX|^B0X|\bBOX\b|\bB\bO\bX|\bB\bOX\b|\dBOX\b|BOX\b\d|BOX \d| BOX | BOX|BOX\d|\bB0X\b|\bB\b0\Bx|\bB\b0\bX\b|B0X|POBOX|\bPO\b|\bP \bO\b/oi",ACCT_ADDR_LINE_1_DESC)> 0 )

(note that there is some redundancy that I need to review for in that script). If someone has a different approach to suggest, please advise?

 

b) And If someone is familiar with PERL, what is suggested syntax to state if say P and O are separated by zero, one or two lengths (of any content, such as to capture :

 

PO

P.O

P  O

P_O

P-O

P1.O

)

 

 

thank you in advance, i'll review suggestions/questions Aug 7th. 


Accepted Solutions
Solution
2 weeks ago
PROC Star
Posts: 549

Re: Search CHAR string for PO BOX variation (PERL)

the answer is pretty near to @ChrisNZ. with sample you have provided this should work

data have;
informat val $20.;
input id val & $ @@;
datalines;
32237
PO BOX 133
32238
1046 RUE PRINCIPALE
32239
CP 2732
32240
PO BOX 515
32241
PO BOX 8565 STN MAIN
32242
PO BOX 2
32243
PO BOX 252
32244
PO BOX 447
32245
PO BOX 121 STN MAIN
32246
PO BOX 518
32247
PO BOX 82
32248
PO BOX 3185
32249
PO BOX 468
32250
PO BOX 26
32251
CP 1963
32252
PO BOX 475
32253
PO BOX 33
32254
PO BOX 254
32255
PO BOX 277
32256
PO BOX 703
32257
1833 CONCEPTION BAY HWY
32258
PO BOX 733
32259
PO BOX 852
32260
PO BOX 41
32261
30 RTE PRINCIPALE O
32262
PO BOX 283
32263
1064 RUE PRINCIPALE
32264
PO BOX 3125
;

data want;
set have;
if prxmatch("m/^P\s*O\s+BOX \d+/oi",val) > 0 then postal ="English";
else if prxmatch("m/^C\s*P\s+\d+/oi",val) > 0 then postal = "French";
else postal = "unknown";
run;

View solution in original post


All Replies
PROC Star
Posts: 549

Re: Search CHAR string for PO BOX variation (PERL)

if you can give some test data of your input and the output you want in form a data Step. It will be an easy task for someone to help.

Frequent Contributor
Posts: 128

Re: Search CHAR string for PO BOX variation (PERL)

Frequent Contributor
Posts: 128

Re: Search CHAR string for PO BOX variation (PERL)

[ Edited ]

hi kiranv_, i've attached sample output using following code: 

ods output sql_results=a1 (drop=prd_cd);
proc sql NUMBER;  
select *
from  lib.addresstable (keep=bill_addr_line1 prd_cd) where prd_cd='WANT'   
AND 
(prxmatch ("m/^CP|\bCP\b|\bC\bP\b|\dCP\b|\dC\bP|\dC\bP\b|\bCP\d|\bC\bP\d| CP |C.P.|^BOX|^B0X|\bBOX\b|\bB\bO\bX|\bB\bOX\b|\dBOX\b|BOX\b\d|BOX \d| BOX | BOX|BOX\d|\bB0X\b|\bB\b0\Bx|\bB\b0\bX\b|B0X|POBOX|\bPO\b|\bP \bO\b/oi",BILL_ADDR_LINE1)> 0 ) 
;quit;

data b; set a1 (obs=40000);run;

Problem with output:

-It has many false positives (row 15,199 COPPERTREE CRT; row  111,169 RUE PRINCIPALE) and 

-It could potentially miss some addresses such as C.P*452 (or some syntactic variation I haven't considered.

 

For now, I'll proceed to tinker with the code.... Was just curious if someone ever ran a query over very large volume and arrived with an acceptable/low error rate (maybe using machine learning?).

Solution
2 weeks ago
PROC Star
Posts: 549

Re: Search CHAR string for PO BOX variation (PERL)

the answer is pretty near to @ChrisNZ. with sample you have provided this should work

data have;
informat val $20.;
input id val & $ @@;
datalines;
32237
PO BOX 133
32238
1046 RUE PRINCIPALE
32239
CP 2732
32240
PO BOX 515
32241
PO BOX 8565 STN MAIN
32242
PO BOX 2
32243
PO BOX 252
32244
PO BOX 447
32245
PO BOX 121 STN MAIN
32246
PO BOX 518
32247
PO BOX 82
32248
PO BOX 3185
32249
PO BOX 468
32250
PO BOX 26
32251
CP 1963
32252
PO BOX 475
32253
PO BOX 33
32254
PO BOX 254
32255
PO BOX 277
32256
PO BOX 703
32257
1833 CONCEPTION BAY HWY
32258
PO BOX 733
32259
PO BOX 852
32260
PO BOX 41
32261
30 RTE PRINCIPALE O
32262
PO BOX 283
32263
1064 RUE PRINCIPALE
32264
PO BOX 3125
;

data want;
set have;
if prxmatch("m/^P\s*O\s+BOX \d+/oi",val) > 0 then postal ="English";
else if prxmatch("m/^C\s*P\s+\d+/oi",val) > 0 then postal = "French";
else postal = "unknown";
run;
Frequent Contributor
Posts: 128

Re: Search CHAR string for PO BOX variation (PERL)

Thanks to you, and ChrisNZ. I will study this...
Esteemed Advisor
Posts: 5,628

Re: Search CHAR string for PO BOX variation (PERL)

It would be preferable to first identify the region and/or language of the address and then use the corresponding conventions to decipher the text. 

PG
Frequent Contributor
Posts: 128

Re: Search CHAR string for PO BOX variation (PERL)

Country is bilingual English and French.
For now, only looking to capture when any variation of 'POBOX' (when customer is English speaker)
or 'CP' such (when customer is French speaker) from a table where these records are mixed.

I will reply to kiranv_ above with more examples.
Super User
Posts: 2,519

Re: Search CHAR string for PO BOX variation (PERL)

syntax to state if P and O are separated by zero, one or two spaces

Maybe this  \bP\s*O\b  or  \bP\s{0,2}O\b

 

of any content 

Maybe this   \bP[^A-Za-z]*?O\b  ( ? makes a lazy pattern match, any "content" but letters)

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 126 views
  • 2 likes
  • 4 in conversation