BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
brulard
Pyrite | Level 9

hi,

I'm looking to build a query to capture records of addresses containing either PoBox (and variations of), or CP (French PoBox).

 

a) my approach is to use regex PERL, along the lines of 

 

(prxmatch ("m/^CP|\bCP\b|\bC\bP\b|\dCP\b|\dC\bP|\dC\bP\b|\bCP\d|\bC\bP\d| CP |C.P.|^BOX|^B0X|\bBOX\b|\bB\bO\bX|\bB\bOX\b|\dBOX\b|BOX\b\d|BOX \d| BOX | BOX|BOX\d|\bB0X\b|\bB\b0\Bx|\bB\b0\bX\b|B0X|POBOX|\bPO\b|\bP \bO\b/oi",ACCT_ADDR_LINE_1_DESC)> 0 )

(note that there is some redundancy that I need to review for in that script). If someone has a different approach to suggest, please advise?

 

b) And If someone is familiar with PERL, what is suggested syntax to state if say P and O are separated by zero, one or two lengths (of any content, such as to capture :

 

PO

P.O

P  O

P_O

P-O

P1.O

)

 

 

thank you in advance, i'll review suggestions/questions Aug 7th. 

1 ACCEPTED SOLUTION

Accepted Solutions
kiranv_
Rhodochrosite | Level 12

the answer is pretty near to @ChrisNZ. with sample you have provided this should work

data have;
informat val $20.;
input id val & $ @@;
datalines;
32237
PO BOX 133
32238
1046 RUE PRINCIPALE
32239
CP 2732
32240
PO BOX 515
32241
PO BOX 8565 STN MAIN
32242
PO BOX 2
32243
PO BOX 252
32244
PO BOX 447
32245
PO BOX 121 STN MAIN
32246
PO BOX 518
32247
PO BOX 82
32248
PO BOX 3185
32249
PO BOX 468
32250
PO BOX 26
32251
CP 1963
32252
PO BOX 475
32253
PO BOX 33
32254
PO BOX 254
32255
PO BOX 277
32256
PO BOX 703
32257
1833 CONCEPTION BAY HWY
32258
PO BOX 733
32259
PO BOX 852
32260
PO BOX 41
32261
30 RTE PRINCIPALE O
32262
PO BOX 283
32263
1064 RUE PRINCIPALE
32264
PO BOX 3125
;

data want;
set have;
if prxmatch("m/^P\s*O\s+BOX \d+/oi",val) > 0 then postal ="English";
else if prxmatch("m/^C\s*P\s+\d+/oi",val) > 0 then postal = "French";
else postal = "unknown";
run;

View solution in original post

8 REPLIES 8
kiranv_
Rhodochrosite | Level 12

if you can give some test data of your input and the output you want in form a data Step. It will be an easy task for someone to help.

brulard
Pyrite | Level 9

hi kiranv_, i've attached sample output using following code: 

ods output sql_results=a1 (drop=prd_cd);
proc sql NUMBER;  
select *
from  lib.addresstable (keep=bill_addr_line1 prd_cd) where prd_cd='WANT'   
AND 
(prxmatch ("m/^CP|\bCP\b|\bC\bP\b|\dCP\b|\dC\bP|\dC\bP\b|\bCP\d|\bC\bP\d| CP |C.P.|^BOX|^B0X|\bBOX\b|\bB\bO\bX|\bB\bOX\b|\dBOX\b|BOX\b\d|BOX \d| BOX | BOX|BOX\d|\bB0X\b|\bB\b0\Bx|\bB\b0\bX\b|B0X|POBOX|\bPO\b|\bP \bO\b/oi",BILL_ADDR_LINE1)> 0 ) 
;quit;

data b; set a1 (obs=40000);run;

Problem with output:

-It has many false positives (row 15,199 COPPERTREE CRT; row  111,169 RUE PRINCIPALE) and 

-It could potentially miss some addresses such as C.P*452 (or some syntactic variation I haven't considered.

 

For now, I'll proceed to tinker with the code.... Was just curious if someone ever ran a query over very large volume and arrived with an acceptable/low error rate (maybe using machine learning?).

kiranv_
Rhodochrosite | Level 12

the answer is pretty near to @ChrisNZ. with sample you have provided this should work

data have;
informat val $20.;
input id val & $ @@;
datalines;
32237
PO BOX 133
32238
1046 RUE PRINCIPALE
32239
CP 2732
32240
PO BOX 515
32241
PO BOX 8565 STN MAIN
32242
PO BOX 2
32243
PO BOX 252
32244
PO BOX 447
32245
PO BOX 121 STN MAIN
32246
PO BOX 518
32247
PO BOX 82
32248
PO BOX 3185
32249
PO BOX 468
32250
PO BOX 26
32251
CP 1963
32252
PO BOX 475
32253
PO BOX 33
32254
PO BOX 254
32255
PO BOX 277
32256
PO BOX 703
32257
1833 CONCEPTION BAY HWY
32258
PO BOX 733
32259
PO BOX 852
32260
PO BOX 41
32261
30 RTE PRINCIPALE O
32262
PO BOX 283
32263
1064 RUE PRINCIPALE
32264
PO BOX 3125
;

data want;
set have;
if prxmatch("m/^P\s*O\s+BOX \d+/oi",val) > 0 then postal ="English";
else if prxmatch("m/^C\s*P\s+\d+/oi",val) > 0 then postal = "French";
else postal = "unknown";
run;
brulard
Pyrite | Level 9
Thanks to you, and ChrisNZ. I will study this...
PGStats
Opal | Level 21

It would be preferable to first identify the region and/or language of the address and then use the corresponding conventions to decipher the text. 

PG
brulard
Pyrite | Level 9
Country is bilingual English and French.
For now, only looking to capture when any variation of 'POBOX' (when customer is English speaker)
or 'CP' such (when customer is French speaker) from a table where these records are mixed.

I will reply to kiranv_ above with more examples.
ChrisNZ
Tourmaline | Level 20

syntax to state if P and O are separated by zero, one or two spaces

Maybe this  \bP\s*O\b  or  \bP\s{0,2}O\b

 

of any content 

Maybe this   \bP[^A-Za-z]*?O\b  ( ? makes a lazy pattern match, any "content" but letters)

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 854 views
  • 2 likes
  • 4 in conversation