- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I am new to using Regex expressions using SAS and trying to parse out the a phone number from a string of text. I read through this document and also this one. This is what I have so far, and I'm not getting the result I'm looking for:
DATA in_string;
INPUT @1 STRING $50.;
DATALINES;Here is a phone number 111-123-4567other type of number:TT44679111-234-6583number is (231)-390-5710this has two: 123-523-4545 222-333-4444
;
data out_string ;
set in_string;
if _n_ = 1 then do;
phone_prx= prxparse("/ \(?(\d\d\d).(\d\d\d).(\d\d\d\d)/");
end;
retain phone_prx ;
pos_phone= prxmatch(phone_prx,string);
call prxposn(phone_prx, 1, areacode_pos);
call prxposn(phone_prx, 2, phone1_pos);
call prxposn(phone_prx, 3, phone2_pos);
length phonenumber $15;
phonenumber= substrn(string, areacode_pos, 3) || "-" ||
substrn(string, phone1_pos, 3) || "-" ||
substrn(string,phone2_pos, 4);
run;
The output I would like is:
obs phonenumber
1 111-123-4567
2 111-234-6583
3 (231)-390-5710
4 123-523-4545 222-333-4444
**I've tried editing the text to make the code more clear, but some spacing isn't working.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
DATA in_string;
INPUT STRING $50.;
DATALINES;
Here is a phone number 111-123-4567
other type of number:TT44679111-234-6583
number is (231)-390-5710
this has two: 123-523-4545 222-333-4444
;
data want;
set in_string;
obs+1;
pid=prxparse('/\(?\d{3}\)?-\d{3}-\d{4}/');
s=1;e=length(string);
call prxnext(pid,s,e,string,p,l);
do while(p>0);
want=substr(string,p,l);
output;
call prxnext(pid,s,e,string,p,l);
end;
drop s e p l pid;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
DATA in_string;
INPUT STRING $50.;
DATALINES;
Here is a phone number 111-123-4567
other type of number:TT44679111-234-6583
number is (231)-390-5710
this has two: 123-523-4545 222-333-4444
;
data want;
set in_string;
obs+1;
pid=prxparse('/\(?\d{3}\)?-\d{3}-\d{4}/');
s=1;e=length(string);
call prxnext(pid,s,e,string,p,l);
do while(p>0);
want=substr(string,p,l);
output;
call prxnext(pid,s,e,string,p,l);
end;
drop s e p l pid;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Extensions? Do you want an extension if found?
Country codes or other country dialing strings?
Other separators than dash or none at all? Or Dash and space like (231) 390-5710
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Country codes and extensions shouldn't arise, but I guess I would want to be as flexible as possible when it comes to formats. Parenthesis, spaces/no spaces, dashes, could all be used. How is it best to approach casting a wide net to find the phone numbers?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@MB_Analyst wrote:
Country codes and extensions shouldn't arise, but I guess I would want to be as flexible as possible when it comes to formats. Parenthesis, spaces/no spaces, dashes, could all be used. How is it best to approach casting a wide net to find the phone numbers?
Haven't really got a suggestion on specific issues other than possibly sequential processing of different rules to find different candidates.
I brought up some issues that appeared in phone number fields from when I worked with a company where we needed to parse numbers to create calling strings for CATI software. Even though the data came from one company the way their staff recorded things could vary quite a bit and these are some of the simpler cases.
I didn't for instance bring up the "or" such as 555-123-4567/8855 where either of the last 4 digits could be used.
Extensions could well be of concern if folks are entering them in a form of 555-123-4567-1 (or more digits in the last place) or with one of the other delimiters.