I am new to using Regex expressions using SAS and trying to parse out the a phone number from a string of text. I read through this document and also this one. This is what I have so far, and I'm not getting the result I'm looking for:
DATA in_string;
INPUT @1 STRING $50.;
DATALINES;
Here is a phone number 111-123-4567
other type of number:TT44679
111-234-6583
number is (231)-390-5710
this has two: 123-523-4545 222-333-4444
;
data out_string ;
set in_string;
if _n_ = 1 then do;
phone_prx= prxparse("/ \(?(\d\d\d).(\d\d\d).(\d\d\d\d)/");
end;
retain phone_prx ;
pos_phone= prxmatch(phone_prx,string);
call prxposn(phone_prx, 1, areacode_pos);
call prxposn(phone_prx, 2, phone1_pos);
call prxposn(phone_prx, 3, phone2_pos);
length phonenumber $15;
phonenumber= substrn(string, areacode_pos, 3) || "-" ||
substrn(string, phone1_pos, 3) || "-" ||
substrn(string,phone2_pos, 4);
run;
The output I would like is:
obs phonenumber
1 111-123-4567
2 111-234-6583
3 (231)-390-5710
4 123-523-4545 222-333-4444
**I've tried editing the text to make the code more clear, but some spacing isn't working.
DATA in_string;
INPUT STRING $50.;
DATALINES;
Here is a phone number 111-123-4567
other type of number:TT44679111-234-6583
number is (231)-390-5710
this has two: 123-523-4545 222-333-4444
;
data want;
set in_string;
obs+1;
pid=prxparse('/\(?\d{3}\)?-\d{3}-\d{4}/');
s=1;e=length(string);
call prxnext(pid,s,e,string,p,l);
do while(p>0);
want=substr(string,p,l);
output;
call prxnext(pid,s,e,string,p,l);
end;
drop s e p l pid;
run;
DATA in_string;
INPUT STRING $50.;
DATALINES;
Here is a phone number 111-123-4567
other type of number:TT44679111-234-6583
number is (231)-390-5710
this has two: 123-523-4545 222-333-4444
;
data want;
set in_string;
obs+1;
pid=prxparse('/\(?\d{3}\)?-\d{3}-\d{4}/');
s=1;e=length(string);
call prxnext(pid,s,e,string,p,l);
do while(p>0);
want=substr(string,p,l);
output;
call prxnext(pid,s,e,string,p,l);
end;
drop s e p l pid;
run;
Extensions? Do you want an extension if found?
Country codes or other country dialing strings?
Other separators than dash or none at all? Or Dash and space like (231) 390-5710
Country codes and extensions shouldn't arise, but I guess I would want to be as flexible as possible when it comes to formats. Parenthesis, spaces/no spaces, dashes, could all be used. How is it best to approach casting a wide net to find the phone numbers?
@MB_Analyst wrote:
Country codes and extensions shouldn't arise, but I guess I would want to be as flexible as possible when it comes to formats. Parenthesis, spaces/no spaces, dashes, could all be used. How is it best to approach casting a wide net to find the phone numbers?
Haven't really got a suggestion on specific issues other than possibly sequential processing of different rules to find different candidates.
I brought up some issues that appeared in phone number fields from when I worked with a company where we needed to parse numbers to create calling strings for CATI software. Even though the data came from one company the way their staff recorded things could vary quite a bit and these are some of the simpler cases.
I didn't for instance bring up the "or" such as 555-123-4567/8855 where either of the last 4 digits could be used.
Extensions could well be of concern if folks are entering them in a form of 555-123-4567-1 (or more digits in the last place) or with one of the other delimiters.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.