BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
bbb_NG
Fluorite | Level 6

Hi,

I have 2 fields ,'Tel No' and 'Mobile Phone Number',

in which, Mobile phone number might exist in  Tel no field.

values might contain()*-+/ and '0-9' ,values may have prefix like zone ,country code.

there's possibility that customer input 2 Mobile phone numbers altogether as one field value.

General speaking, there's all possibility for the input for MP.

Criteria for selection,

as a valid MP number, it should start with

'133','139','150' etc(about 20 numberic prefix)

with a total length of the mp number equals 11,say 13333697005.

I first

          (SUBSTRn(t1.'Tel No'n, 1,11)) AS tel1left,
            (SUBSTRN(t1.'Tel No'n, LENGTH(t1.'Tel No'n)-11, 11)) AS tel1right,

    (SUBSTRn(t1.'Mobile/Pager No'n, 1,11)) AS MPleft,
            (SUBSTRN(t1.'Mobile/Pager No'n, LENGTH(t1.'Mobile/Pager No'n)-11, 11)) AS MPright

pick out those 4 fields.

then

    length((COMPRESS(t1.tel1left, '()-/,*&+'))) AS t1leftcomlen,

   length((COMPRESS(t1.tel1right, '()-/,*&+'))) AS t1rightcomlen,

   length((COMPRESS(t1.MPleft, '()-/,*&+'))) AS mpleftcomlen,

   length((COMPRESS(t1.MPright, '()-/,*&+'))) AS mprightcomlen

find those compressed length=11

and I have a data MP_prefix.

can any one give hand how to pick out valid Mobile phone through the above 4 fields if exist?

if there's 2 valid, only the first found will be needed.

Thanks.

1 ACCEPTED SOLUTION

Accepted Solutions
FriedEgg
SAS Employee

data foo;

infile cards truncover;

input id (tel mp) (:$31.);

cards;

1 8529634332 85298889175

2 . 85298842876

3 +88028852469/+852-23844511 +88-01713015360/+86-13747881289

4 008529016918 13348808380

5 13916588388 13816884388

6 . 13901293668

7 008602162139806 8615821668852

;

run;

data bar;

retain _pid;

if _n_=1 then _pid=prxparse('/1(?:3\d|4[57]|5[0-35-9]|8[0235-9])\d{8}/');

set foo;

_tels=catx('/',of tel mp);

_start=1;

_stop=length(_tels);

call prxnext(_pid,_start,_stop,_tels,_pos,_len);

  if _pos=0 then output;

  do while(_pos>0);

   want=compress(substr(_tels,_pos,_len));

   output;

   call prxnext(_pid,_start,_stop,_tels,_pos,_len);

  end;

drop _:;

run;

proc sort data=bar nodupkey; by id want; run;

proc print data=bar noobs; run;

idtelmpwant
1852963433285298889175
2 85298842876
3+88028852469/+852-23844511+88-01713015360/+86-1374788128913747881289
40085290169181334880838013348808380
5139165883881381688438813816884388
5139165883881381688438813916588388
6 1390129366813901293668
7008602162139806861582166885215821668852

Please note that there are many cases this will not necessarily work for but given you example data all cases are met.  I noticed that you gave an example with two valid mobile phones (id 5) so I wrote the code to return both of these numbers rather than just the one in your example output.  This can be changed if you want.

The cases that will be missed are those where the 11 digits you want are separated somehow, such as: 133-4880-8380 you can choose to handle this in a few ways but can open yourself up to issues depending on your exact choice. 

if you choose to compress the characters ()/-+ for example then if you take the numbers from id 3:

+88-01713015360/+86-13747881289 will change to -> 88017130153608613747881289 and you will find a erronious valid number which I bolded.

If numbers are always separated by a / when multiple numbers are in a single field not compressing the / character would correct this issue,

but as I'm sure you can tell there are plenty of other subtle ways that these types of operations can go wrong.

View solution in original post

6 REPLIES 6
FriedEgg
SAS Employee

Valid phone number and valid mobile phone number are two very different concepts.  What is the ultimate goal here?  It is easiest to provide a small sample of input you expect and the output you would want to see.  What country or countries will these numbers be for?  Should only number for a certain list of countries be considered valid?  What is the significance of the prefix selection you mention.  One approach to validate phone numbers is to use regular expressions such as the following:

^\d\d\d([-.\s]?)\d\d\d\1\d\d\d\d$|^(:?(:?\(\d\d\d\))?\s*\d\d)?\d[-.\s]?\d\d\d\d$

the above will not work for every case:

Currently marks as valid, phone numbers of the following forms:      
(734) 555 1212      
(734) 555.1212      
(734) 555-1212      
(734) 5551212      
(734)5551212      
734 555 1212     
734.555.1212      
734-555-1212      
7345551212      
555 1212      
555.1212      
555-1212      
5551212      
5 1212      
5.1212      
5-1212      
51212

Currently marks as invalid, phone numbers of the following forms:         
734-555.1212         
734-5551212

I borrowed this regular expression from a Perl module called Number::Phone::US (http://search.cpan.org/~kennedyh/Number-Phone-US-1.5/lib/Number/Phone/US.pm) by Hugh Kennedy

bbb_NG
Fluorite | Level 6

FriedEgg,

Thank you for giving me instructions on US formate MP.

But it is an Chinese MP problem.

What valid format in china

is prefix 133,135,136 etc (about 20 prefix)

+a serial number with length=8

a total 11 numbers constitutes a valid number.

say 13333697005

what's on hand is a valid prefix data table MP_prefix

and a data set customer_MP with 4 calculated fields

t1leftcomlen (which means extracted from tel no with compressed left 11 characters),    t1rightcomlen

(which means extracted from tel no with compressed right 11 characters),   ,    mpleftcomlen,    mprightcomlen.

I ask if anyone can help to give a solution based on the above resources.

Thanks.

FriedEgg
SAS Employee

It would still be ideal if you can provide a small selection of sample data for what you have in 'tel no'n and 'mobile/pager no'n and what you want as output from the process.  This gives everyone the best opportunity to see what exactly you are dealing with and how to best aide you with a solution.  The fact that we are dealing with validating Chinese Mobile Phone numbers is helpful information as now I know the proper numbering system to reference and I can see why you are denoting the prefix selection which helps identify a mobile phone.  Also to ignore international numbers outside of China, will you also want to remove phone numbers from Hong Kong, Macau, etc...?

http://en.wikipedia.org/wiki/Telephone_numbers_in_China

In mainland China, mobile phone numbers have 11 digits in the format 1xx-xxxx-xxxx. The first three digits of mobile phone numbers (13x, 15x and 18x) designate the mobile phone service provider and the next four digits is a regional code and the last four digits are assigned by the mobile service provider as part of the customer ID.

As the numbers were introduced over time, it is possible to recognize the age of a number: The oldest GSM numbers start with 1390..., the second oldest 1380... and 1300... Keeping the same number over time is somewhat associated with stability and reliability of the owner. The fourth digit was introduced later and is 0 for all old numbers. In a further extension, non-139,138,130 numbers were introduced. The fifth to seventh digit again relate to age and location.

Even earlier, before GSM age, mobile phones had numbers starting with 9. Those numbers were eventually translated into 1390xx9..., where xx were local identifiers.

Mobile service providers can be identified by the first three or four digits as follows:

PrefixProviderNetwork
130/1/2China UnicomGSM
133China UnicomCDMA
1340-1348China MobileGSM
1349ChinaSatSatellite
135/6/7/8/9China MobileGSM
145China UnicomWCDMA (Data-plans only)
147China MobileTD-SCDMA (Data-plans only)
150China MobileGSM
151China MobileGSM
152China MobileGSM
153China TelecomCDMA
155China UnicomGSM
156China UnicomGSM (upgradeable to WCDMA)
157China MobileTD-SCDMA
158China MobileGSM
159China MobileGSM
180China TelecomCDMA
182China MobileGSM
185China UnicomWCDMA
186China UnicomWCDMA
187China MobileGSM
188China MobileTD-SCDMA
189China TelecomCDMA

I would try the following regular expression as a starting point for your work:

prxmatch('/1(?:3\d|4[57]|5[0-35-9]|8[0235-9])\d{8}/o',compress('tel no'n,,'dk'))>0

or prxmatch('/1(?:3\d|4[57]|5[0-35-9]|8[0235-9])\d{8}/o',compress('mobile/pager no'n,,'dk'))>0

bbb_NG
Fluorite | Level 6

Dear FriedEgg ,

I'm so appreciated the level you absorbed in, very very thanks.

Tel noMPWanted Output
852963433285298889175 
 85298842876 
+88028852469/+852-23844511+88-01713015360/+86-1374788128913747881289
0085290169181334880838013348808380
139165883881381688438813916588388
 1390129366813901293668
008602162139806861582166885215821668852


that's the possible input, and wanted.

I just hesitate whether to put it on because of sensetive data.

now i modify each number.

I am now writing

(case when t1.mpleftcomlen/*the left part of MP field (11 length) then compressed with no *()/-+*/=11 and (SUBSTRn(MPleft/*left part of mp field 11 length*/, 1,3)) in ('133','153','180','189','130','131','132','155','156','185','186','134','135','136','137','138','139','147','150','151','152','157','158','159','182','187','188') then 1

else 0 end) as mpleftflag.

thank you for your helpful attitude.

FriedEgg
SAS Employee

data foo;

infile cards truncover;

input id (tel mp) (:$31.);

cards;

1 8529634332 85298889175

2 . 85298842876

3 +88028852469/+852-23844511 +88-01713015360/+86-13747881289

4 008529016918 13348808380

5 13916588388 13816884388

6 . 13901293668

7 008602162139806 8615821668852

;

run;

data bar;

retain _pid;

if _n_=1 then _pid=prxparse('/1(?:3\d|4[57]|5[0-35-9]|8[0235-9])\d{8}/');

set foo;

_tels=catx('/',of tel mp);

_start=1;

_stop=length(_tels);

call prxnext(_pid,_start,_stop,_tels,_pos,_len);

  if _pos=0 then output;

  do while(_pos>0);

   want=compress(substr(_tels,_pos,_len));

   output;

   call prxnext(_pid,_start,_stop,_tels,_pos,_len);

  end;

drop _:;

run;

proc sort data=bar nodupkey; by id want; run;

proc print data=bar noobs; run;

idtelmpwant
1852963433285298889175
2 85298842876
3+88028852469/+852-23844511+88-01713015360/+86-1374788128913747881289
40085290169181334880838013348808380
5139165883881381688438813816884388
5139165883881381688438813916588388
6 1390129366813901293668
7008602162139806861582166885215821668852

Please note that there are many cases this will not necessarily work for but given you example data all cases are met.  I noticed that you gave an example with two valid mobile phones (id 5) so I wrote the code to return both of these numbers rather than just the one in your example output.  This can be changed if you want.

The cases that will be missed are those where the 11 digits you want are separated somehow, such as: 133-4880-8380 you can choose to handle this in a few ways but can open yourself up to issues depending on your exact choice. 

if you choose to compress the characters ()/-+ for example then if you take the numbers from id 3:

+88-01713015360/+86-13747881289 will change to -> 88017130153608613747881289 and you will find a erronious valid number which I bolded.

If numbers are always separated by a / when multiple numbers are in a single field not compressing the / character would correct this issue,

but as I'm sure you can tell there are plenty of other subtle ways that these types of operations can go wrong.

bbb_NG
Fluorite | Level 6

FriedEgg,

Thanks for your answer and paitience. As a newbie I may spending half day digesting the info..

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 24231 views
  • 3 likes
  • 2 in conversation