- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I have 2 fields ,'Tel No' and 'Mobile Phone Number',
in which, Mobile phone number might exist in Tel no field.
values might contain()*-+/ and '0-9' ,values may have prefix like zone ,country code.
there's possibility that customer input 2 Mobile phone numbers altogether as one field value.
General speaking, there's all possibility for the input for MP.
Criteria for selection,
as a valid MP number, it should start with
'133','139','150' etc(about 20 numberic prefix)
with a total length of the mp number equals 11,say 13333697005.
I first
(SUBSTRn(t1.'Tel No'n, 1,11)) AS tel1left,(SUBSTRN(t1.'Tel No'n, LENGTH(t1.'Tel No'n)-11, 11)) AS tel1right,
(SUBSTRn(t1.'Mobile/Pager No'n, 1,11)) AS MPleft,
(SUBSTRN(t1.'Mobile/Pager No'n, LENGTH(t1.'Mobile/Pager No'n)-11, 11)) AS MPright
pick out those 4 fields.
then
length((COMPRESS(t1.tel1left, '()-/,*&+'))) AS t1leftcomlen,
length((COMPRESS(t1.tel1right, '()-/,*&+'))) AS t1rightcomlen,
length((COMPRESS(t1.MPleft, '()-/,*&+'))) AS mpleftcomlen,
length((COMPRESS(t1.MPright, '()-/,*&+'))) AS mprightcomlen
find those compressed length=11
and I have a data MP_prefix.
can any one give hand how to pick out valid Mobile phone through the above 4 fields if exist?
if there's 2 valid, only the first found will be needed.
Thanks.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data foo;
infile cards truncover;
input id (tel mp) (:$31.);
cards;
1 8529634332 85298889175
2 . 85298842876
3 +88028852469/+852-23844511 +88-01713015360/+86-13747881289
4 008529016918 13348808380
5 13916588388 13816884388
6 . 13901293668
7 008602162139806 8615821668852
;
run;
data bar;
retain _pid;
if _n_=1 then _pid=prxparse('/1(?:3\d|4[57]|5[0-35-9]|8[0235-9])\d{8}/');
set foo;
_tels=catx('/',of tel mp);
_start=1;
_stop=length(_tels);
call prxnext(_pid,_start,_stop,_tels,_pos,_len);
if _pos=0 then output;
do while(_pos>0);
want=compress(substr(_tels,_pos,_len));
output;
call prxnext(_pid,_start,_stop,_tels,_pos,_len);
end;
drop _:;
run;
proc sort data=bar nodupkey; by id want; run;
proc print data=bar noobs; run;
id | tel | mp | want |
---|---|---|---|
1 | 8529634332 | 85298889175 | |
2 | 85298842876 | ||
3 | +88028852469/+852-23844511 | +88-01713015360/+86-13747881289 | 13747881289 |
4 | 008529016918 | 13348808380 | 13348808380 |
5 | 13916588388 | 13816884388 | 13816884388 |
5 | 13916588388 | 13816884388 | 13916588388 |
6 | 13901293668 | 13901293668 | |
7 | 008602162139806 | 8615821668852 | 15821668852 |
Please note that there are many cases this will not necessarily work for but given you example data all cases are met. I noticed that you gave an example with two valid mobile phones (id 5) so I wrote the code to return both of these numbers rather than just the one in your example output. This can be changed if you want.
The cases that will be missed are those where the 11 digits you want are separated somehow, such as: 133-4880-8380 you can choose to handle this in a few ways but can open yourself up to issues depending on your exact choice.
if you choose to compress the characters ()/-+ for example then if you take the numbers from id 3:
+88-01713015360/+86-13747881289 will change to -> 88017130153608613747881289 and you will find a erronious valid number which I bolded.
If numbers are always separated by a / when multiple numbers are in a single field not compressing the / character would correct this issue,
but as I'm sure you can tell there are plenty of other subtle ways that these types of operations can go wrong.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Valid phone number and valid mobile phone number are two very different concepts. What is the ultimate goal here? It is easiest to provide a small sample of input you expect and the output you would want to see. What country or countries will these numbers be for? Should only number for a certain list of countries be considered valid? What is the significance of the prefix selection you mention. One approach to validate phone numbers is to use regular expressions such as the following:
^\d\d\d([-.\s]?)\d\d\d\1\d\d\d\d$|^(:?(:?\(\d\d\d\))?\s*\d\d)?\d[-.\s]?\d\d\d\d$
the above will not work for every case:
Currently marks as valid, phone numbers of the following forms: (734) 555 1212 (734) 555.1212 (734) 555-1212 (734) 5551212 (734)5551212 734 555 1212 734.555.1212 734-555-1212 7345551212 555 1212 555.1212 555-1212 5551212 5 1212 5.1212 5-1212 51212 Currently marks as invalid, phone numbers of the following forms: 734-555.1212 734-5551212
I borrowed this regular expression from a Perl module called Number::Phone::US (http://search.cpan.org/~kennedyh/Number-Phone-US-1.5/lib/Number/Phone/US.pm) by Hugh Kennedy
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
FriedEgg,
Thank you for giving me instructions on US formate MP.
But it is an Chinese MP problem.
What valid format in china
is prefix 133,135,136 etc (about 20 prefix)
+a serial number with length=8
a total 11 numbers constitutes a valid number.
say 13333697005
what's on hand is a valid prefix data table MP_prefix
and a data set customer_MP with 4 calculated fields
t1leftcomlen (which means extracted from tel no with compressed left 11 characters), t1rightcomlen
(which means extracted from tel no with compressed right 11 characters), , mpleftcomlen, mprightcomlen.
I ask if anyone can help to give a solution based on the above resources.
Thanks.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It would still be ideal if you can provide a small selection of sample data for what you have in 'tel no'n and 'mobile/pager no'n and what you want as output from the process. This gives everyone the best opportunity to see what exactly you are dealing with and how to best aide you with a solution. The fact that we are dealing with validating Chinese Mobile Phone numbers is helpful information as now I know the proper numbering system to reference and I can see why you are denoting the prefix selection which helps identify a mobile phone. Also to ignore international numbers outside of China, will you also want to remove phone numbers from Hong Kong, Macau, etc...?
http://en.wikipedia.org/wiki/Telephone_numbers_in_China
In mainland China, mobile phone numbers have 11 digits in the format 1xx-xxxx-xxxx. The first three digits of mobile phone numbers (13x, 15x and 18x) designate the mobile phone service provider and the next four digits is a regional code and the last four digits are assigned by the mobile service provider as part of the customer ID.
As the numbers were introduced over time, it is possible to recognize the age of a number: The oldest GSM numbers start with 1390..., the second oldest 1380... and 1300... Keeping the same number over time is somewhat associated with stability and reliability of the owner. The fourth digit was introduced later and is 0 for all old numbers. In a further extension, non-139,138,130 numbers were introduced. The fifth to seventh digit again relate to age and location.
Even earlier, before GSM age, mobile phones had numbers starting with 9. Those numbers were eventually translated into 1390xx9..., where xx were local identifiers.
Mobile service providers can be identified by the first three or four digits as follows:
Prefix | Provider | Network |
---|---|---|
130/1/2 | China Unicom | GSM |
133 | China Unicom | CDMA |
1340-1348 | China Mobile | GSM |
1349 | ChinaSat | Satellite |
135/6/7/8/9 | China Mobile | GSM |
145 | China Unicom | WCDMA (Data-plans only) |
147 | China Mobile | TD-SCDMA (Data-plans only) |
150 | China Mobile | GSM |
151 | China Mobile | GSM |
152 | China Mobile | GSM |
153 | China Telecom | CDMA |
155 | China Unicom | GSM |
156 | China Unicom | GSM (upgradeable to WCDMA) |
157 | China Mobile | TD-SCDMA |
158 | China Mobile | GSM |
159 | China Mobile | GSM |
180 | China Telecom | CDMA |
182 | China Mobile | GSM |
185 | China Unicom | WCDMA |
186 | China Unicom | WCDMA |
187 | China Mobile | GSM |
188 | China Mobile | TD-SCDMA |
189 | China Telecom | CDMA |
I would try the following regular expression as a starting point for your work:
prxmatch('/1(?:3\d|4[57]|5[0-35-9]|8[0235-9])\d{8}/o',compress('tel no'n,,'dk'))>0
or prxmatch('/1(?:3\d|4[57]|5[0-35-9]|8[0235-9])\d{8}/o',compress('mobile/pager no'n,,'dk'))>0
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Dear FriedEgg ,
I'm so appreciated the level you absorbed in, very very thanks.
Tel no | MP | Wanted Output |
8529634332 | 85298889175 | |
85298842876 | ||
+88028852469/+852-23844511 | +88-01713015360/+86-13747881289 | 13747881289 |
008529016918 | 13348808380 | 13348808380 |
13916588388 | 13816884388 | 13916588388 |
13901293668 | 13901293668 | |
008602162139806 | 8615821668852 | 15821668852 |
that's the possible input, and wanted.
I just hesitate whether to put it on because of sensetive data.
now i modify each number.
I am now writing
(case when t1.mpleftcomlen/*the left part of MP field (11 length) then compressed with no *()/-+*/=11 and (SUBSTRn(MPleft/*left part of mp field 11 length*/, 1,3)) in ('133','153','180','189','130','131','132','155','156','185','186','134','135','136','137','138','139','147','150','151','152','157','158','159','182','187','188') then 1
else 0 end) as mpleftflag.
thank you for your helpful attitude.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data foo;
infile cards truncover;
input id (tel mp) (:$31.);
cards;
1 8529634332 85298889175
2 . 85298842876
3 +88028852469/+852-23844511 +88-01713015360/+86-13747881289
4 008529016918 13348808380
5 13916588388 13816884388
6 . 13901293668
7 008602162139806 8615821668852
;
run;
data bar;
retain _pid;
if _n_=1 then _pid=prxparse('/1(?:3\d|4[57]|5[0-35-9]|8[0235-9])\d{8}/');
set foo;
_tels=catx('/',of tel mp);
_start=1;
_stop=length(_tels);
call prxnext(_pid,_start,_stop,_tels,_pos,_len);
if _pos=0 then output;
do while(_pos>0);
want=compress(substr(_tels,_pos,_len));
output;
call prxnext(_pid,_start,_stop,_tels,_pos,_len);
end;
drop _:;
run;
proc sort data=bar nodupkey; by id want; run;
proc print data=bar noobs; run;
id | tel | mp | want |
---|---|---|---|
1 | 8529634332 | 85298889175 | |
2 | 85298842876 | ||
3 | +88028852469/+852-23844511 | +88-01713015360/+86-13747881289 | 13747881289 |
4 | 008529016918 | 13348808380 | 13348808380 |
5 | 13916588388 | 13816884388 | 13816884388 |
5 | 13916588388 | 13816884388 | 13916588388 |
6 | 13901293668 | 13901293668 | |
7 | 008602162139806 | 8615821668852 | 15821668852 |
Please note that there are many cases this will not necessarily work for but given you example data all cases are met. I noticed that you gave an example with two valid mobile phones (id 5) so I wrote the code to return both of these numbers rather than just the one in your example output. This can be changed if you want.
The cases that will be missed are those where the 11 digits you want are separated somehow, such as: 133-4880-8380 you can choose to handle this in a few ways but can open yourself up to issues depending on your exact choice.
if you choose to compress the characters ()/-+ for example then if you take the numbers from id 3:
+88-01713015360/+86-13747881289 will change to -> 88017130153608613747881289 and you will find a erronious valid number which I bolded.
If numbers are always separated by a / when multiple numbers are in a single field not compressing the / character would correct this issue,
but as I'm sure you can tell there are plenty of other subtle ways that these types of operations can go wrong.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for your answer and paitience. As a newbie I may spending half day digesting the info..