As the title says, I'm wondering how to match multi-byte characters using Perl regular expressions. My SAS session encoding is UTF-8, and I need to match a series of multi-byte characters. Single-byte character matching is well supported, but multi-byte character matching does not seem to be. For example, matching any printable ASCII character works fine:
data _null_;
text="€123";
pos=prxmatch("/[\x20-\x7E]/",text);
put pos=;
run;
/*Results as below*/
pos=4;
But when it comes to multi-byte characters, things change.
data _null_;
text='à';
len=length(text);
put len=;
/* match Latin small letter a with grave */
pos1=prxmatch('/\x{C3A0}/', text); /* 'à': U+00E0 */
put pos1=;
pos2=prxmatch('/\xC3A0/', text); /* 'à': U+00E0 */
put pos2=;
pos3=prxmatch('/\xC3\xA0/', text); /* 'à': U+00E0 */
put pos3=;
pos4=prxmatch('/\xC3\xA1/', text); /* 'á': U+00E1, not in text */
put pos4=;
run;
/*Results*/
len=2
pos1=2
pos2=0
pos3=1
pos4=0
The character I entered is a double-byte character, and I can't get the right position when I try to match it as one character. Are the Perl regular expression functions designed for single-byte characters only? If so, how can I match multi-byte characters? For example, I want to match all the characters in the range [U+4E00-U+9FA5] (UTF-8 range: E4B880-E9BEA5). How should I write the code?
@Maplefin wrote:
The latest version of SAS base does not support multibyte for PRX?
Correct. Also the latest version of SAS9.4 does not support multibyte for PRX.
Only SAS Viya supports multibyte characters in the PRX functions (starting with release 2021.1.6 / LTS 2021.2).
Functions that handle multibyte data need to be I18N Level 2. You can find the level for each function in the documentation page "Internationalization Compatibility for SAS String Functions".
It looks like you are on SAS 9.4M7. The latest maintenance release is M8, and that was essentially just security fixes. So no, I don't think upgrading to M8 will give you multibyte functionality.
You can search for multiple byte strings. You just cannot treat them as if they were ONE character.
So if you run this code in a SAS session that is using UTF-8 encoding it will properly find that the first normal ASCII character appears at position 4 in the string.
data _null_;
text="€123";
pos=prxmatch("/[\x20-\x7E]/",text);
put pos= text= text=$hex.;
run;

pos=4 text=€123 text=E282AC313233
PS If you want your code to be portable do not put non-ASCII characters into your code. If you want the Euro symbol then use something like:
text='E282AC'x||"123";
Do the same thing when building regular expressions.
data _null_;
text='C3A0'x||'A'; * In UTF-8 encoding that is lowercase a with grave followed by uppercase A ;
len=length(text);
klen=klength(text);
/* match lowercase a with grave */
pos1=prxmatch('/'||'C3A0'x||'/', text);
/* match uppercase A */
pos2=prxmatch('/A/', text);
put text=$quote.
/ text=$hex.
/ len=
/ klen=
/ pos1=
/ pos2=
;
run;
Result
Note I posted a photo because if you copy the text you get a demonstration of the problems that still exist with trying to use multiple byte characters.
text="Ã A" text=C3A041 len=3 klen=2 pos1=1 pos2=3
In WLATIN1 encoding the C3 is Uppercase A with Tilde and the A0 is non-breaking space.
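The same byte-level approach can be stretched to the range asked about in the question. What follows is only a sketch, assuming PRX compares byte by byte in a SAS 9.4 UTF-8 session (as the `\xC3\xA0` test above suggests); it has not been checked on Viya, where PRX is multibyte-aware and may behave differently. The range U+4E00-U+9FA5 (E4 B8 80 through E9 BE A5) is split into four byte-range alternatives, and the sample text is written as hex so the code stays pure ASCII.

```sas
data _null_;
  /* '42any' followed by U+6C49 and U+5B57, given as UTF-8 hex bytes */
  text = '42any' || 'E6B189E5AD97'x;
  /* Alternatives, in order:                         */
  /*   E4 B8-BF 80-BF   (U+4E00 - U+4FFF)            */
  /*   E5-E8 80-BF 80-BF (U+5000 - U+8FFF)           */
  /*   E9 80-BD 80-BF   (U+9000 - U+9F7F)            */
  /*   E9 BE 80-A5      (U+9F80 - U+9FA5)            */
  rx = prxparse('/\xE4[\xB8-\xBF][\x80-\xBF]|[\xE5-\xE8][\x80-\xBF]{2}|\xE9[\x80-\xBD][\x80-\xBF]|\xE9\xBE[\x80-\xA5]/');
  /* pos is a byte position, not a character position */
  pos = prxmatch(rx, text);
  put pos= text=$hex.;
run;
```

Every alternative starts with a lead byte (E4-E9), which can never be a continuation byte in valid UTF-8, so a match should not start in the middle of another character.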
I have improved my solution; it is now a user-defined function. The function is named "anyhan", and you can use it like "anydigit" or "anyalpha" to search for the first Chinese character (basic set: U+4E00-U+9FA5).
proc fcmp outlib=work.funcs.char;
%*Find the first Chinese character(Basic set: U+4E00-9FA5);
function anyhan(string$);
klen=klength(string);
rst=0;
do i=1 to klen until(rst^=0);
kchar=ksubstr(string,i,1);
if length(kchar)>1 then do;
if '\u4E00'<=unicodec(kchar)<='\u9FA5' then rst=find(string,kchar,'t');
end;
end;
return(rst);
endfunc;
run;
data have;
input text $42.;
cards;
42anyhan
42任何汉字
42any汉字
42 +_@#$汉字
42龼龽龾龿鿀鿁汉字
أي كانجي
テストします
Путин
;
run;
option cmplib=work.funcs;
data want;
set have;
han_pos=anyhan(text);
run;
The result looks like:
Note: The result value is the position of the first Chinese character in the source string; however, it is affected by the encoding of the SAS session.
How about making a pair? It's always better to have company 🙂
proc fcmp outlib=work.funcs.char;
%*Find the first Chinese character(Basic set: U+4E00-9FA5);
function anyhan(string$);
klen=klength(string);
do i=1 to klen;
kchar=ksubstr(string,i,1);
if length(kchar)>1 AND '\u4E00'<=unicodec(kchar)<='\u9FA5' then
return(find(string,kchar,'t'));
end;
return(0);
endfunc;
function kanyhan(string$);
klen=klength(string);
do i=1 to klen;
kchar=ksubstr(string,i,1);
if length(kchar)>1 AND '\u4E00'<=unicodec(kchar)<='\u9FA5' then
return(i);
end;
return(0);
endfunc;
run;
data have;
input text $42.;
cards;
42anyhan
42任何汉字
42any汉字
42 +_@#$汉字
42龼龽龾龿鿀鿁汉字
أي كانجي
テストします
Путин
żółć42任何汉字
;
run;
option cmplib=work.funcs;
data want;
set have;
han_pos=anyhan(text);
khan_pos=kanyhan(text);
run;
proc print;
run;
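As a rough illustration of why the pair is handy: the byte position from anyhan goes with SUBSTR, while the character position from kanyhan goes with KSUBSTR. The sketch below assumes the functions above are compiled and that `option cmplib=work.funcs;` is in effect; the dataset name `tails` and the variables `tail_b` and `tail_k` are made up for the example.

```sas
data tails;
  set have;
  length tail_b tail_k $42;
  bpos = anyhan(text);    /* byte position of the first Han character      */
  kpos = kanyhan(text);   /* character position of the first Han character */
  /* SUBSTR counts bytes, KSUBSTR counts characters */
  if bpos > 0 then tail_b = substr(text, bpos);
  if kpos > 0 then tail_k = ksubstr(text, kpos);
run;
```

In a UTF-8 session the two tails would be expected to agree; mixing the two up (a character position passed to SUBSTR, for instance) can cut a multi-byte character in half.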
Bart
Thanks a lot! Very helpful and innovative solution.
I believe you could get access to Viya via SAS OnDemand for Academics. There is also a category for Independent Learners.