As the title says, I'm wondering how to match multi-byte characters with Perl regular expressions. My SAS session encoding is UTF-8, and I need to match a range of multi-byte characters. Matching single-byte characters is well supported, but matching multi-byte characters does not seem to be. For example, matching any printable ASCII character works fine:

data _null_;
text="€123";
pos=prxmatch("/[\x20-\x7E]/",text);
put pos=;
run;
/*Results as below*/
pos=4

But when it comes to multi-byte characters, things change:

data _null_;
text='à';
len=length(text);
put len=;
/* match Latin small letter a with grave */
pos1=prxmatch('/\x{C3A0}/', text); /* 'à': U+00E0 */
put pos1=;
pos2=prxmatch('/\xC3A0/', text); /* 'à': U+00E0 */
put pos2=;
pos3=prxmatch('/\xC3\xA0/', text); /* 'à': U+00E0 */
put pos3=;
pos4=prxmatch('/\xC3\xA1/', text); /* 'á': U+00E1 (bytes C3 A1), should not match 'à' */
put pos4=;
run;
/*Results*/
len=2
pos1=2
pos2=0
pos3=1
pos4=0

The character I entered is a two-byte character, and I can't get the right position when I try to match it as a single two-byte unit. Are the Perl regular expression functions designed for single-byte characters only? If so, how can I match multi-byte characters? For example, I want to match all characters in the range [U+4E00-U+9FA5] (UTF-8 bytes E4 B8 80 to E9 BE A5). How should I write the code?