Solved: Re: How to match the multi-byte characters in SAS?

Maplefin · Posted 08-11-2023 03:47 AM

As the title says, I'm wondering how to match multi-byte characters using Perl regular expressions. My SAS session encoding is UTF-8, and I need to match a series of multi-byte characters. Matching single-byte characters is well supported, but matching multi-byte characters does not seem to be. For example, matching all the printable ASCII characters works fine.

data _null_;
text="€123";
pos=prxmatch("/[\x20-\x7E]/",text);
put pos=;
run;

/*Results as below*/
pos=4;

But when it comes to multi-byte characters, things change.

data _null_;
   text='à';
   len=length(text);
   put len=;
   /* match latin small letter a with grave */
   pos1=prxmatch('/\x{C3A0}/', text);   /* 'à': U+00E0 */
   put pos1=;
   pos2=prxmatch('/\xC3A0/', text);     /* 'à': U+00E0 */
   put pos2=;
   pos3=prxmatch('/\xC3\xA0/', text);       /* 'à': U+00E0 */
   put pos3=;
   pos4=prxmatch('/\xC3\xA1/', text);       /* 'á': U+00E1, should not match */
   put pos4=;
run;

/*Results*/
len=2
pos1=2
pos2=0
pos3=1
pos4=0

The character I entered is a double-byte character, but I can't get the right position when I try to match it as a single two-byte unit. Are the Perl regular expression functions designed for single-byte characters only? If so, how can I match multi-byte characters? For example, I want to match all the characters in the range [U+4E00-U+9FA5] (UTF-8 bytes: E4B880-E9BEA5). How should I write the code?

JosvanderVelden · Posted 08-11-2023 05:07 AM

Have you seen this post: https://communities.sas.com/t5/SAS-Communities-Library/PRX-Functions-to-Support-Multibyte-Characters...?
What is the SAS version you are using?

Maplefin · Posted 08-11-2023 05:50 AM

Yes, I read the post and was inspired by it, so I tried the code in it, but I only got errors. My SAS version is 9.04.01M7P080520.

Patrick · Posted 08-11-2023 05:48 AM

Only SAS Viya supports multibyte for the prx... functions (starting from release 2021.1.6/LTS 2021.2).

Functions that handle multibyte data need to be I18N Level 2. You can find the level of each function here: Internationalization Compatibility for SAS String Functions

Maplefin · Posted 08-11-2023 05:55 AM

Does the latest version of Base SAS not support multibyte for PRX?

SASKiwi · Posted 08-11-2023 06:25 PM

Looks like you are on SAS 9.4M7. The latest maintenance release for this is M8, and that was essentially just security fixes. So no, I don't think upgrading to M8 will give you multibyte functionality.

Patrick · Posted 08-11-2023 06:45 PM

@Maplefin wrote:
Does the latest version of Base SAS not support multibyte for PRX?

Correct. Even the latest version of SAS 9.4 does not support multibyte for PRX.

Tom · Posted 08-12-2023 02:30 PM

You can search for multiple byte strings. You just cannot treat them as if they were ONE character.

So if you run this code in a SAS session that is using UTF-8 encoding it will properly find that the first normal ASCII character appears at position 4 in the string.

data _null_;
  text="€123";
  pos=prxmatch("/[\x20-\x7E]/",text);
  put pos= text= text=$hex.;
run;

pos=4 text=€123 text=E282AC313233

PS: If you want your code to be portable, do not put non-ASCII characters into it. If you want the Euro symbol, use something like:

  text='E282AC'x||"123";

Do the same thing when building a regular expression.

data _null_;
   text='C3A0'x||'A';  * In UTF-8 encoding that will be lowercase a with grave and uppercase A ;
   len=length(text);
   klen=klength(text);
   /* match lowercase a with grave */
   pos1=prxmatch('/'||'C3A0'x||'/', text); 
   /* match uppercase A */
   pos2=prxmatch('/A/', text); 
   put text=$quote.
     / text=$hex.
     / len=
     / klen=
     / pos1=
     / pos2=
   ;
run;

Result

Note: I posted a photo because if you copy the text, you get a demonstration of the problems that still exist when trying to use multi-byte characters.

text="Ã A"
text=C3A041
len=3
klen=2
pos1=1
pos2=3

In WLATIN1 encoding the C3 is Uppercase A with Tilde and the A0 is non-breaking space.
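
The same byte-level approach can, in principle, cover the range asked about in the original question. The following is only a sketch and does not come from any poster in the thread: it assumes a UTF-8 session and that PRX compares raw bytes (as the results above suggest), and it splits U+4E00-U+9FA5 into four alternatives of byte ranges. The variable names and the sample string are illustrative.

```sas
data _null_;
   /* '42' followed by the UTF-8 bytes of U+6C49 U+5B57, written in hex for portability */
   text = '42' || 'E6B189E5AD97'x;

   /* Byte-level alternatives that together cover U+4E00-U+9FA5 in UTF-8:  */
   /*   E4 B8-BF 80-BF      U+4E00-U+4FFF                                  */
   /*   E5-E8 80-BF 80-BF   U+5000-U+8FFF                                  */
   /*   E9 80-BD 80-BF      U+9000-U+9F7F                                  */
   /*   E9 BE 80-A5         U+9F80-U+9FA5                                  */
   rx = prxparse('/\xE4[\xB8-\xBF][\x80-\xBF]|[\xE5-\xE8][\x80-\xBF]{2}|\xE9[\x80-\xBD][\x80-\xBF]|\xE9\xBE[\x80-\xA5]/');

   pos = prxmatch(rx, text);   /* byte position of the first match, 0 if none */
   put pos= text=$hex.;
run;
```

The value returned is a byte position, not a character position, so it is suitable for SUBSTR but not for KSUBSTR. The pattern relies on the input being valid UTF-8, where the bytes E4-E9 only ever start a three-byte sequence.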

Maplefin · Posted 08-14-2023 01:56 AM

Thanks for your reply; I have my answer. The latest version of SAS does not support multibyte for the PRX functions. Maybe SAS Viya can do it, but I don't have a license, so I cannot try it there.

whymath · Posted 08-14-2023 02:13 AM

It seems you want to detect Chinese characters; maybe one of my posts could help: https://bbs.pinggu.org/thread-11289025-1-1.html

whymath · Posted 08-16-2023 11:49 PM

I have improved my solution; it is now a user-defined function! The function is named "anyhan", and you can use it like "anydigit" or "anyalpha" to search for the first Chinese character (basic set: U+4E00-9FA5).

proc fcmp outlib=work.funcs.char;
  %*Find the first Chinese character(Basic set: U+4E00-9FA5);
  function anyhan(string$);
    klen=klength(string);
    rst=0;
    do i=1 to klen until(rst^=0);
      kchar=ksubstr(string,i,1);
      if length(kchar)>1 then do;
        if '\u4E00'<=unicodec(kchar)<='\u9FA5' then rst=find(string,kchar,'t');
      end;
    end;
    return(rst);
  endfunc;
run;

data have;
  input text $42.;
  cards;
42anyhan
42任何汉字
42any汉字
42 +_@#$汉字
42龼龽龾龿鿀鿁汉字
أي كانجي
テストします
Путин
  ;
run;

option cmplib=work.funcs;
data want;
  set have;
  han_pos=anyhan(text);
run;

The result looks like:

Note: The result is the position of the first Chinese character in the source string; however, it is affected by the encoding of the SAS session.
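
Since the returned position depends on the session encoding, it may help to confirm the encoding before relying on the result. A minimal sketch using the SYSENCODING automatic macro variable and PROC OPTIONS; nothing here is specific to anyhan:

```sas
/* Write the session encoding to the log */
%put NOTE: Session encoding is &sysencoding;

/* Show the ENCODING system option as set for this session */
proc options option=encoding;
run;
```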

yabwon · Posted 08-17-2023 02:50 AM

How about making a pair? It's always better to have company 🙂

proc fcmp outlib=work.funcs.char;
  %*Find the first Chinese character(Basic set: U+4E00-9FA5);
  function anyhan(string$);
    klen=klength(string);
    do i=1 to klen;
      kchar=ksubstr(string,i,1);
      if length(kchar)>1 AND '\u4E00'<=unicodec(kchar)<='\u9FA5' then 
        return(find(string,kchar,'t'));
    end;
    return(0);
  endfunc;

  function kanyhan(string$);
    klen=klength(string);
    do i=1 to klen;
      kchar=ksubstr(string,i,1);
      if length(kchar)>1 AND '\u4E00'<=unicodec(kchar)<='\u9FA5' then 
        return(i);
    end;
    return(0);
  endfunc;

run;

data have;
  input text $42.;
  cards;
42anyhan
42任何汉字
42any汉字
42 +_@#$汉字
42龼龽龾龿鿀鿁汉字
أي كانجي
テストします
Путин
żółć42任何汉字
  ;
run;

option cmplib=work.funcs;
data want;
  set have;
  han_pos=anyhan(text);
  khan_pos=kanyhan(text);
run;   
proc print;
run;

Bart
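
The two functions differ only in what they return: anyhan gives a byte position (through FIND), while kanyhan gives a character position (the loop index). The distinction can be sketched with the built-in INDEX and KINDEX functions, assuming a UTF-8 session; the sample string is written in hex and is for illustration only:

```sas
data _null_;
  /* UTF-8 bytes of 'żółć42' followed by U+6C49, written in hex for portability */
  text = 'C5BCC3B3C582C4873432E6B189'x;
  han  = 'E6B189'x;                 /* U+6C49 */

  byte_pos = index(text, han);      /* counts bytes, like anyhan */
  char_pos = kindex(text, han);     /* counts characters, like kanyhan */
  put byte_pos= char_pos=;
run;
```

In a UTF-8 session the two values should differ, because each of the four leading Polish letters occupies two bytes. A byte position is the one to pass to SUBSTR; a character position is the one for KSUBSTR.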

whymath · Posted 08-18-2023 12:04 AM

Very thoughtful consideration, Thank you~

Maplefin · Posted 09-18-2023 03:18 AM

Thanks a lot! Very helpful and innovative solution.

Patrick · Posted 08-14-2023 02:15 AM

I believe you could get access to Viya via SAS OnDemand for Academics. There is also a category for Independent Learners.
