Maplefin
Fluorite | Level 6

As the title says, I'm wondering how to match multi-byte characters with Perl regular expressions. My SAS session encoding is UTF-8, and I need to match a series of multi-byte characters. Matching single-byte characters is well supported, but matching multi-byte characters does not seem to be. For example, matching any printable ASCII character works fine:

data _null_;
text="€123";
pos=prxmatch("/[\x20-\x7E]/",text);
put pos=;
run;

/*Results as below*/
pos=4;

But with multi-byte characters, things change:

data _null_;
   text='à';
   len=length(text);
   put len=;
   /* match Latin small letter a with grave */
   pos1=prxmatch('/\x{C3A0}/', text);   /* 'à': U+00E0, UTF-8 bytes C3 A0 */
   put pos1=;
   pos2=prxmatch('/\xC3A0/', text);     /* 'à': U+00E0 */
   put pos2=;
   pos3=prxmatch('/\xC3\xA0/', text);       /* 'à': U+00E0 */
   put pos3=;
   pos4=prxmatch('/\xC3\xA1/', text);       /* 'á': U+00E1, not in text */
   put pos4=;
run;

/*Results*/
len=2 pos1=2 pos2=0 pos3=1 pos4=0

The character I entered is a double-byte character, but I can't get the right position when I try to match it as two bytes. Are the Perl regular expression functions designed for single-byte characters only? If so, how can I match multi-byte characters? For example, I want to match all the characters in the range [U+4E00-U+9FA5] (UTF-8 range: E4B880-E9BEA5). How should I write the code?

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

@Maplefin wrote:
The latest version of Base SAS does not support multibyte for PRX?

Correct. Even the latest version of SAS 9.4 does not support multibyte for PRX.

Maplefin
Fluorite | Level 6
Yeah, I read the post and was inspired by it, so I tried the code in it, but I got only errors. My SAS version is 9.04.01M7P080520.
Patrick
Opal | Level 21

Only SAS Viya supports multibyte for the PRX functions (starting with release 2021.1.6 / LTS 2021.2).

Functions that handle multibyte data need to be I18N Level 2. You can find the level for each function here: Internationalization Compatibility for SAS String Functions

 

Maplefin
Fluorite | Level 6
The latest version of Base SAS does not support multibyte for PRX?
SASKiwi
PROC Star

Looks like you are on SAS 9.4M7. The latest maintenance release for this is M8, and that was essentially just security fixes. So no, I don't think upgrading to M8 will give you multibyte functionality.

Tom
Super User

You can search for multiple byte strings.  You just cannot treat them as if they were ONE character.

So if you run this code in a SAS session that is using UTF-8 encoding it will properly find that the first normal ASCII character appears at position 4 in the string.

data _null_;
  text="€123";
  pos=prxmatch("/[\x20-\x7E]/",text);
  put pos= text= text=$hex.;
run;

pos=4 text=€123 text=E282AC313233

PS: If you want your code to be portable, do not put non-ASCII characters into your code. If you want the Euro symbol, use something like:

  text='E282AC'x||"123";

Do the same thing when building a regular expression.

data _null_;
   text='C3A0'x||'A';  * In UTF-8 encoding that will be lowercase a with grave and uppercase A ;
   len=length(text);
   klen=klength(text);
   /* match lowercase a with grave */
   pos1=prxmatch('/'||'C3A0'x||'/', text); 
   /* match uppercase A */
   pos2=prxmatch('/A/', text); 
   put text=$quote.
     / text=$hex.
     / len=
     / klen=
     / pos1=
     / pos2=
   ;
run;

Result

Tom_0-1691864896523.png

Note I posted a photo because if you copy the text you get a demonstration of the problems that still exist with trying to use multiple byte characters.

text="àA"
text=C3A041
len=3
klen=2
pos1=1
pos2=3

In WLATIN1 encoding the C3 is Uppercase A with Tilde and the A0 is non-breaking space.
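
Following the same byte-wise idea, the range asked about in the question (U+4E00 to U+9FA5, UTF-8 bytes E4B880 to E9BEA5) could be written as an alternation of byte ranges. This is only an untested sketch: it assumes a UTF-8 session and that PRX compares raw bytes, as the pos3 result in the question suggests. The sample string and variable names are made up for illustration.

```sas
data _null_;
  /* "42any" followed by U+6C49 U+5B57, written as UTF-8 hex bytes */
  text = '3432616E79'x || 'E6B189E5AD97'x;

  /* Byte-level alternation covering U+4E00..U+9FA5 in UTF-8:        */
  /*   E4 B8-BF 80-BF      U+4E00..U+4FFF                            */
  /*   E5-E8 80-BF 80-BF   U+5000..U+8FFF                            */
  /*   E9 80-BD 80-BF      U+9000..U+9F7F                            */
  /*   E9 BE 80-A5         U+9F80..U+9FA5                            */
  re = prxparse('/\xE4[\xB8-\xBF][\x80-\xBF]'
             || '|[\xE5-\xE8][\x80-\xBF][\x80-\xBF]'
             || '|\xE9[\x80-\xBD][\x80-\xBF]'
             || '|\xE9\xBE[\x80-\xA5]/');

  /* PRXMATCH returns a byte position, not a character position */
  pos = prxmatch(re, text);
  put pos= text=$hex.;
run;
```

The position returned is a byte offset, so any multi-byte characters before the match shift it relative to the character count (compare len and klen above).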

 

 

Maplefin
Fluorite | Level 6
Thanks for your reply; I have my answer. The latest version of SAS does not support multibyte for the PRX functions. Maybe SAS Viya can do it, but I don't have a license, so I cannot try it there.
whymath
Lapis Lazuli | Level 10
It seems you want to check for Chinese characters. Maybe one of my posts could help? https://bbs.pinggu.org/thread-11289025-1-1.html
whymath
Lapis Lazuli | Level 10

I have improved my solution; it is now a user-defined function! The function is named "anyhan", and you can use it like "anydigit" or "anyalpha" to search for the first Chinese character (basic set: U+4E00-9FA5).

proc fcmp outlib=work.funcs.char;
  %*Find the first Chinese character(Basic set: U+4E00-9FA5);
  function anyhan(string$);
    klen=klength(string);
    rst=0;
    do i=1 to klen until(rst^=0);
      kchar=ksubstr(string,i,1);
      if length(kchar)>1 then do;
        if '\u4E00'<=unicodec(kchar)<='\u9FA5' then rst=find(string,kchar,'t');
      end;
    end;
    return(rst);
  endfunc;
run;

data have;
  input text $42.;
  cards;
42anyhan
42任何汉字
42any汉字
42 +_@#$汉字
42龼龽龾龿鿀鿁汉字
أي كانجي
テストします
Путин
  ;
run;

option cmplib=work.funcs;
data want;
  set have;
  han_pos=anyhan(text);
run;

The result looks like:

3.png

Note: The result is the position of the first Chinese character in the source string; however, it is affected by the encoding of the SAS session.
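
To see why the encoding matters, a byte position can be compared with a character position. The following is a minimal sketch, assuming a UTF-8 session; the sample string and the variable names bytepos and charpos are illustrative only.

```sas
/* Report the session encoding; byte positions depend on it */
%put NOTE: Session encoding is &sysencoding;

data _null_;
  /* U+00E0 (C3A0) followed by U+6C49 (E6B189), written as UTF-8 hex bytes */
  text = 'C3A0'x || 'E6B189'x;

  /* FIND works on bytes */
  bytepos = find(text, 'E6B189'x);

  /* Character position = number of characters before the match + 1 */
  if bytepos > 1 then charpos = klength(substr(text, 1, bytepos - 1)) + 1;
  else charpos = bytepos;

  put bytepos= charpos=;
run;
```

In a UTF-8 session the two values differ whenever a multi-byte character precedes the match; in a single-byte encoding they would be the same.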

yabwon
Onyx | Level 15

How about making a pair? It's always better to have company 🙂

proc fcmp outlib=work.funcs.char;
  %*Find the first Chinese character(Basic set: U+4E00-9FA5);
  function anyhan(string$);
    klen=klength(string);
    do i=1 to klen;
      kchar=ksubstr(string,i,1);
      if length(kchar)>1 AND '\u4E00'<=unicodec(kchar)<='\u9FA5' then 
        return(find(string,kchar,'t'));
    end;
    return(0);
  endfunc;

  function kanyhan(string$);
    klen=klength(string);
    do i=1 to klen;
      kchar=ksubstr(string,i,1);
      if length(kchar)>1 AND '\u4E00'<=unicodec(kchar)<='\u9FA5' then 
        return(i);
    end;
    return(0);
  endfunc;

run;

data have;
  input text $42.;
  cards;
42anyhan
42任何汉字
42any汉字
42 +_@#$汉字
42龼龽龾龿鿀鿁汉字
أي كانجي
テストします
Путин
żółć42任何汉字
  ;
run;

option cmplib=work.funcs;
data want;
  set have;
  han_pos=anyhan(text);
  khan_pos=kanyhan(text);
run;   
proc print;
run;

Bart

 

 




whymath
Lapis Lazuli | Level 10
Very thoughtful, thank you!
Maplefin
Fluorite | Level 6

Thanks a lot! Very helpful and innovative solution.

Patrick
Opal | Level 21

I believe you could get access to Viya via SAS OnDemand for Academics. There is also a category for Independent Learners.

Patrick_0-1691993704891.png

 
