PRX Functions to Support Multibyte Characters


PRX is the abbreviation of “Perl Regular Expression”. The SAS PRX functions and CALL routines originally provided regular expression functionality based on Perl 5.6.1. The Perl regular expression (REGEX) engine in that version is designed for single-byte character data, which is why the SAS PRX functions could not process multibyte characters in legacy double-byte sessions, such as Chinese and Japanese SAS. With the proliferation of UTF-8 sessions in Viya, this functionality gap has become increasingly prominent, because the UTF-8 session encoding is a multibyte character set (MBCS). To meet the needs of SAS products, starting with release 2021.1.6/LTS 2021.2, the PRX functions and CALL routines have been upgraded to a Perl 5.32 engine that supports MBCS data.

This article collects some typical MBCS processing scenarios, describes the challenges with the previous PRX functions, and introduces the improvements in the new version. It is intended to help users of the PRX functions better understand multibyte character handling. The discussion covers the following topics, and all examples assume a SAS UTF-8 session.

Word character
Semantics
Character class
Unicode code point

Word Character & Word Boundary

The metacharacters “\w” and “\W” match a “word character” and a “non-word character”, respectively. A “word character” is a character that can be used to form words; every other character is a “non-word character”. The following example uses “\w” in a regular expression with PRXCHANGE to switch the last name and first name, turning “Jones Fred” into “Fred Jones”.

data _null_;
   original_name = 'Jones Fred';
   switched_name = prxchange('s/(\w+) (\w+)/$2 $1/', -1, original_name);
   put original_name= / switched_name=;
run;
 
original_name=Jones Fred
switched_name=Fred Jones

Prior to 2021.1.6/LTS 2021.2, the PRX functions do not treat a multibyte character as a “\w” word character. If the string contains a non-SBCS word character, such as the “ö” in “Jönes”, which is a 2-byte character in UTF-8 encoding, the regular expression produces the unexpected result shown in the example below.

data _null_;
   original_name = 'Jönes Fred';
   switched_name = prxchange('s/(\w+) (\w+)/$2 $1/', -1, original_name);
   put original_name= / switched_name=;
run;
 
/* Executing prior to the release 2021.1.6 ... */
original_name=Jönes Fred
switched_name=JöFred nes

Because ‘ö’ is not recognized as a “\w” character, the regular expression matches only “nes Fred”, which is switched to “Fred nes”. The unmatched leading “Jö” is kept in front of it, so the final output is “JöFred nes”. PRXCHANGE in 2021.1.6/LTS 2021.2 and later correctly outputs the expected result, “Fred Jönes”.
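
The mechanism can be reproduced outside SAS. The following is a minimal Python sketch, not SAS code and not the PRX implementation. It assumes that matching the raw UTF-8 bytes with an ASCII-only “\w” approximates the older byte-oriented engine, and that matching the decoded string approximates the character-oriented one.

```python
import re

name = "J\u00f6nes Fred"  # 'Jönes Fred'; 'ö' is U+00F6, two bytes (C3 B6) in UTF-8

# Byte-oriented: in a bytes pattern, \w covers only ASCII letters, digits, and "_",
# so neither byte of 'ö' counts as a word character.
byte_result = re.sub(rb"(\w+) (\w+)", rb"\2 \1", name.encode("utf-8")).decode("utf-8")

# Character-oriented: in a str pattern, \w also covers Unicode letters such as 'ö'.
char_result = re.sub(r"(\w+) (\w+)", r"\2 \1", name)

print(byte_result)  # JöFred nes
print(char_result)  # Fred Jönes
```

The byte-oriented call reproduces the unexpected “JöFred nes”, and the character-oriented call gives “Fred Jönes”.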

The handling of word characters also affects word boundary detection around multibyte characters. The metacharacter “\b” matches a word boundary, that is, a position between a \w character and a \W character. In the SBCS example below, “\b” finds the standalone word “i” and replaces it with “I”, while the “i” in “it” is left unchanged.

data _null_;
   str1 = 'i see it.';
   str2 = prxchange('s/\bi\b/I/', -1, str1);
   put str1= / str2=;
run;
 
/* Executing prior to the release 2021.1.6 ... */
str1=i see it.
str2=I see it.

The next example applies the same approach to a multibyte character: “s/\bî\b/I/” is intended to detect the word boundaries, replace the standalone word “î” with “I”, and leave “ît” unchanged.

data _null_;
   str1 = 'î see ît.';
   str2 = prxchange('s/\bî\b/I/', -1, str1);
   put str1= / str2=;
run;
 
/* Executing prior to the release 2021.1.6 ... */
str1=î see ît.
str2=î see ît.

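The version of PRXCHANGE prior to 2021.1.6/LTS 2021.2 does not change the original string as expected: because “î” is not treated as a word character, no word boundary is detected in front of it, so the pattern never matches. In 2021.1.6/LTS 2021.2 and later, the improved version correctly outputs “I see ît.”.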
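
The boundary behavior can be illustrated the same way. This is a hedged Python sketch, not SAS code: the bytes pattern stands in for a byte-oriented engine with an ASCII-only notion of word characters, and the str pattern stands in for a character-oriented engine.

```python
import re

text = "\u00ee see \u00eet."   # 'î see ît.'; 'î' is U+00EE, two bytes (C3 AE) in UTF-8
pattern = "\\b\u00ee\\b"       # the regular expression \bî\b

# Byte-oriented: neither byte of 'î' is an ASCII word character, so \b finds no
# boundary in front of it and nothing is replaced.
byte_result = re.sub(pattern.encode("utf-8"), b"I", text.encode("utf-8")).decode("utf-8")

# Character-oriented: 'î' is a word character, so only the standalone 'î' is replaced.
char_result = re.sub(pattern, "I", text)

print(byte_result)  # î see ît.
print(char_result)  # I see ît.
```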

Byte Semantics & Character Semantics

Prior to 2021.1.6/LTS 2021.2, the PRX functions always accept and return character positions as byte counts. This is a byte semantics interface. For example, PRXMATCH searches for a pattern match and returns the byte position at which the pattern is found. The following example shows how a position in byte semantics differs from the user-perceived position in multibyte character data.

data _null_;
   pos1 = prxmatch('/e/', 'Alfred');
   pos2 = prxmatch('/ë/', 'Ålfrëð');
   pos3 = prxmatch('/test/', '中文字符test');
   put pos1= / pos2= / pos3=;
run;
 
/* Executing prior to the release 2021.1.6 ... */
pos1=5    /* character position is 5 */
pos2=6    /* character position is 5 */
pos3=13   /* character position is 5 */

The first match is in SBCS data and poses no problem: position 5 is returned. Because each SBCS character occupies exactly one byte, the returned value agrees with what a reader would expect.
In the second match, ‘ë’ is the 5th character in “Ålfrëð”. Because “Ålfr” is 5 bytes long in UTF-8 encoding, PRXMATCH returns 6 instead of 5.
In the third match, “test” starts at the 5th character. “中文字符” occupies 4x3=12 bytes in UTF-8, so PRXMATCH returns the byte position 13.

Clearly, byte semantics can be challenging. The improved PRX functions in 2021.1.6/LTS 2021.2 and later conform to the highest Internationalization Compatibility level by using character semantics. All three cases in this example return 5, the character position. The CALL PRXSUBSTR routine follows the same semantics.
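
The difference between the two kinds of position comes directly from the UTF-8 encoding. The following is a small Python sketch, not SAS code; `positions` is a hypothetical helper written for this illustration. It computes the 1-based byte position and the 1-based character position for the three cases above.

```python
def positions(pattern, text):
    """Return (1-based byte position, 1-based character position) of pattern in text."""
    byte_pos = text.encode("utf-8").find(pattern.encode("utf-8")) + 1
    char_pos = text.find(pattern) + 1
    return byte_pos, char_pos

case1 = positions("e", "Alfred")
case2 = positions("\u00eb", "\u00c5lfr\u00eb\u00f0")            # 'ë' in 'Ålfrëð'
case3 = positions("test", "\u4e2d\u6587\u5b57\u7b26test")       # 'test' in '中文字符test'

print(case1)  # (5, 5)
print(case2)  # (6, 5)
print(case3)  # (13, 5)
```

The first value in each pair corresponds to the byte semantics result described above, and the second to the character semantics result.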

Character Classes

A bracketed character class “[...]” matches any single character from the set listed between the brackets. However, when the class contains a multibyte character, the behavior becomes unexpected. Consider the following example, run with a version of PRXCHANGE prior to 2021.1.6/LTS 2021.2.

data _null_;
   /* Use a pattern to replace all occurrences of "þat",    */
   /* "cat", or "rat" with the value "tree".                */
   length text $ 46;
   RegularExpressionId = prxparse('s/[þcr]at/tree/');
   text = 'The woods have a þat, cat, bat, and a rat!';
   /* Use CALL PRXCHANGE to perform the search and replace. */
   /* Because the argument times has a value of -1, the     */
   /* replacement is performed as many times as possible.   */
   call prxchange(RegularExpressionId, -1, text);
   put text;
run;
 
/* Garbage byte 0xC3 appears in the result string. */
The woods have a <C3>tree, tree, bat, and a tree!

The output contains a garbage byte, 0xC3. It is the first byte of the multibyte character ‘þ’, which is encoded as the two bytes 0xC3 0xBE in UTF-8.

The following example shows why. When PRXMATCH searches for the multibyte character directly, it works correctly. However, when a multibyte character appears inside “[…]”, it is parsed as individual bytes, so the position returned by PRXMATCH becomes meaningless. This is the root cause: because the incorrect position falls in the middle of a multibyte character, the string is corrupted during substitution and a garbage byte appears.

data _null_;
   text = 'a þat, cat, ßat, and a rat!';
   pos1 = prxmatch('/þat/', text);
   put pos1=;
   pos2 = prxmatch('/[þ]at/', text);
   put pos2=;
run;
 
/* Executing prior to the release 2021.1.6 ... */
pos1=3  /* correct position                    */
pos2=4  /* incorrect matching position         */

The improved PRX functions in 2021.1.6/LTS 2021.2 and later match a complete character in character classes. A multibyte character inside “[…]” is treated as the basic unit of operation and is no longer split into bytes. In the example above, both matches occur at the third character, and the CALL PRXCHANGE example no longer produces garbage data.
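
The byte-splitting effect can be reproduced with a short Python sketch. It is not SAS code and makes no claim about the PRX internals. It assumes that compiling the pattern as UTF-8 bytes approximates a byte-oriented character class, and that compiling it as a string approximates a character-oriented one.

```python
import re

text = "The woods have a \u00feat, cat, bat, and a rat!"  # '\u00fe' is 'þ' (C3 BE in UTF-8)
pattern = "[\u00fecr]at"                                   # the regular expression [þcr]at

# Byte-oriented: the class becomes the byte set {C3, BE, 'c', 'r'}, so it matches the
# trailing byte BE of 'þ' and leaves the leading byte C3 behind in the result.
byte_result = re.sub(pattern.encode("utf-8"), b"tree", text.encode("utf-8"))

# Character-oriented: 'þ' is a single member of the class.
char_result = re.sub(pattern, "tree", text)

# 1-based byte position of [þ]at in the second example string ('ß' is '\u00df').
text2 = "a \u00feat, cat, \u00dfat, and a rat!".encode("utf-8")
byte_pos = re.search("[\u00fe]at".encode("utf-8"), text2).start() + 1

print(byte_result)  # b'The woods have a \xc3tree, tree, bat, and a tree!'
print(char_result)  # The woods have a tree, tree, bat, and a tree!
print(byte_pos)     # 4
```

The stray `\xc3` in the byte-oriented result corresponds to the garbage byte in the CALL PRXCHANGE output, and the byte position 4 corresponds to the incorrect `pos2` value.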

Unicode Code Point

“\x{…}” specifies a Unicode code point by its hexadecimal value in a regular expression. Unfortunately, with the previous PRX functions, a search for a multibyte character fails when the character is specified with a Unicode escape. In the following example, the Chinese character ‘中’ is matched when it appears literally in the regular expression, but the match fails when “\x” is used with its Unicode code point.

data _null_;
  /* '中' Unicode point is U+4E2D */
  pos1=prxmatch('/中/', '中');
  put pos1=;
  pos2=prxmatch('/\x{4E2D}/', '中');
  put pos2=;
run;
 
/* Executing prior to the release 2021.1.6 ... */
pos1=1  /* matched    */
pos2=0  /* un-matched */

In the improved version (2021.1.6/LTS 2021.2 and later), the various forms of “\x” work on multibyte characters. Both “\xnn” and “\x{…}” can specify a hexadecimal Unicode code point in a regular expression, and “\x” can be used in a character class to define a range of characters, such as [\x{0080}-\x{00FF}]. Furthermore, “\x” can also be used in the substitution string. The following example shows that all of these forms work as expected in the improved version of the PRX functions.

data _null_;
   /* match Latin small letter a with grave */
   pos1=prxmatch('/\x{00E0}/', 'à');   /* 'à': U+00E0 */
   put pos1=;
   pos2=prxmatch('/\x{E0}/', 'à');     /* 'à': U+00E0 */
   put pos2=;
   pos3=prxmatch('/\xE0/', 'à');       /* 'à': U+00E0 */
   put pos3=;
 
   /* match Chinese character */
   pos4=prxmatch('/\x{4E2D}/', '中'); /* '中': U+4E2D */
   put pos4=;
 
   /* match Latin-1 Supplement */
   pos5=prxmatch('/[\x{0080}-\x{00FF}]/', 'Jönes');
   put pos5=;
 
   /* replace spaces with commas */
   subs1 = prxchange("s/\x20/\x{002c}/",-1,'a b c');
   put subs1=;
   subs2 = prxchange("s/\x20/\x2c/",    -1,'a b c');
   put subs2=;
run;
 
/* Executing after the release 2021.1.6 ... */
pos1=1       /* matched */
pos2=1       /* matched */
pos3=1       /* matched */
pos4=1       /* matched */
pos5=2       /* matched */
subs1=a,b,c  /* substitution successful */
subs2=a,b,c  /* substitution successful */
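
When writing “\x{…}” patterns, it helps to look up a character's code point and check whether it falls in an intended range. The following Python sketch is one way to do that outside SAS; `describe` is a hypothetical helper written for this illustration, and the Latin-1 Supplement range U+0080 to U+00FF is taken from the example above.

```python
def describe(ch):
    """Return (code point label, UTF-8 bytes as hex, is in U+0080..U+00FF) for one character."""
    code_point = ord(ch)
    return (f"U+{code_point:04X}", ch.encode("utf-8").hex().upper(), 0x80 <= code_point <= 0xFF)

a_grave = describe("\u00e0")   # 'à'
zhong = describe("\u4e2d")     # '中'
o_umlaut = describe("\u00f6")  # 'ö'

print(a_grave)   # ('U+00E0', 'C3A0', True)
print(zhong)     # ('U+4E2D', 'E4B8AD', False)
print(o_umlaut)  # ('U+00F6', 'C3B6', True)
```

The code point, not the UTF-8 byte sequence, is the value that goes inside “\x{…}”. ‘ö’ falls inside [\x{0080}-\x{00FF}], which is why `pos5` reports a match at the second character of “Jönes”.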

Summary

By upgrading to the new Perl REGEX engine, the regular expression kernel moves to character-based processing instead of the original byte-based (binary) matching. The improvement works seamlessly on character data, even when multibyte characters are present. As shown above, any UTF-8 character can be referenced by specifying its code point in the PRX functions. Every string manipulation takes place on a character boundary, so a complete multibyte character is never corrupted into garbage. Matching positions are also expressed as character indexes. Because the PRX functions and CALL routines are based on pure Perl REGEX, many more Perl metacharacters are available for use in your regular expression patterns, such as “\p” to match various Unicode properties. Note that the PRX functions discussed in this article are those in SAS Foundation, and the improvements are released only in Viya 4.

Reference

PRX functions and CALL routines
  PRXCHANGE Function
  PRXMATCH Function
  PRXPAREN Function
  PRXPARSE Function
  PRXPOSN Function
  CALL PRXCHANGE Routine
  CALL PRXDEBUG Routine
  CALL PRXFREE Routine
  CALL PRXNEXT Routine
  CALL PRXPOSN Routine
  CALL PRXSUBSTR Routine
Perl Regular Expressions Reference
Perl regular expressions
Index of Unicode Version 13.0.0 character properties in Perl
ICU's Regular Expressions

Maplefin · 08-10-2023

Does this really work? I tried your code in my SAS session (Encoding=UTF8), but matching the Latin-1 Supplement range failed. It seems that SAS didn't recognize the specified range "\x{0080}-\x{00FF}" at all. I got the errors below in the log:

ERROR: Invalid [] range "}-\x" before HERE mark in regex m/[\x{0080}-\x << HERE {00FF}]/
ERROR: The regular expression passed to the function PRXMATCH contains a syntax error.

My SAS version is 9.04.01M7P080520.

LaneLi · 01-14-2024

The feature is available starting with Viya release 2021.1.6/LTS 2021.2. SAS 9 does not support it.

SAS Communities Library