Troubleshooting with new CEDA Transcoding Error Messages in Viya 4

When you read SAS data sets in your environment, you might see this message:

NOTE: Data file xxxxx.xxxxx.DATA is in a format that is native to another host, or the file encoding
      does not match the session encoding. Cross Environment Data Access will be used, which might
      require additional CPU resources and might reduce performance.

This message indicates that Cross-Environment Data Access (CEDA) is being used because the data is being read on a platform, or in a session encoding, other than the one used to create it. Platforms differ in data representation (for example, numeric data size, byte alignment, and endianness), so CEDA converts the data to the native representation.
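To see why CEDA is engaged for a particular file, it can help to compare the session encoding with the encoding recorded in the data set. The following is a minimal sketch; the library path "mylib" and the data set name "test" are placeholders, so substitute your own.

```sas
/* Placeholder library; point this at the directory that holds your data set */
libname mylib "mylib";

/* Show the encoding of the current SAS session */
%put Session encoding: &SYSENCODING;
proc options option=encoding;
run;

/* The "Data Representation" and "Encoding" attributes in the output
   show the platform and encoding the data set was created with */
proc contents data=mylib.test;
run;
```

If the encoding reported by PROC CONTENTS differs from the session encoding, CEDA transcodes the character data when the data set is read.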

In addition to the CEDA note, you might see the following message in SAS9 and Viya 3.5:

WARNING: Some character data was lost during transcoding in the dataset xxxxx.xxxxx. Either
         the data contains characters that are not representable in the new encoding or
         truncation occurred during transcoding.

This message is displayed when CEDA detects a transcoding error because the data set encoding differs from the session encoding.

The message indicates that SAS might not process the character data properly. Unfortunately, it does not clearly identify the type of transcoding problem, which makes the problem hard to diagnose when you migrate to a new environment with a different SAS session encoding or otherwise need to process the data correctly.

New CEDA transcoding messages offer more information

Starting in Viya 4, you will see the following CEDA transcoding error and warning messages.

1. This message results from a truncation error: the variable length is not long enough to store the data, which expanded during transcoding.

WARNING: Some character data in the data set "XXXXXXX" was lost during transcoding. Truncation occurred during transcoding
         to the new encoding. To avoid the transcoding error, please refer to "Troubleshooting Truncation and Data loss Issue during
         Transcoding" in SAS National Language Support (NLS): reference guide.

For example, suppose that you submit the following code:

/* your SAS session encoding is Shift-JIS */
data mytable (encoding=UTF8);
  LENGTH a $2.;
  a = '我';
run;

The character '我' takes 2 bytes in Shift-JIS encoding but 3 bytes in UTF-8 encoding, so a variable length of 2 is not long enough to hold the UTF-8 character.
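The difference between byte length and character length can be checked directly. This is a small sketch that assumes a UTF-8 session; it relies on the LENGTH function counting bytes and the KLENGTH function counting characters.

```sas
/* Assumes the SAS session encoding is UTF-8 */
data _null_;
  length a $3;          /* room for one 3-byte UTF-8 character */
  a = '我';
  bytes = length(a);    /* storage used, in bytes */
  chars = klength(a);   /* number of characters */
  put bytes= chars=;
run;
```

In a UTF-8 session the log should report bytes=3 chars=1, which is why a variable declared as $2 cannot hold this single character.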

2. This message results from detecting characters in the source encoding that are not supported in the target encoding.

ERROR: Some character data was lost during transcoding in the data set "XXXXXXX". It contains one or more characters that 
       are not available in the new encoding.  To avoid the transcoding error, please refer to "Troubleshooting Truncation and Data
       loss Issue during Transcoding" in SAS National Language Support (NLS): reference guide.

For example, suppose that you submit the following code:

data mytable2 (encoding=latin1);
  a = '我';
run;

The character '我' is not supported in LATIN1 encoding.

Display diagnostic message

In addition to the WARNING and ERROR messages, you can get more details in the SAS log by setting the MSGLEVEL system option to MSGLEVEL=i. With this setting, SAS writes INFO messages about the affected character data and its location.

/* create EUC-CN data set */
data mylibs.tests;
   length a $2.;
   a = '我';
run;

/* read EUC-CN data set in UTF-8 session */
options MSGLEVEL=i;
proc print data=mylibs.tests;
run;

WARNING: Some character data in the data set "XXXXX.XXXXX" was lost during transcoding. Truncation occurred during transcoding
         to the new encoding. To avoid the transcoding error, please refer to "Troubleshooting Truncation and Data loss Issue during
         Transcoding" in SAS National Language Support (NLS): reference guide.

INFO: The length of the variable "a" in the data set "XXXXX.XXXXX" is insufficient. You might need to increase the length of the variable.

NOTE: Depending on the process, the variable information might not be available, in which case MSGLEVEL=i does not produce detailed information.
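Because MSGLEVEL=i also adds INFO messages from other parts of SAS, you might prefer to enable it only while diagnosing. The sketch below saves and restores the current setting; the macro variable name old_msglevel is hypothetical.

```sas
/* Remember the current setting (N or I), then turn on INFO messages */
%let old_msglevel = %sysfunc(getoption(msglevel));
options msglevel=i;

proc print data=mylibs.tests;
run;

/* Restore the previous setting */
options msglevel=&old_msglevel;
```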

What is "Transcoding"?

"Transcoding" is the process of converting character data from one encoding to another. Each encoding has different rules to represent the characters.

Encodings fall into three groups:

ASCII-compatible
These are the common encodings used on Windows and UNIX. They support the 7-bit ASCII characters at the same code points.
EBCDIC-compatible
These are typically used on IBM® zSystems (z/OS®) and iSeries (System i®).
Unicode-compatible
Unicode is the universal character encoding standard for representing text in computer processing. Unicode is fully compatible with the international standard ISO/IEC 10646 and can support up to 1,114,112 code points.

Encoding schemes (rules) fall into two groups:

Fixed-width encodings (e.g. single-byte encodings, SBCS, or fixed-width Unicode)
An SBCS encoding uses a single-byte code-point scheme (byte range 0x00-0xFF), so it can represent at most 256 characters. Because of the limited code points, it supports only specific characters. An example of a single-byte encoding is ISO8859-1, which supports characters for Western European languages such as English, French, German, and Spanish.
Fixed-width Unicode encodings use 2-byte or 4-byte code points to represent characters.
Multi-byte encodings (MBCS) (e.g. UTF-8 and double-byte encodings, DBCS)
A DBCS encoding uses both single-byte and multi-byte code points to represent more than 256 characters. It supports Asian languages, which require thousands of characters.

This table is an example of some character data representations in various encodings.

	ASCII compatible		EBCDIC compatible		Unicode
	SBCS	DBCS	SBCS	DBCS	Fixed	MBCS
Letter	ISO8859-1	MS932 (Shift-JIS)	EBCDIC-1047	IBM939	UTF-16	UTF-8*
A	0x41	0x41	0xC1	0xC1	0x0041	0x41
¡ (Inverted Exclamation Mark)	0xA1	0x85 0x42	0xAA	n/a	0x00A1	0xC2 0xA1
我	n/a	0x89 0xE4	n/a	0x0E 0x49 0xDE 0x0F	0x6211	0xE6 0x88 0x91

*NOTE: UTF-8 is considered an ASCII-compatible encoding because it supports the 7-bit ASCII characters (a-z, A-Z, and some common punctuation characters).

For example, the letter "A" is available in every encoding, but its code point can differ from one encoding to another. ISO8859-1, MS932, and UTF-8 are ASCII-compatible encodings, so the code point of "A" is the same value (0x41). However, in an EBCDIC encoding such as EBCDIC-1047, the code point for "A" is 0xC1.

Another example is the letter "¡", which is a Latin character. This letter is also available in non-LATIN1 encodings such as MS932 and UTF-8, where it is assigned a multi-byte code point.

Another example is the letter "我". This is an Asian character, so it is not available in encodings like ISO8859-1 and EBCDIC-1047 which do not support Asian characters. These encodings support only LATIN characters.

Because of the limitation and different rules in the encodings, transcoding sometimes cannot perform the conversion.

Two types of issues are observed in transcoding:

- A letter is not supported in the target encoding. This occurs when transcoding between encodings for incompatible languages, e.g. LATIN1 and Asian encodings.
- A letter requires more bytes in the target encoding. This occurs when transcoding between different encoding schemes, such as SBCS and DBCS.

You might see another issue where character data is malformed (e.g. improperly truncated data). In this case, the CEDA transcoding engine treats the malformed data as unsupported characters.

See the SAS Community post "Transcoding: Understand, Troubleshoot, and Resolve a Most Mysterious SAS® Error" and the blog post "Demystifying and resolving common transcoding problems", which explain possible transcoding problems.

How to resolve the transcoding error

The CEDA transcoding messages point you to the "SAS National Language Support (NLS): Reference Guide" for more information. Possible solutions are:

Character data truncation solutions:
- CVP engine
- %COPY_TO_NEW_ENCODING autocall macro
- %COPY_TO_UTF8 autocall macro
Unsupported character data solution:
- KPROPDATA SAS function

If you are not familiar with these solutions, review their usage and examples in the reference guide.

Examples:

Here are some examples to show how to resolve the transcoding issues that CEDA detects.

Example 1. Truncation problem

In this example, you attempt to read Western European character data (LATIN1 data set) into a Viya 4 UTF-8 session.

1. In a SAS session with LATIN1 encoding, create a data set that contains national character data.

libname mylib "mylib";
data mylib.test;
  length var1 $4;
  var1 = "Ñåmë";
run;

The variable "var1" is defined as the character data type with length of 4 and is assigned the value "Ñåmë".

2. Then start a SAS session with UTF-8 encoding to read data set "test".

proc print data=mylib.test; run;

The preceding statement produces this result. The value is truncated.

                                   Obs    var1                               

                                    1     Nàm

You see this CEDA transcoding message in the SAS log.

WARNING: Some character data in the data set "MYLIB.TEST" was lost during transcoding. Truncation occurred during transcoding
         to the new encoding. To avoid the transcoding error, please refer to "Troubleshooting Truncation and Data loss Issue during
         Transcoding" in SAS National Language Support (NLS): reference guide.

If you change the MSGLEVEL option to "i", you will see more transcoding error information.

options MSGLEVEL=i;
proc print data=mylib.test; run;

INFO: The length of the variable "var1" in the data set "MYLIB.TEST" is insufficient. You might need to increase the length of the variable.

This INFO message reveals that truncation occurred in the variable "var1" in the dataset and suggests increasing the size of the variable "var1".

SAS provides two solutions for truncation: use the CVP engine or adjust the variable length.

Solution 1. Using the CVP engine

In the SAS session with UTF-8 encoding, submit the following LIBNAME statement and PROC PRINT step to print the data set through the CVP library.

libname mycvplib CVP 'mylib/';
proc print data=mycvplib.test; run;

NOTE: The CVP engine provides READ-ONLY access. If you need to update the data set, copy it to a writable SAS library.
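The CVP engine expands character variable lengths by a multiplier. Assuming the default multiplier of 1.5 described in the NLS reference, "var1" would be presented as $6, whereas "Ñåmë" needs 7 bytes in UTF-8 (three 2-byte characters plus one 1-byte character). If values are still truncated, a larger multiplier can be requested with the CVPMULTIPLIER= option, as in this sketch; the value 2 is an assumption chosen for this data.

```sas
/* CVPMULTIPLIER=2 doubles each character variable length on read:
   var1 $4 is presented as $8, enough for the 7 bytes of "Ñåmë" in UTF-8 */
libname mycvplib cvp 'mylib/' cvpmultiplier=2;

proc print data=mycvplib.test;
run;
```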

Solution 2. Expand the variable length before reading

To expand the variable length, start a SAS session with the same encoding as your data set. If the session encoding is different, CEDA is used, and you cannot modify your data set because CEDA access is READ-ONLY.

Once the session has started, you can modify the variable length as in the following example.

data mylib.test;
  LENGTH var1 $10.;
  set mylib.test;
run;
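One way to choose the new length, rather than guessing, is to measure how many bytes the values need in the target encoding. The sketch below is run in the UTF-8 session and reads the data through the CVP engine; the libref mycvp4 and the multiplier value 4 are assumptions, with the multiplier set deliberately high so that no value is truncated while measuring.

```sas
/* Run in the UTF-8 session; the CVP library is read-only */
libname mycvp4 cvp 'mylib/' cvpmultiplier=4;

proc sql;
  /* LENGTH counts bytes, so this is the longest value in UTF-8 bytes */
  select max(length(var1)) as max_bytes
    from mycvp4.test;
quit;
```

Any length at or above the reported value avoids truncation for the current data; allow some headroom if new values are expected.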

You can also adjust the variable length by using the %COPY_TO_UTF8 macro (or %COPY_TO_NEW_ENCODING).

%COPY_TO_UTF8(mylib.test, mylib.newdat);

The macro generates a NEW data set "newdat". You must ensure you have enough space to store the new data set.

Example 2. Unsupported character

This example attempts to read the multilingual data set that is created with a UTF-8 encoding into a SAS session with a LATIN1 session encoding.

/* Create input data set in UTF-8 */
data mylib.myunidat;
  length var1 $4.;
  var1 = '¡'; output;
  var1 = '我'; output;
run;

This code attempts to read the Unicode (UTF-8) data set into a SAS LATIN1 session.

data mylat1;
  set mylib.myunidat;
run;

You will see the following transcoding error message.

ERROR: Some character data was lost during transcoding in the data set "MYLIB.MYUNIDAT". It contains one or more characters that
       are not available in the new encoding. To avoid the transcoding error, please refer to "Troubleshooting Truncation and Data
       loss Issue during Transcoding" in SAS National Language Support (NLS): reference guide.

Solution: KPROPDATA

This example reads the Unicode data set as if it were LATIN1 encoded (ENCODING='latin1'), so CEDA transcoding is bypassed. KPROPDATA then transcodes the data with the UESC (Unicode escape sequence) option, which converts each unsupported character into a Unicode escape sequence. Here, '我' is converted to '\u6211'. Because the escape sequence requires more bytes, the length of "var1" is increased to $6. Other replacement options for unsupported characters are available; refer to the NLS documentation for more information.

data mylib2.mylat1;
  length var1 $6.;  /* we need more space for the Unicode escape char */
  set mylib.myunidat(encoding='latin1');  /* read no transcoding */
  var1 = KPROPDATA(var1, 'uesc', 'utf-8', 'latin1');  /* convert to Unicode escape for unsupported character */
run;
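If the escape sequence is not needed downstream, a simpler replacement may be enough. This sketch assumes that KPROPDATA accepts a 'question' option that substitutes "?" for each unsupported character; confirm the option name in the NLS documentation before relying on it. The output data set name mylat1_q is hypothetical.

```sas
data mylib2.mylat1_q;
  set mylib.myunidat(encoding='latin1');  /* read without CEDA transcoding */
  /* Assumed option: replace each unsupported character with '?' */
  var1 = kpropdata(var1, 'question', 'utf-8', 'latin1');
run;
```

Unlike UESC, this replacement is lossy: the original character cannot be recovered from the "?" afterward.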

If you later want to move the LATIN1 data set back into a UTF-8 data set, the UNICODE function can convert the Unicode escape sequences into real Unicode characters.

data mylib.myunidat2;
  /* read the escaped value under another name so var1 can be redefined */
  set mylib2.mylat1(rename=(var1=var1_esc));
  length var1 $4.;
  var1 = UNICODE(var1_esc, "ESC"); /* convert Unicode escape into real Unicode character */
  drop var1_esc;
run;

Summary:

In Viya 4, the primary encoding is UTF-8, so transcoding is more common when you migrate to Viya 4 from SAS9 or Viya 3.5. The new CEDA transcoding messages and the troubleshooting page in the SAS National Language Support (NLS): Reference Guide can help you understand and resolve your transcoding issues.
 

References:

Demystifying and resolving common transcoding problems – Blog post by SAS/TS
Transcoding: Understand, Troubleshoot, and Resolve a Most Mysterious SAS Error – SAS/Community post