When you read SAS data sets in your environment, you might see this message:
NOTE: Data file xxxxx.xxxxx.DATA is in a format that is native to another host, or the file encoding does not match the session encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce performance.
This message indicates that the Cross-Environment Data Access (CEDA) engine is being used because the data is being accessed on a platform other than the one that created it. Because platforms represent data differently (for example, numeric data size, byte alignment, and endianness), CEDA converts the data to the native representation.
In addition to the above message, you might see the following message in SAS 9 and Viya 3.5:
WARNING: Some character data was lost during transcoding in the dataset xxxxx.xxxxx. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.
This message is displayed when CEDA detects a transcoding error caused by a difference between the encoding of the data set and the encoding of the SAS session. It means that SAS might not process the character data properly. Unfortunately, the message does not clearly identify the type of transcoding problem, which you need to know if you are migrating to a new environment where the SAS session encoding is different or if you need to process the data properly.
Starting in Viya 4, you will see the following CEDA transcoding error and warning messages.
1. This message results from a truncation error, which means that the variable length is not long enough to store the data because the data expanded during transcoding.
WARNING: Some character data in the data set "XXXXXXX" was lost during transcoding. Truncation occurred during transcoding to the new encoding. To avoid the transcoding error, please refer to "Troubleshooting Truncation and Data loss Issue during Transcoding" in SAS National Language Support (NLS): reference guide.
For example, suppose the following code is submitted:
/* your SAS session encoding is Shift-JIS */
data mytable (encoding=UTF8);
  length a $2.;
  a = '我';
run;
The character '我' takes 2 bytes in Shift-JIS encoding but 3 bytes in UTF-8 encoding, so a variable length of 2 is not long enough to hold the UTF-8 character.
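As a rough check outside SAS, the byte arithmetic above can be reproduced in Python. This is only a sketch: the helper names `byte_length` and `fits` are hypothetical, and Python's `shift_jis` codec is assumed to be a close enough stand-in for the SAS Shift-JIS session encoding for this one character.

```python
def byte_length(text: str, encoding: str) -> int:
    """Return the number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))


def fits(text: str, encoding: str, declared_length: int) -> bool:
    """Return True if text fits in a character variable of declared_length bytes."""
    return byte_length(text, encoding) <= declared_length


if __name__ == "__main__":
    ch = "\u6211"  # the character 我
    for enc in ("shift_jis", "utf-8"):
        print(enc, byte_length(ch, enc), "bytes; fits in $2:", fits(ch, enc, 2))
```

The same value that fits in 2 bytes under Shift-JIS no longer fits once it is stored as UTF-8, which is the situation the truncation warning reports.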
2. This message results from detecting characters that exist in one encoding but are not supported in the other.
ERROR: Some character data was lost during transcoding in the data set "XXXXXXX". It contains one or more characters that are not available in the new encoding. To avoid the transcoding error, please refer to "Troubleshooting Truncation and Data loss Issue during Transcoding" in SAS National Language Support (NLS): reference guide.
For example, suppose the following code is submitted:
data mytable2 (encoding=latin1);
  a = '我';
run;
The character '我' is not supported in LATIN1 encoding.
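The "unsupported character" case can also be checked outside SAS. The following Python sketch assumes that Python's `latin-1` codec corresponds to the SAS LATIN1 encoding (both are ISO 8859-1); the helper name `is_representable` is hypothetical.

```python
def is_representable(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the target encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        # At least one character has no code point in the target encoding.
        return False
    return True


if __name__ == "__main__":
    for ch in ("\u00a1", "\u6211"):  # ¡ and 我
        print(repr(ch), "in latin-1:", is_representable(ch, "latin-1"))
```

No variable length can fix this case, because the target encoding simply has no code point for the character.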
In addition to the WARNING and ERROR messages, you can get more details in the SAS log by setting the MSGLEVEL option to MSGLEVEL=i. SAS then writes INFO messages about the character and its location.
/* create EUC-CN data set */
data mylibs.tests;
  length a $2.;
  a = '我';
run;
/* read EUC-CN data set in UTF-8 session */
options MSGLEVEL=i;
proc print data=mylibs.tests;
run;
WARNING: Some character data in the data set "XXXXX.XXXXX" was lost during transcoding. Truncation occurred during transcoding to the new encoding. To avoid the transcoding error, please refer to "Troubleshooting Truncation and Data loss Issue during Transcoding" in SAS National Language Support (NLS): reference guide.
INFO: The length of the variable "a" in the data set "XXXXX.XXXXX" is insufficient. You might need to increase the length of the variable.
NOTE: Depending on the process, the variable information might not be available, in which case MSGLEVEL=i does not produce detailed information.
"Transcoding" is the process of converting character data from one encoding to another. Each encoding has different rules to represent the characters.
Encodings are categorized into three groups, and encoding schemes (rules) into two groups.
This table is an example of some character data representations in various encodings.
| Character | ASCII compatible | | EBCDIC compatible | | Unicode | |
| ¡ (Inverted Exclamation Mark) | 0xA1 | 0x85 0x42 | 0xAA | n/a | 0x00A1 | 0xC2 0xA1 |
| 我 | n/a | 0x89 0xE4 | n/a | 0x0E 0x49 0xDE 0x0F | 0x6211 | 0xE6 0x88 0x91 |
*NOTE: UTF-8 is considered an ASCII-compatible encoding because it supports the 7-bit ASCII characters (a-z, A-Z, and some common punctuation characters).
For example, the letter "A", which is available in every encoding, might have different code point assignments depending on the encoding. ISO8859-1, MS932, and UTF-8 are ASCII-compatible encodings, so the code point of "A" is the same value (0x41) in all of them. However, in an EBCDIC encoding such as EBCDIC-1047, the code point for "A" is 0xC1.
Another example is the letter "¡", which is a Latin character. This letter is also available in non-LATIN1 encodings such as MS932 and UTF-8, where it is assigned a multi-byte code point.
Another example is the letter "我". This is an Asian character, so it is not available in encodings such as ISO8859-1 and EBCDIC-1047, which support only Latin characters.
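Some of the code points discussed above can be reproduced with Python's built-in codecs. This is a sketch under assumptions: the helper name `hex_bytes` is hypothetical, Python's `cp932` codec stands in for MS932, and `cp037` stands in for an EBCDIC encoding because EBCDIC-1047 is not part of Python's standard codec set. Vendor mappings can differ for individual characters, so only values that are not in doubt are printed.

```python
def hex_bytes(ch: str, encoding: str) -> str:
    """Return the encoded bytes of ch as '0xNN 0xNN ...', or 'n/a' if unsupported."""
    try:
        raw = ch.encode(encoding)
    except UnicodeEncodeError:
        return "n/a"
    return " ".join(f"0x{b:02X}" for b in raw)


if __name__ == "__main__":
    # "A" has the same code point in the ASCII-compatible encodings,
    # but a different one in EBCDIC (cp037 is an assumed stand-in).
    for enc in ("latin-1", "cp932", "utf-8", "cp037"):
        print("A ", enc, hex_bytes("A", enc))
    # The character 我 is unsupported in latin-1 and multi-byte elsewhere.
    for enc in ("latin-1", "cp932", "utf-16-be", "utf-8"):
        print("\u6211", enc, hex_bytes("\u6211", enc))
```

Note that the helper prints UTF-16 as two separate bytes (0x62 0x11), whereas the table writes the same value as a single 16-bit unit (0x6211).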
Because of these limitations and the different rules of each encoding, transcoding cannot always perform the conversion.
Two types of issues are observed in transcoding: truncation and unsupported characters.
You might also see another issue where the character data is malformed (for example, because the data was improperly truncated). In this case, the CEDA transcoding engine treats the data as unsupported characters.
For explanations of possible transcoding problems, see the SAS Community post "Transcoding: Understand, Troubleshoot, and Resolve a Most Mysterious SAS® Error" and the blog post "Demystifying and resolving common transcoding problems".
The CEDA transcoding message indicates that you can find more information in the "SAS National Language Support (NLS): Reference Guide". Possible solutions are:
If you are not familiar with these solutions, review their usage and examples.
Here are some examples to show how to resolve the transcoding issues that CEDA detects.
Example 1. Truncation problem
In this example, you attempt to read Western European character data (LATIN1 data set) into a Viya 4 UTF-8 session.
1. In a SAS session with LATIN1 encoding, create a data set that contains national character data.
libname mylib "mylib";
data mylib.test;
  length var1 $4;
  var1 = "Ñåmë";
run;
The variable "var1" is defined as a character variable with a length of 4 and is assigned the value "Ñåmë".
2. Then start a SAS session with UTF-8 encoding and read the data set "test".
proc print data=mylib.test;
run;
The preceding statement produces this result. The value is truncated.
Obs    var1
  1    Nàm
You see this CEDA transcoding message in the SAS log.
WARNING: Some character data in the data set "MYLIB.TEST" was lost during transcoding. Truncation occurred during transcoding to the new encoding. To avoid the transcoding error, please refer to "Troubleshooting Truncation and Data loss Issue during Transcoding" in SAS National Language Support (NLS): reference guide.
If you change the MSGLEVEL option to "i", you will see more transcoding error information.
options MSGLEVEL=i;
proc print data=mylib.test;
run;
INFO: The length of the variable "var1" in the data set "MYLIB.TEST" is insufficient. You might need to increase the length of the variable.
This INFO message reveals that truncation occurred in the variable "var1" in the data set and suggests increasing the length of the variable "var1".
SAS provides two solutions for truncation: use the CVP engine or adjust the variable length.
Solution 1. Use the CVP engine
In the SAS session with UTF-8 encoding, submit the following LIBNAME statement and then a PROC PRINT step to print the data set through the CVP library.
libname mycvplib CVP 'mylib/';
proc print data=mycvplib.test;
run;
NOTE: The CVP engine provides READ-ONLY access. If you need to update the data set, copy it to a writable SAS library.
Solution 2. Expand the variable length before reading
To expand the variable length, start a SAS session with the same encoding as your data set. If the SAS session encoding is different, the CEDA engine is used, and you cannot modify your data set because the CEDA engine is READ-ONLY.
In that session, you can modify the variable length as in the following example.
data mylib.test;
  length var1 $10.;
  set mylib.test;
run;
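When choosing the new length, it can help to estimate how many bytes the existing values need in the target encoding, because SAS character variable lengths are measured in bytes. The following Python sketch does that for the value used in this example. The helper name `required_length` is hypothetical, and the values are assumed to be available as ordinary strings outside SAS.

```python
def required_length(values, encoding="utf-8"):
    """Return the largest byte length of any value in the given encoding."""
    return max((len(v.encode(encoding)) for v in values), default=0)


if __name__ == "__main__":
    # "Ñåmë" written with escapes so the literal is unambiguous.
    values = ["\u00d1\u00e5m\u00eb"]
    print("latin-1:", required_length(values, "latin-1"), "bytes")
    print("utf-8:  ", required_length(values, "utf-8"), "bytes")
```

Each accented letter takes one byte in LATIN1 but two bytes in UTF-8, so the 4-byte value grows to 7 bytes. Any length of at least 7, such as the $10 used above, is large enough for this value.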
You can also adjust the variable length by using the %COPY_TO_UTF8 macro (or %COPY_TO_NEW_ENCODING).
The macro generates a NEW data set "newdat". You must ensure you have enough space to store the new data set.
Example 2. Unsupported character
This example attempts to read the multilingual data set that is created with a UTF-8 encoding into a SAS session with a LATIN1 session encoding.
/* Create input data set in UTF-8 */
data mylib.myunidat;
  length var1 $4.;
  var1 = '¡'; output;
  var1 = '我'; output;
run;
This code attempts to read the Unicode (UTF-8) data set into a SAS LATIN1 session.
data mylat1;
  set mylib.myunidat;
run;
You will see the following transcoding error message.
ERROR: Some character data was lost during transcoding in the data set "MYLIB.MYUNIDAT". It contains one or more characters that are not available in the new encoding. To avoid the transcoding error, please refer to "Troubleshooting Truncation and Data loss Issue during Transcoding" in SAS National Language Support (NLS): reference guide.
The following example avoids CEDA transcoding by reading the Unicode data set as if it were LATIN1 encoded. KPROPDATA then transcodes the data by using the UESC (Unicode Escape Sequence) option, which converts each unsupported character into a Unicode escape sequence. In this example, '我' is converted to '\u6211'. Because the Unicode escape sequence requires more bytes, the length of "var1" is reset to $6. There are various replacement character options for unsupported characters. Refer to the NLS documentation for more information.
data mylib2.mylat1;
  length var1 $6.;  /* we need more space for the Unicode escape char */
  set mylib.myunidat(encoding='latin1');  /* read without transcoding */
  var1 = KPROPDATA(var1, 'uesc', 'utf-8', 'latin1');  /* convert unsupported characters to Unicode escapes */
run;
If you later want to move the LATIN1 data set into a UTF-8 data set, the UNICODE function can convert the Unicode escape sequences back into real Unicode characters.
data mylib.myunidat2;
  set mylib2.mylat1;  /* var1 keeps its $6 length; a LENGTH statement after SET cannot change it */
  var1 = UNICODE(var1, "ESC");  /* convert Unicode escape into real Unicode character */
run;
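The escape round trip can be illustrated outside SAS as well. The following Python sketch is an analogy only, not a reimplementation of KPROPDATA or UNICODE: it relies on Python's `backslashreplace` error handler and `unicode_escape` codec, which happen to use the same `\uXXXX` notation. The function names are hypothetical.

```python
def to_latin1_with_escapes(text: str) -> bytes:
    """Encode text as LATIN1, replacing unsupported characters with \\uXXXX escapes."""
    return text.encode("latin-1", errors="backslashreplace")


def from_escapes(data: bytes) -> str:
    """Turn LATIN1 bytes with \\uXXXX escapes back into real Unicode characters.

    Caution: literal backslashes in the data are also treated as escape prefixes.
    """
    return data.decode("unicode_escape")


if __name__ == "__main__":
    for ch in ("\u00a1", "\u6211"):  # ¡ and 我
        stored = to_latin1_with_escapes(ch)
        print(repr(ch), "->", stored, f"({len(stored)} bytes)", "->", repr(from_escapes(stored)))
```

The character ¡ exists in LATIN1 and stays a single byte, while 我 becomes the 6-byte sequence `\u6211`, which is why the example above resets the variable length to $6.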