UTF-8 encoding supports multilingual data and is the default session encoding for SAS Viya.
There are several ways to migrate a SAS data set to UTF-8 encoding.
This document demonstrates and explains how to do it with two macros: %COPY_TO_UTF8 and %COPY_TO_NEW_ENCODING.
The session encoding establishes the environment in which SAS processes SAS syntax and SAS data sets, and reads and writes external files. To explain what a session encoding is, let's draw an analogy with a car's engine.
The SAS session encoding is like the car's engine, and the data's encoding is the type of fuel. If your car's engine was built to run on diesel, filling the tank with salt water won't work. The same holds for SAS: data stored in one encoding might not be processed correctly if that encoding does not match your SAS session encoding. Unlike a car, however, SAS can transcode data from one encoding to another before processing it.
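Before comparing encodings, it helps to confirm which encoding the current session uses. The following is a minimal sketch; it relies only on the ENCODING system option and on the GETOPTION function that this article already uses in a TITLE statement.

```sas
/* Report the encoding of the current SAS session in the log. */
proc options option=encoding;
run;

/* The same value is available to macro code through GETOPTION. */
%put NOTE: Session encoding is %sysfunc(getoption(encoding));
```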
To demonstrate the two macros, we are going to convert a WLATIN1 SAS data set to UTF-8.
First, let's create our WLATIN1 sample SAS data set from a WLATIN1 SAS session.
libname wlt1 'wlt1 library' ;
data wlt1.symbols;
length symbol $1;
input symbol;
cards;
€
£
¥
+
¢
;
run;
When executing PROC PRINT and PROC CONTENTS on that data set:
proc print data=wlt1.symbols noobs ;
Title "SAS Session Encoding: %SYSFUNC(GETOPTION(ENCODING))";
run ;
proc contents data=wlt1.symbols;
run ;
the output should show the five symbols, a data set encoding of WLATIN1, and a length of 1 for the variable symbol.
Now, let's attempt to read the data set and create a UTF-8 version of it with the SET statement, first from a WLATIN1 SAS session and then from a UTF-8 SAS session.
SAS Session Encoding | SAS program to execute | Comments |
---|---|---|
WLATIN1 | | We use the encoding data set option to specify how to create the output data set from the WLATIN1 SAS session. |
UTF-8 | | |
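As a rough sketch, the two attempts described above could look like the following. The output data set name symbols_utf8 and the exact form of the statements are assumptions made for illustration, not the article's original listing. Both steps read the $1 variable unchanged, so both are expected to fail with one of the transcoding errors discussed next.

```sas
/* From a WLATIN1 session: ask for a UTF-8 output data set         */
/* with the ENCODING= data set option.                             */
data work.symbols_utf8(encoding='utf-8');
  set wlt1.symbols;
run;

/* From a UTF-8 session: the output data set takes the session     */
/* encoding, so no ENCODING= option is needed.                     */
data work.symbols_utf8;
  set wlt1.symbols;
run;
```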
In both cases, and depending on the SAS versions, the SAS log displays one of the following error messages. Message #2 is the expected and correct message.
# | Error Messages | SAS Versions |
---|---|---|
1 | ERROR: Some character data was lost during transcoding in the data set SBCS.SYMBOLS. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding. | SAS9, Viya 3.5 and early Viya 4 versions |
2 | ERROR: Some character data was lost during transcoding in the data set "WORK.SYMBOLS_UTF8". Truncation occurred during transcoding to the new encoding. To avoid the transcoding error, please refer to "Troubleshooting Truncation and Data loss Issue during Transcoding" in SAS National Language Support (NLS): reference guide. | Starting Viya 4 2023.08 |
Why are we getting the following statement in error message 2?
ERROR: Truncation occurred during transcoding to the new encoding.
In our WLATIN1 SAS data set, every character is encoded on 1 byte. The variable symbol is defined with a length of $1, which means it can hold only 1 byte.
Once transcoded to UTF-8, some of these characters need more bytes. The Euro character €, for example, is encoded on 3 bytes (see the table below), while the other characters are encoded on 1 or 2 bytes. However, the variable length is not changed during transcoding: it is still $1, which is too small to hold the 2 or 3 bytes that some of these characters require.
WLATIN1 hexadecimal representation | WLATIN1 length (bytes) | Symbol | UTF-8 length (bytes) | UTF-8 hexadecimal representation |
---|---|---|---|---|
'80'x | 1 | € | 3 | 'E282AC'x |
'A3'x | 1 | £ | 2 | 'C2A3'x |
'A5'x | 1 | ¥ | 2 | 'C2A5'x |
'2B'x | 1 | + | 1 | '2B'x |
'A2'x | 1 | ¢ | 2 | 'C2A2'x |
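The byte counts in the table can be checked with a short DATA step. This is only a sketch: it assumes that the KCVT function is available in your SAS installation and that 'wlatin1' and 'utf-8' are accepted as its encoding names. The variable names wlt, u8, and bytes are chosen for illustration.

```sas
data _null_;
  length wlt $1 u8 $4;
  /* WLATIN1 byte values of the five sample symbols */
  do wlt = '80'x, 'A3'x, 'A5'x, '2B'x, 'A2'x;
    /* Transcode the single WLATIN1 byte to UTF-8 */
    u8 = kcvt(wlt, 'wlatin1', 'utf-8');
    /* LENGTH counts bytes and ignores trailing blanks */
    bytes = length(u8);
    put wlt=$hex2. u8=$hex8. bytes=;
  end;
run;
```

Because u8 is padded to 4 bytes, the hexadecimal display ends with 20 (a blank) for every unused byte. The value of bytes is the length that the variable needs to hold that symbol in UTF-8.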
When running from a UTF-8 SAS session, the %COPY_TO_UTF8 macro can help migrate the SAS data set.
%COPY_TO_UTF8(from, to)
The %COPY_TO_UTF8 macro creates a new version of a data set. The macro can be used only in a UTF-8 SAS session, and it can import data sets from any encoding.
%COPY_TO_UTF8(wlt1.symbols, symbols);
The macro calculates the minimum required length needed for each character variable in the data set and defines a new LENGTH statement when re-creating the SAS data set.
To accomplish this, the macro uses the CVP engine with CVPMULTIPLIER=4 so that no truncation occurs during transcoding. The new length is then calculated from the transcoded values after the entire data set has been read.
In our example, the new length needed is now 3 (bytes), and the SAS data set is re-created with the new length definition.
Generated code seen from the log:
MPRINT(COPY_TO_UTF8): libname _INU8CVP CVP "viya4/cs/wlt1" CVPMULTIPLIER=4;
/* additional code executed to retrieve the transcodable character variable */
/* Data set is re-created with a new length */
MPRINT(COPY_TO_UTF8): data symbols;
MPRINT(COPY_TO_UTF8): length symbol $3 ;
MPRINT(COPY_TO_UTF8): set _INU8CVP.SYMBOLS;
MPRINT(COPY_TO_UTF8): run;
Note: A warning is displayed for each variable which has a new length. This is expected since we are re-defining the length for some of the variables.
WARNING: Multiple lengths were specified for the variable symbol by input data set(s). This can cause truncation of data.
Running the same PROC PRINT and PROC CONTENTS on the newly created SAS data set shows that it contains the expected characters and has new properties: the data set's encoding is now UTF-8 and the length of symbol is now 3 (bytes).
Although the macro uses the CVP engine, the lengths in the resulting data set differ slightly from what the CVP engine alone would produce. The CVP engine multiplies the length of every character variable by the constant specified with CVPMULTIPLIER, whereas the macro re-defines a length only if needed and uses the largest value found for a given variable.
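To see this difference, the same library can be read through the CVP engine directly, without the macro. The following sketch is meant for a UTF-8 session. The libref incvp and the output name symbols_cvp are hypothetical, and 'wlt1 library' is the placeholder path from the sample program.

```sas
/* Read the library through the CVP engine. Every character       */
/* length is multiplied by CVPMULTIPLIER (1 byte * 4 = 4 bytes).  */
libname incvp cvp 'wlt1 library' cvpmultiplier=4;

data work.symbols_cvp;
  set incvp.symbols;
run;

/* Compare the length reported for symbol with the $3 that the    */
/* macro chose.                                                   */
proc contents data=work.symbols_cvp;
run;

libname incvp clear;
```

With a multiplier of 4, symbol is expanded from 1 to 4 bytes, one byte more than the longest value in this data set needs.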
The macro %COPY_TO_NEW_ENCODING() can also be used to re-create a data set in a different encoding. The source data set's encoding must be the same as your SAS session encoding.
%COPY_TO_NEW_ENCODING(from, to, new_encoding);
In this example, the macro is executed from a WLATIN1 SAS session:
%COPY_TO_NEW_ENCODING(wlt1.symbols, u8.symbols_u8, UTF-8);
Similar to %COPY_TO_UTF8(), this macro calculates the minimum length needed by each character variable. It then creates the data set in the specified encoding, using a new length definition when needed. The CVP engine is not used here.
Here is the log with the generated code:
%COPY_TO_NEW_ENCODING(wlt1.symbols, u8.symbols_u8, UTF-8);
MPRINT(COPY_TO_NEW_ENCODING): data u8.symbols_u8(encoding="UTF-8");
MPRINT(COPY_TO_NEW_ENCODING): length symbol $3 ;
MPRINT(COPY_TO_NEW_ENCODING): set wlt1.symbols;
MPRINT(COPY_TO_NEW_ENCODING): run;
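The length calculation can be approximated by hand for a single variable. The sketch below is not the macro's actual implementation. It assumes a WLATIN1 session, an already assigned u8 libref, and the availability of the KCVT function. The macro variable name minlen is hypothetical.

```sas
/* Find the largest number of bytes that symbol needs in UTF-8. */
data _null_;
  set wlt1.symbols end=last;
  retain minlen 1;
  minlen = max(minlen, length(kcvt(symbol, 'wlatin1', 'utf-8')));
  if last then call symputx('minlen', minlen);
run;

%put NOTE: symbol needs &minlen byte(s) in UTF-8;

/* Re-create the data set in UTF-8 with the computed length. */
data u8.symbols_u8(encoding='utf-8');
  length symbol $&minlen;
  set wlt1.symbols;
run;
```

As with the macro, expect the "Multiple lengths were specified" warning, because the LENGTH statement overrides the length stored in the input data set.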