Migrating SAS Data sets to UTF-8 Encoding with SAS Macros

Started 10-11-2023
Modified 08-14-2023

Overview 

UTF-8 encoding supports multilingual data and is the default session encoding for SAS Viya. 

There are several ways to migrate a SAS data set to UTF-8 encoding:

  • The CVP engine
  • PROC MIGRATE
  • PROC DATASETS
  • The SAS macros %COPY_TO_UTF8 and %COPY_TO_NEW_ENCODING

This article shows how to migrate a SAS data set to UTF-8 encoding using these two macros.

 

SAS Session Encoding

The session encoding establishes the environment in which SAS processes syntax and data sets and reads and writes external files. A car analogy helps explain it.

The SAS session encoding is like the car's engine, and the data's encoding is like the fuel. If the engine is built to run on diesel, filling the tank with salt water won't work. The same applies to SAS: data stored in one encoding might not be processed correctly if that encoding does not match the SAS session encoding. Unlike a car, however, SAS can transcode data from one encoding to another before processing it.
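
As a quick check, the session encoding can be written to the log before working with any data. This is a minimal sketch that uses the ENCODING system option and the SYSENCODING automatic macro variable; the value reported depends on how your session was started.

```sas
/* Write the current session encoding to the SAS log */
%put Session encoding from the ENCODING option: %sysfunc(getoption(encoding));
%put Session encoding from SYSENCODING: &sysencoding;

/* PROC OPTIONS reports the same system option */
proc options option=encoding;
run;
```

The session encoding is fixed when SAS starts, so a different encoding requires a new session started with the desired ENCODING value.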

 

Context

To demonstrate the two macros, we are going to convert a WLATIN1 SAS data set to UTF-8.

First, let's create our sample WLATIN1 SAS data set from a WLATIN1 SAS session.

libname wlt1 'wlt1 library' ;
 
data wlt1.symbols;
length symbol $1;
input symbol;
cards;
€
£
¥
+
¢
;
run;

When executing PROC PRINT and PROC CONTENTS on that data set: 

proc print data=wlt1.symbols noobs ;
Title "SAS Session Encoding: %SYSFUNC(GETOPTION(ENCODING))";
run ;
proc contents data=wlt1.symbols;
run ;  

it should return information similar to the following:

 

[Images: wlatin1_ds.png (PROC PRINT output), wlatin1_de_properties.png (PROC CONTENTS output)]

 

Now, let's try to read the data set and create a UTF-8 version of it with the SET statement, first from a WLATIN1 SAS session and then from a UTF-8 SAS session.

 

SAS session encoding: WLATIN1
SAS program to execute:
data symbols_utf8 (encoding="UTF-8");
   set wlt1.symbols;
run;
Comments: We use the ENCODING= data set option to specify how to create the output data set from the WLATIN1 SAS session.

SAS session encoding: UTF-8
SAS program to execute:
data symbols_utf8;
   set wlt1.symbols;
run;

  

In both cases, and depending on the SAS versions, the SAS log displays one of the following error messages. Message #2 is the expected and correct message.

Message 1 (SAS 9, Viya 3.5, and early Viya 4 versions):
ERROR: Some character data was lost during transcoding in the data set SBCS.SYMBOLS. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.

Message 2 (starting with Viya 4 2023.08):
ERROR: Some character data was lost during transcoding in the data set "WORK.SYMBOLS_UTF8". Truncation occurred during transcoding to the new encoding. To avoid the transcoding error, please refer to "Troubleshooting Truncation and Data loss Issue during Transcoding" in SAS National Language Support (NLS): reference guide.

 

Why are we getting the following statement in error message 2?

ERROR: Truncation occurred during transcoding to the new encoding.

In our WLATIN1 SAS data set, every character is encoded in 1 byte. The variable 'symbol' is defined with a length of $1, so it can hold only 1 byte.

After transcoding to UTF-8, some of these characters require more bytes. The euro sign €, for example, takes 3 bytes (see the table below), while the other characters take 2 bytes or 1 byte. The variable length, however, is unchanged: it is still $1, which is too small for the 2 or 3 bytes that some of these characters now need.

 

WLATIN1                             Symbol   UTF-8
Hexadecimal representation  Length           Length  Hexadecimal representation
'80'x                       1       €        3       'E282AC'x
'A3'x                       1       £        2       'C2A3'x
'A5'x                       1       ¥        2       'C2A5'x
'2B'x                       1       +        1       '2B'x
'A2'x                       1       ¢        2       'C2A2'x
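
The UTF-8 byte counts in the table can be checked with a short DATA step. This is a minimal sketch that assumes it is submitted in a UTF-8 SAS session, so that the literals are stored as UTF-8; the variable names bytes and chars are illustrative only.

```sas
/* Assumption: this step runs in a UTF-8 SAS session */
data _null_;
   length symbol $4;                 /* up to 4 bytes, enough for any single UTF-8 character */
   do symbol = '€', '£', '¥', '+', '¢';
      bytes = length(symbol);        /* LENGTH counts bytes */
      chars = klength(symbol);       /* KLENGTH counts characters */
      put symbol= bytes= chars=;
   end;
run;
```

Each symbol should report one character, with byte counts matching the UTF-8 length column above. The byte count, not the character count, is what a $n variable length has to accommodate.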

 

Migrate a SAS Data Set from a UTF-8 SAS Session

 

When running from a UTF-8 SAS session, the %COPY_TO_UTF8 macro can help migrate the SAS data set.

 

%COPY_TO_UTF8(from, to)

 

The %COPY_TO_UTF8 macro creates a new, UTF-8 version of a data set. The macro can be used only in a UTF-8 SAS session and can import data sets from any encoding.

%COPY_TO_UTF8(wlt1.symbols, symbols);

The macro calculates the minimum length required for each character variable in the data set and generates a new LENGTH statement when re-creating the SAS data set.

To accomplish this, the macro reads the data through the CVP engine with CVPMULTIPLIER set to 4, which ensures that nothing is truncated during transcoding. After the entire data set has been read, the new length is calculated from the transcoded strings.

In our example, the new length needed is now 3 (bytes), and the SAS data set is re-created with the new length definition.

 

Generated code seen from the log:

 

MPRINT(COPY_TO_UTF8):   libname _INU8CVP CVP "viya4/cs/wlt1" CVPMULTIPLIER=4;
/* additional code executed to retrieve the transcodable character variable */

/* Data set is re-created with a new length */
MPRINT(COPY_TO_UTF8):   data symbols;
MPRINT(COPY_TO_UTF8):   length symbol $3 ;
MPRINT(COPY_TO_UTF8):   set _INU8CVP.SYMBOLS;
MPRINT(COPY_TO_UTF8):   run;
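
The log excerpt does not show how the length of 3 is derived. The following is a rough sketch of the idea, not the macro's actual code: it assumes a UTF-8 session, reuses the placeholder path from the earlier LIBNAME statement, and uses a hypothetical libref (cvpin) and macro variable (newlen).

```sas
/* Assumption: UTF-8 session; 'wlt1 library' is the placeholder path used earlier */
libname cvpin cvp 'wlt1 library' cvpmultiplier=4;

/* Longest transcoded value, in bytes, across all rows */
proc sql noprint;
   select max(length(symbol)) into :newlen trimmed
   from cvpin.symbols;
quit;

%put Minimum length for symbol: &newlen byte(s);

/* Re-create the data set with the computed length */
data symbols;
   length symbol $&newlen;
   set cvpin.symbols;
run;
```

A real data set would need this calculation for every character variable, which is the bookkeeping the macro automates.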

 

 

 

Note: A warning is displayed for each variable which has a new length. This is expected since we are re-defining the length for some of the variables. 
WARNING: Multiple lengths were specified for the variable symbol by input data set(s). This can cause truncation of data.

 

Running the same PROC PRINT and PROC CONTENTS on the newly created SAS data set shows that it contains the expected characters and has new properties: the data set's encoding is now UTF-8 and the variable length is now 3 (bytes).

[Images: utf8_ds.png (PROC PRINT output), utf8_ds_properties.png (PROC CONTENTS output)]

Although the macro uses the CVP engine, the lengths in the resulting data set differ slightly from what the CVP engine alone would produce. The CVP engine multiplies the length of every character variable by the constant specified with CVPMULTIPLIER, whereas the macro re-defines a length only when needed and uses the largest length actually found for that variable.
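
For comparison, here is a minimal sketch of a copy made with the CVP engine alone. It assumes a UTF-8 session and the same placeholder path as before; the libref cvponly and the output name symbols_cvp are hypothetical.

```sas
/* Assumption: UTF-8 session; 'wlt1 library' is the placeholder path used earlier */
libname cvponly cvp 'wlt1 library' cvpmultiplier=4;

data symbols_cvp;
   set cvponly.symbols;   /* lengths are expanded by the engine, not recalculated */
run;

proc contents data=symbols_cvp;
run;
```

With a multiplier of 4, PROC CONTENTS should report a length of 4 for symbol, one byte more than the length of 3 that the macro assigns.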

 

Migrate from a non-UTF-8 SAS Session

 

The %COPY_TO_NEW_ENCODING macro can also be used to re-create a data set in a different encoding. The encoding of the source data set must be the same as your SAS session encoding.

 

%COPY_TO_NEW_ENCODING(from, to, new_encoding);

 

In this example, the macro is executed from a WLATIN1 SAS session:

%COPY_TO_NEW_ENCODING(wlt1.symbols, u8.symbols_u8, UTF-8);

Like %COPY_TO_UTF8, this macro calculates the minimum length needed by each character variable. It then creates the data set in the specified encoding, using a new length definition when needed. The CVP engine is not used here.

 

Here is the log with the generated code:

 

%COPY_TO_NEW_ENCODING(wlt1.symbols, u8.symbols_u8, UTF-8);

MPRINT(COPY_TO_NEW_ENCODING):   data u8.symbols_u8(encoding="UTF-8");
MPRINT(COPY_TO_NEW_ENCODING):   length symbol $3 ;
MPRINT(COPY_TO_NEW_ENCODING):   set wlt1.symbols;
MPRINT(COPY_TO_NEW_ENCODING):   run;
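
The log shows only the final DATA step, not how the length of 3 is obtained without the CVP engine. One possible approach, offered as an assumption rather than as the macro's actual implementation, is to transcode each value in memory with the KCVT function and keep the longest result. The names u8val, maxlen and newlen are hypothetical.

```sas
/* Assumption: WLATIN1 session, with wlt1.symbols stored in WLATIN1 */
data _null_;
   set wlt1.symbols end=last;
   length u8val $4;                            /* up to 4 bytes per source byte */
   retain maxlen 0;
   u8val = kcvt(symbol, 'wlatin1', 'utf-8');   /* transcode the value in memory */
   maxlen = max(maxlen, length(u8val));        /* LENGTH counts bytes */
   if last then call symputx('newlen', maxlen);
run;

%put symbol needs &newlen byte(s) in UTF-8;
```

The resulting value could then feed a LENGTH statement like the one in the generated code above.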

 
