
Efficiently Migrating Data to UTF-8 Encoding

Started ‎09-26-2023 by
Modified ‎09-22-2023 by

Migrating an entire library of data sets from other encodings to UTF-8 encoding isn’t as simple as you might think. The main reason is that truncation can occur when characters in the original encoding are converted to an encoding that requires more bytes to represent those same characters. For example, when characters that are encoded as WLATIN1, where every character is represented using 1 byte, are transcoded to UTF-8, where some characters require 2 or more bytes, truncation can occur if the character variable is not wide enough.
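
As a rough illustration of the byte growth described above, the following sketch (not part of the original example set, and assuming it is submitted in a UTF-8 session) compares the number of bytes that a value occupies with the number of characters that it contains. The LENGTH function counts bytes, and the KLENGTH function counts characters.

```sas
/* Assumption: this step runs in a UTF-8 SAS session. */
data _null_;
   drink = 'café';
   bytes = length(drink);   /* bytes used by the value, trailing blanks excluded */
   chars = klength(drink);  /* number of characters in the value */
   put bytes= chars=;
run;
```

In a UTF-8 session, this should report 5 bytes for 4 characters, because é occupies 2 bytes in UTF-8. The same value occupies only 4 bytes in WLATIN1 or LATIN1.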

 

To better support customers, SAS Viya 4 includes a new generic approach to migrating data to UTF-8 that uses PROC MIGRATE with the CVP (character variable padding) engine. The documentation includes instructions for this approach, and this post walks through it.

 

 

What Does the MIGRATE Procedure Do?

 

PROC MIGRATE is usually the best way to migrate members in a SAS library to the current SAS release. PROC MIGRATE is a one-step copy procedure that retains the data attributes that most users want in a data migration. Migrated data sets take on the data representation and encoding attributes of the target library. 

 

However, when you migrate a data set that contains non-ASCII characters (for example, characters 128-255 in a single-byte character set) from a legacy encoding to UTF-8, truncation might occur because those characters require more bytes in UTF-8.
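
Before migrating, it can help to find the values that are at risk. The following is a minimal sketch, assuming a single-byte (LATIN1) session. The WORK.SAMPLE and WORK.AT_RISK data sets are hypothetical names used only for illustration. The regular expression matches any byte in the range 128-255.

```sas
/* Assumption: this code runs in a LATIN1 (single-byte) session. */
/* WORK.SAMPLE is a hypothetical data set used only for this illustration. */
data work.sample;
   length drink $8;
   drink = 'cafe'; output;
   drink = 'café'; output;
run;

/* Keep only the rows whose value contains a byte in the range 128-255. */
data work.at_risk;
   set work.sample;
   if prxmatch('/[\x80-\xFF]/', drink) > 0;
run;

proc print data=work.at_risk;
run;
```

Only values that contain such bytes can grow when they are transcoded to UTF-8, so a character variable with no flagged rows does not need extra room.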

 

This example shows truncation problems when a LATIN1 data set is migrated to UTF-8.

 

  1. Submit this code in a LATIN1 session.

    libname source 'source';
     
    data source.test;
       drink = 'café';
    run;
     
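  2. Submit this code in a UTF-8 session. CEDA (cross-environment data access) processing issues an error because the value would be truncated: é needs 2 bytes in UTF-8, so café needs 5 bytes, but Drink is only 4 bytes long.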

    libname source 'source';
    libname target 'target';
     
    proc migrate in=source out=target;
    run;
     

    SAS writes the following messages to the log:

     

    NOTE: Migrating SOURCE.TEST to TARGET.TEST (memtype=DATA).
    ERROR: Some character data in the data set "SOURCE.TEST" was lost during 
           transcoding. Truncation occurred during transcoding to the new 
           encoding. To avoid the transcoding error, please refer to 
           "Troubleshooting Truncation and Data loss Issue during Transcoding" in 
           SAS National Language Support (NLS): reference guide.
    NOTE: The SAS System stopped processing this step because of errors.

    This error message usually means that one or more character variables in the observation buffer of the data set do not have enough space to hold the data after it is converted to UTF-8. For more information, see Migrating Data from WLATIN1 to UTF-8 in SAS National Language Support (NLS): Reference Guide.
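
When this error appears, a quick way to confirm the encoding mismatch is to compare the session encoding with the encoding recorded in the data set. This is a minimal sketch, assuming the SOURCE library from the example above is still assigned. SYSENCODING is an automatic macro variable, and PROC CONTENTS lists the data set encoding among the data set attributes.

```sas
/* Write the encoding of the current SAS session to the log. */
%put NOTE: Session encoding is &sysencoding;

/* The Encoding attribute in the output shows how the data set was written. */
proc contents data=source.test;
run;
```

If the session reports UTF-8 and the data set reports a single-byte encoding such as latin1 or wlatin1, SAS must transcode the data, and truncation becomes possible.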

 

 

Avoid Truncation When Copying a SAS Library

 

The CVP engine pads character variables so that transcoded values still fit. By default, the CVP engine chooses a multiplier value automatically, and that value is usually sufficient to avoid truncation. You can also specify the multiplier yourself with the CVPMULTIPLIER= option (CVPMULT= in the examples below). Libraries accessed with the CVP engine are read-only, so if you want to save a permanent copy of the data, you need to create new data sets.

 

To run the example below, first run PROC CONTENTS to see the length of the variables in the data set SOURCE.TEST that was created in the section above.

 
libname source 'source';
 
proc contents data=source.test;
run;

 

In the PROC CONTENTS output, notice the single character variable, Drink, which has a length of 4.

 

Portion of PROC CONTENTS output showing variable lengths before expansion

 

                  Alphabetic List of Variables and Attributes
 
                         #    Variable    Type    Len

                         1    drink       Char      4

 

This example uses the CVP engine with the V9 engine to expand the length of character variables. The CVP engine can help you avoid truncation if you copy a data set to an encoding that uses more bytes to represent the characters.

 

libname source cvp 'source' cvpengine=v9 cvpmult=2;
libname target 'target';
 
proc copy in=source out=target;
run;
 
proc contents data=target.test;
run;

 

  1. The first LIBNAME statement assigns the SOURCE library to the CVP engine and the location of the data that you want to copy. The CVPENGINE= option specifies the V9 engine as the underlying engine to process the data. The CVPMULT= option specifies a multiplication factor of 2 to expand all character variables.
  2. The second LIBNAME statement assigns the target library to contain the copied data.
  3. The COPY procedure copies the SOURCE library to the TARGET library. During the copy, the CVP engine doubles the character variable lengths.
  4. The CONTENTS procedure shows that the lengths of the character variables have been multiplied by 2. For Drink, 4 × 2 = 8.

 

Portion of PROC CONTENTS output showing variable lengths after expansion

 

                  Alphabetic List of Variables and Attributes
 
                         #    Variable    Type    Len

                         1    drink       Char      8
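
To check that the copied value now fits, you can compare the bytes that the value uses with the bytes that the variable provides. This is a sketch only, assuming a UTF-8 session and the TARGET.TEST data set created by the PROC COPY step above. The variable names bytes_used and bytes_avail are illustrative.

```sas
/* Assumption: UTF-8 session, TARGET.TEST created by the PROC COPY step. */
data _null_;
   set target.test;
   bytes_used  = length(drink);   /* bytes occupied by the value */
   bytes_avail = vlength(drink);  /* declared length of the variable in bytes */
   put drink= bytes_used= bytes_avail=;
run;
```

As long as bytes_used does not exceed bytes_avail for any row, the expanded length is large enough for the transcoded data.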

 

 

Avoid Truncation When Migrating a SAS Library by Using a Two-Step Process

 

Here is another process for migrating a SAS library. In this case, the SOURCE library contains indexes or integrity constraints. PROC COPY does not support these under CEDA processing, but PROC MIGRATE does. However, in SAS Viya 4 releases before 2020.1.3, PROC MIGRATE does not support the CVP engine. Therefore, if you want to migrate indexes or integrity constraints in those releases, you must first copy the library with the CVP engine and then migrate the copy (in other words, a two-step process).

 

To run this example, first create a data set that has an index. Submit this code in a LATIN1 session.

 

libname source 'source';
 
data source.class (index=(age));
   set sashelp.class;
run;

 

Here is the two-step process to avoid data truncation while migrating a library:

 

  1. To avoid CEDA processing, submit this code in the SOURCE environment (LATIN1 session) where the data was created. Use PROC COPY with the CVP engine to expand the variable length for all character variables. Do not specify NOCLONE.

    libname source cvp 'source' cvpmult=2;
    libname copy 'source-copy';
     
    proc copy in=source out=copy constraint=yes;
    run;
     
  2. Submit this code in the TARGET environment (UTF-8 session). Use PROC MIGRATE to migrate the library.

    libname copy 'source-copy';
    libname target 'target';
     
    proc migrate in=copy out=target;
    run;

     

    SAS writes the following messages to the log. Notice that the simple index has been re-created by PROC MIGRATE.

     
    NOTE: The BUFSIZE= option was not specified with the MIGRATE procedure. The 
          migrated library members will use the current value for BUFSIZE. For 
          more information, see the PROC MIGRATE documentation.
    NOTE: Migrating COPY.CLASS to TARGET.CLASS (memtype=DATA).
    NOTE: Simple index Age has been defined.
    NOTE: The data set TARGET.CLASS has 19 observations and 5 variables.
    NOTE: Migrating COPY.TEST to TARGET.TEST (memtype=DATA).
    NOTE: The data set TARGET.TEST has 1 observations and 1 variables.
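
Besides reading the log, you can ask SAS to describe the migrated table. This is a minimal sketch, assuming the TARGET library from step 2 is still assigned. The DESCRIBE TABLE statement in PROC SQL writes the table definition to the log, along with a CREATE INDEX statement for each index on the table.

```sas
/* Write the definition of TARGET.CLASS, including its indexes, to the log. */
proc sql;
   describe table target.class;
quit;
```

If the index on Age was migrated, the log should contain a CREATE INDEX statement for Age after the CREATE TABLE statement.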
 

 

New Integration between PROC MIGRATE and the CVP Engine to Avoid Truncation

 

In SAS Viya 4 (2020.1.3 and later releases), PROC MIGRATE supports the CVP engine, so you no longer need a two-step process to convert SAS data sets and other files to UTF-8 without truncation.

 

Here is an example of migrating with the CVP engine to avoid truncation. Submit this code in a UTF-8 session.

 

libname source cvp 'source' cvpmult=2;
libname target 'target';
 
proc migrate in=source out=target;
run;
 
proc contents data=target.test;
run;
 
proc contents data=target.class;
run;

 

  1. The first LIBNAME statement assigns the SOURCE library to the CVP engine and the location of the data that you want to migrate. The CVPMULT= option specifies a multiplication factor of 2 to expand all character variables.
  2. The second LIBNAME statement assigns the target library to contain the migrated data.
  3. The MIGRATE procedure migrates the SOURCE library to the TARGET library. During the migration, the CVP engine doubles the character variable lengths.
  4. The first CONTENTS procedure shows that the lengths of the character variables have been multiplied by 2. For Drink, 4 × 2 = 8.
  5. The second CONTENTS procedure shows that the simple index has been recreated.

 

The following SAS log messages indicate a successful migration.

 

NOTE: The BUFSIZE= option was not specified with the MIGRATE procedure. The 
      migrated library members will use the current value for BUFSIZE. For 
      more information, see the PROC MIGRATE documentation.
NOTE: Migrating SOURCE.CLASS to TARGET.CLASS (memtype=DATA).
NOTE: Simple index Age has been defined.
NOTE: The data set TARGET.CLASS has 19 observations and 5 variables.
NOTE: Migrating SOURCE.TEST to TARGET.TEST (memtype=DATA).
NOTE: The data set TARGET.TEST has 1 observations and 1 variables.

 

Portion of PROC CONTENTS output showing variable lengths after migration

 

                  Alphabetic List of Variables and Attributes
 
                         #    Variable    Type    Len

                         1    drink       Char      8

 

Portion of PROC CONTENTS output showing indexes and attributes after migration

 

                   Alphabetic List of Indexes and Attributes
 
                                             # of
                                           Unique
                             #    Index    Values

                             1    Age           6

 

 

Summary

 

In SAS Viya 4, PROC MIGRATE works with the CVP engine. If you change to a different character encoding that uses more bytes to represent the characters, you might want to use the CVP engine as part of the migration process.

 

 
