Migrating an entire library of data sets from other encodings to UTF-8 encoding isn’t as simple as you might think. The main reason is that truncation can occur when characters in the original encoding are converted to an encoding that requires more bytes to represent those same characters. For example, when characters that are encoded as WLATIN1, where every character is represented using 1 byte, are transcoded to UTF-8, where some characters require 2 or more bytes, truncation can occur if the character variable is not wide enough.
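To see why the byte count matters, here is a minimal sketch (not part of the original example) that compares byte length with character length. It assumes the code is submitted in a UTF-8 session; the variable names bytes and chars are only for illustration.

```sas
/* Assumption: this step runs in a UTF-8 session. */
data _null_;
  drink = 'café';
  bytes = length(drink);   /* LENGTH counts bytes */
  chars = klength(drink);  /* KLENGTH counts characters */
  put bytes= chars=;
run;
```

In a UTF-8 session the log should report 5 bytes for 4 characters, because é takes 2 bytes in UTF-8. In a LATIN1 session the same value fits in 4 bytes, which is why a 4-byte variable is too narrow after transcoding.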
To better support customers, SAS Viya 4 includes a new generic approach that uses PROC MIGRATE and the CVP (character variable padding) engine to migrate data to UTF-8. The documentation includes instructions for this approach, and this post walks through it.
PROC MIGRATE is usually the best way to migrate members in a SAS library to the current SAS release. PROC MIGRATE is a one-step copy procedure that retains the data attributes that most users want in a data migration. Migrated data sets take on the data representation and encoding attributes of the target library.
However, when you migrate a data set that contains non-ASCII characters (for example, characters 128-255 in a single-byte character set) from a legacy encoding to UTF-8, truncation might occur because those characters require more bytes in UTF-8.
This example shows truncation problems when a LATIN1 data set is migrated to UTF-8.
Submit this code in a LATIN1 session.
libname source 'source';
data source.test;
drink = 'café';
run;
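Each step in this post depends on the encoding of the session in which it runs. As a quick check (an addition to the original example, not part of it), the following sketch writes the session encoding to the log so that you can confirm it before you submit the code.

```sas
/* Write the value of the SYSENCODING automatic macro variable to the log */
%put &=sysencoding;

/* Show the current value of the ENCODING= system option */
proc options option=encoding;
run;
```

The session encoding is set when SAS starts, so if the value is not the one you expect, you need a session that was started with the other encoding.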
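Submit this code in a UTF-8 session. Because the data set encoding differs from the session encoding, SAS uses CEDA (Cross-Environment Data Access) to transcode the data, and an error is issued because characters would be truncated.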
libname source 'source';
libname target 'target';
proc migrate in=source out=target;
run;
SAS writes the following messages to the log:
NOTE: Migrating SOURCE.TEST to TARGET.TEST (memtype=DATA).
ERROR: Some character data in the data set "SOURCE.TEST" was lost during transcoding. Truncation occurred during transcoding to the new encoding. To avoid the transcoding error, please refer to "Troubleshooting Truncation and Data loss Issue during Transcoding" in SAS National Language Support (NLS): reference guide.
NOTE: The SAS System stopped processing this step because of errors.
Avoid Truncation When Copying a SAS Library
Using the CVP engine will pad the character variables and avoid truncation. By default, the CVP engine automatically chooses a multiplier value. The automatic value is usually sufficient to avoid truncation. You can also use the CVPMULTIPLIER= option to specify it yourself. Libraries accessed with the CVP engine are read-only. If you want to save a permanent copy of the data, you need to create new data sets.
To run the example below, first run PROC CONTENTS to see the length of the variables in the data set SOURCE.TEST, which was created in the section above.
libname source 'source';
proc contents data=source.test;
run;
In the PROC CONTENTS output, notice that the one character variable, drink, has a length of 4.
Portion of PROC CONTENTS output showing variable lengths before expansion
Alphabetic List of Variables and Attributes
#  Variable  Type  Len
1  drink     Char  4
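When a library has many members, running PROC CONTENTS on each one gets tedious. As a rough sketch (an addition to the original example), you could query DICTIONARY.COLUMNS to list the length of every character variable in the library. It assumes the SOURCE libref assigned above.

```sas
/* List every character variable in the SOURCE library with its length. */
/* Libref values in DICTIONARY tables are stored in uppercase.          */
proc sql;
  select memname, name, length
    from dictionary.columns
    where libname = 'SOURCE' and type = 'char'
    order by memname, name;
quit;
```

Running the same query against the target library after a copy or migration gives a side-by-side view of which lengths were expanded.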
This example uses the CVP engine with the V9 engine to expand the length of character variables. The CVP engine can help you avoid truncation if you copy a data set to an encoding that uses more bytes to represent the characters.
libname source cvp 'source' cvpengine=v9 cvpmult=2;
libname target 'target';
proc copy in=source out=target;
run;
proc contents data=target.test;
run;
Portion of PROC CONTENTS output showing variable lengths after expansion
Alphabetic List of Variables and Attributes
#  Variable  Type  Len
1  drink     Char  8
Here is another process for migrating a SAS library. In this case, the SOURCE library contains indexes or integrity constraints. PROC COPY does not support these under CEDA processing, but PROC MIGRATE does. However, in SAS Viya 4 releases before 2020.1.3, PROC MIGRATE does not support the CVP engine. Therefore, if you want to migrate indexes or integrity constraints in those releases, you must first copy the library with the CVP engine and then migrate it (in other words, a two-step process).
To run this example, first create a data set that has an index. Submit this code in a LATIN1 session.
libname source 'source';
data source.class (index=(age));
set sashelp.class;
run;
Here is the two-step process to avoid data truncation while migrating a library:
To avoid CEDA processing, submit this code in the SOURCE environment (LATIN1 session) where the data was created. Use PROC COPY with the CVP engine to expand the variable length for all character variables. Do not specify NOCLONE.
libname source cvp 'source' cvpmult=2;
libname copy 'source-copy';
proc copy in=source out=copy constraint=yes;
run;
Submit this code in the TARGET environment (UTF-8 session). Use PROC MIGRATE to migrate the library.
libname copy 'source-copy';
libname target 'target';
proc migrate in=copy out=target;
run;
SAS writes the following messages to the log. Notice that the simple index has been re-created by PROC MIGRATE.
NOTE: The BUFSIZE= option was not specified with the MIGRATE procedure. The migrated library members will use the current value for BUFSIZE. For more information, see the PROC MIGRATE documentation.
NOTE: Migrating COPY.CLASS to TARGET.CLASS (memtype=DATA).
NOTE: Simple index Age has been defined.
NOTE: The data set TARGET.CLASS has 19 observations and 5 variables.
NOTE: Migrating COPY.TEST to TARGET.TEST (memtype=DATA).
NOTE: The data set TARGET.TEST has 1 observations and 1 variables.
In SAS Viya 4 (2020.1.3 and later releases), PROC MIGRATE supports the CVP engine, so you no longer need two steps to convert SAS data sets and other files to UTF-8 without truncation.
Here is an example of migrating with the CVP engine to avoid truncation. Submit this code in a UTF-8 session.
libname source cvp 'source' cvpmult=2;
libname target 'target';
proc migrate in=source out=target;
run;
proc contents data=target.test;
run;
proc contents data=target.class;
run;
The following SAS log messages indicate a successful migration.
NOTE: The BUFSIZE= option was not specified with the MIGRATE procedure. The migrated library members will use the current value for BUFSIZE. For more information, see the PROC MIGRATE documentation.
NOTE: Migrating SOURCE.CLASS to TARGET.CLASS (memtype=DATA).
NOTE: Simple index Age has been defined.
NOTE: The data set TARGET.CLASS has 19 observations and 5 variables.
NOTE: Migrating SOURCE.TEST to TARGET.TEST (memtype=DATA).
NOTE: The data set TARGET.TEST has 1 observations and 1 variables.
Portion of PROC CONTENTS output showing variable lengths after migration
Alphabetic List of Variables and Attributes
#  Variable  Type  Len
1  drink     Char  8
Portion of PROC CONTENTS output showing indexes and attributes after migration
Alphabetic List of Indexes and Attributes
#  Index  # of Unique Values
1  Age    6
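Migrated data sets take on the encoding of the target library, and the PROC CONTENTS output also reports it in its attributes section. If you prefer to check it in code, here is a small sketch (an addition to the original example) that reads the encoding attribute with the ATTRC function. It assumes the TARGET libref and the TEST data set from the steps above; the variable names dsid, enc, and rc are only for illustration.

```sas
/* Report the encoding of a migrated data set in the log. */
data _null_;
  length enc $ 64;
  dsid = open('target.test');        /* returns 0 if the data set cannot be opened */
  if dsid > 0 then do;
    enc = attrc(dsid, 'ENCODING');   /* encoding attribute of the data set */
    rc = close(dsid);
    put 'TARGET.TEST encoding: ' enc;
  end;
  else put 'Could not open TARGET.TEST.';
run;
```

After a successful migration in a UTF-8 session, the reported value should name UTF-8.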
In SAS Viya 4, PROC MIGRATE works with the CVP engine. If you change to a different character encoding that uses more bytes to represent the characters, you might want to use the CVP engine as part of the migration process.