Migrating an entire library of data sets from other encodings to UTF-8 encoding isn’t as simple as you might think. The main reason is that truncation can occur when characters in the original encoding are converted to an encoding that requires more bytes to represent those same characters. For example, when characters that are encoded as WLATIN1, where every character is represented using 1 byte, are transcoded to UTF-8, where some characters require 2 or more bytes, truncation can occur if the character variable is not wide enough.
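To make the byte expansion concrete, here is a minimal sketch that compares the byte length and the character length of the same value. It assumes a UTF-8 SAS session, and the variable names bytes and chars are only illustrative.

```sas
/* Minimal sketch: assumes a UTF-8 session. Names are illustrative. */
data _null_;
   length drink $ 8;          /* wide enough to hold the UTF-8 bytes */
   drink = 'café';
   bytes = length(drink);     /* LENGTH counts bytes, ignoring trailing blanks */
   chars = klength(drink);    /* KLENGTH counts characters */
   put bytes= chars=;
run;
```

In a UTF-8 session the two counts should differ, because é takes 2 bytes (5 bytes for 4 characters). In a LATIN1 session both counts would be 4.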
To better support customers, SAS Viya 4 includes a new generic approach that uses PROC MIGRATE and the CVP (character variable padding) engine to migrate data to UTF-8. The documentation includes instructions on how to use this approach, and this post walks through it.
What Does the MIGRATE Procedure Do?
PROC MIGRATE is usually the best way to migrate members in a SAS library to the current SAS release. PROC MIGRATE is a one-step copy procedure that retains the data attributes that most users want in a data migration. Migrated data sets take on the data representation and encoding attributes of the target library.
However, when you migrate a data set from a legacy encoding to UTF-8, truncation might occur if the data contains non-ASCII characters (for example, characters 128-255 in a single-byte character set), because those characters require more bytes in UTF-8.
This example shows truncation problems when a LATIN1 data set is migrated to UTF-8.
Submit this code in a LATIN1 session.
libname source 'source';
data source.test;
drink = 'café';
run;
Submit this code in a UTF-8 session. SAS issues an error during CEDA processing because characters would be truncated.
libname source 'source';
libname target 'target';
proc migrate in=source out=target;
run;
SAS writes the following messages to the log:
NOTE: Migrating SOURCE.TEST to TARGET.TEST (memtype=DATA).
ERROR: Some character data in the data set "SOURCE.TEST" was lost during
transcoding. Truncation occurred during transcoding to the new
encoding. To avoid the transcoding error, please refer to
"Troubleshooting Truncation and Data loss Issue during Transcoding" in
SAS National Language Support (NLS): reference guide.
NOTE: The SAS System stopped processing this step because of errors.
This error message usually means that one or more character variables in the observation buffer of the data set do not have enough space to hold the data after it is converted to UTF-8. For more information, see Migrating Data from WLATIN1 to UTF-8 in SAS National Language Support (NLS): Reference Guide.
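Because the examples in this post depend on which encoding the session uses, it can help to confirm the session encoding before submitting any code. This is a minimal sketch that uses only the ENCODING system option and the SYSENCODING automatic macro variable.

```sas
/* Report the encoding of the current SAS session. */
proc options option=encoding;
run;

/* The same value is available in an automatic macro variable. */
%put &=sysencoding;
```

The encoding of an individual data set is also listed in the PROC CONTENTS output for that data set.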
Avoid Truncation When Copying a SAS Library
Using the CVP engine will pad the character variables and avoid truncation. By default, the CVP engine automatically chooses a multiplier value. The automatic value is usually sufficient to avoid truncation. You can also use the CVPMULTIPLIER= option to specify it yourself. Libraries accessed with the CVP engine are read-only. If you want to save a permanent copy of the data, you need to create new data sets.
To run the example below, first run PROC CONTENTS to see the length of the variables in the data set SOURCE.TEST, which was created in the section above.
libname source 'source';
proc contents data=source.test;
run;
In the PROC CONTENTS output, notice the one character variable, Drink, which has a length of 4.
Portion of PROC CONTENTS output showing variable lengths before expansion
Alphabetic List of Variables and Attributes
# Variable Type Len
1 drink Char 4
This example uses the CVP engine with the V9 engine to expand the length of character variables. The CVP engine can help you avoid truncation if you copy a data set to an encoding that uses more bytes to represent the characters.
libname source cvp 'source' cvpengine=v9 cvpmult=2;
libname target 'target';
proc copy in=source out=target;
run;
proc contents data=target.test;
run;
The first LIBNAME statement assigns the SOURCE library to the CVP engine and the location of the data that you want to copy. The CVPENGINE= option specifies the V9 engine as the underlying engine to process the data. The CVPMULT= option specifies a multiplication factor of 2 to expand all character variables.
The second LIBNAME statement assigns the target library to contain the copied data.
The COPY procedure copies the SOURCE library to the TARGET library. During the copy, the CVP engine expands each character variable to 2 times its original length.
The CONTENTS procedure shows that the lengths of the character variables have been multiplied by 2. For Drink, 4 × 2 = 8.
Portion of PROC CONTENTS output showing variable lengths after expansion
Alphabetic List of Variables and Attributes
# Variable Type Len
1 drink Char 8
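A multiplier of 2 is a choice, not a requirement. One way to check how much room the data needs is to read it through the CVP engine in a UTF-8 session and measure the longest transcoded value. This is a minimal sketch that assumes a UTF-8 session and the SOURCE library from the example above. The column alias max_bytes_needed is illustrative.

```sas
/* Minimal sketch: assumes a UTF-8 session and the SOURCE library above. */
libname source cvp 'source' cvpmult=2;

proc sql;
   /* LENGTH returns the byte length of each transcoded value. */
   select max(length(drink)) as max_bytes_needed
      from source.test;
quit;
```

For the value 'café', this would be expected to report 5, which fits in the expanded length of 8. If the multiplier used for the measurement is itself too small, the values are already truncated when they are read, so use a generous multiplier when measuring.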
Avoid Truncation When Migrating a SAS Library by Using a Two-Step Process
Here is another process for migrating a SAS library. In this case, the SOURCE library contains indexes or integrity constraints. PROC COPY does not support these under CEDA processing, but PROC MIGRATE does. In SAS Viya 4 releases prior to 2020.1.3, however, PROC MIGRATE does not support the CVP engine. Therefore, if you want to migrate indexes or integrity constraints in those releases, you must first copy the library with the CVP engine and then migrate the copy (in other words, a two-step process).
To run this example, first create a data set that has an index. Submit this code in a LATIN1 session.
libname source 'source';
data source.class (index=(age));
set sashelp.class;
run;
Here is the two-step process to avoid data truncation while migrating a library:
1. To avoid CEDA processing, submit this code in the SOURCE environment (LATIN1 session) where the data was created. Use PROC COPY with the CVP engine to expand the length of all character variables. Do not specify NOCLONE.
libname source cvp 'source' cvpmult=2;
libname copy 'source-copy';
proc copy in=source out=copy constraint=yes;
run;
2. Submit this code in the TARGET environment (UTF-8 session). Use PROC MIGRATE to migrate the library.
libname copy 'source-copy';
libname target 'target';
proc migrate in=copy out=target;
run;
SAS writes the following messages to the log. Notice that the simple index has been recreated by PROC MIGRATE.
NOTE: The BUFSIZE= option was not specified with the MIGRATE procedure. The
migrated library members will use the current value for BUFSIZE. For
more information, see the PROC MIGRATE documentation.
NOTE: Migrating COPY.CLASS to TARGET.CLASS (memtype=DATA).
NOTE: Simple index Age has been defined.
NOTE: The data set TARGET.CLASS has 19 observations and 5 variables.
NOTE: Migrating COPY.TEST to TARGET.TEST (memtype=DATA).
NOTE: The data set TARGET.TEST has 1 observations and 1 variables.
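Besides reading the log, you can check the migrated library with the SQL dictionary tables. This is a minimal sketch that assumes the TARGET library from the example above is still assigned. It lists the character variable lengths and the indexes that now exist in that library.

```sas
/* Minimal sketch: assumes the TARGET library above is assigned. */
proc sql;
   /* Character variable lengths after migration. */
   select memname, name, length
      from dictionary.columns
      where libname = 'TARGET' and type = 'char';

   /* Indexes that exist in the migrated library. */
   select memname, indxname, name
      from dictionary.indexes
      where libname = 'TARGET';
quit;
```

The library name is stored in uppercase in the dictionary tables, which is why the WHERE clauses compare against 'TARGET'.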
New Integration between PROC MIGRATE and the CVP Engine to Avoid Truncation
In SAS Viya 4 (2020.1.3 and later releases), PROC MIGRATE supports the CVP engine, so you no longer need a two-step process to convert SAS data sets and other files to UTF-8 without truncation.
Here is an example of migrating with the CVP engine to avoid truncation. Submit this code in a UTF-8 session.
libname source cvp 'source' cvpmult=2;
libname target 'target';
proc migrate in=source out=target;
run;
proc contents data=target.test;
run;
proc contents data=target.class;
run;
The first LIBNAME statement assigns the SOURCE library to the CVP engine and the location of the data that you want to migrate. The CVPMULT= option specifies a multiplication factor of 2 to expand all character variables.
The second LIBNAME statement assigns the target library to contain the migrated data.
The MIGRATE procedure migrates the SOURCE library to the TARGET library. During the migration, the CVP engine expands each character variable to 2 times its original length.
The first CONTENTS procedure shows that the lengths of the character variables have been multiplied by 2. For Drink, 4 × 2 = 8.
The second CONTENTS procedure shows that the simple index has been recreated.
The following SAS log messages indicate a successful migration.
NOTE: The BUFSIZE= option was not specified with the MIGRATE procedure. The
migrated library members will use the current value for BUFSIZE. For
more information, see the PROC MIGRATE documentation.
NOTE: Migrating SOURCE.CLASS to TARGET.CLASS (memtype=DATA).
NOTE: Simple index Age has been defined.
NOTE: The data set TARGET.CLASS has 19 observations and 5 variables.
NOTE: Migrating SOURCE.TEST to TARGET.TEST (memtype=DATA).
NOTE: The data set TARGET.TEST has 1 observations and 1 variables.
Portion of PROC CONTENTS output showing variable lengths after migration
Alphabetic List of Variables and Attributes
# Variable Type Len
1 drink Char 8
Portion of PROC CONTENTS output showing indexes and attributes after migration
Alphabetic List of Indexes and Attributes
# Index # of Unique Values
1 Age 6
Summary
In SAS Viya 4, PROC MIGRATE works with the CVP engine. If you change to a different character encoding that uses more bytes to represent the characters, you might want to use the CVP engine as part of the migration process.
Reference
Migrating Data to UTF-8 for the SAS Viya Platform