SAS 9.4 Latin1 encoding to SAS Viya4 UTF-8 encoding

OMH · Posted 12-07-2023 02:07 AM

Abstract

Moving data from a Latin1-encoded SAS 9.4 environment, on either Windows or Linux, to a SAS Viya4 Compute or CAS engine running in UTF-8 requires certain measures for the migration to succeed. In this Juletip I will showcase what can happen if we skip those measures, the consequences of doing so, and the steps that need to be taken when migrating from SAS 9.4 to SAS Viya4.

Introduction

UTF-8 stands for Unicode Transformation Format 8-bit and is a variable-width character encoding. It can represent every character in the Unicode character set, making it a universal encoding for text. UTF-8 is commonly used for internationalization because it supports a wide range of characters from different languages and scripts.

Internationalization – i18n

I18n is the name SAS uses for the internationalization report in the SAS Content Assessment framework. The abbreviation stands for the letter i, plus the 18 letters in between, plus the ending letter n.

Why is Internationalization (i18n) Important in a SAS 9.4 to SAS Viya4 Migration?
Data and program compatibility becomes important when migrating data and programs from SAS 9.4 to SAS Viya4. SAS Viya4 is designed to support a more diverse and global set of data and applications and uses UTF-8 as its default encoding.

SAS Viya4 uses UTF-8 rather than LATIN1 because UTF-8 supports a broader range of characters from various languages. This ensures that data containing characters from different languages can be accurately represented and processed in SAS Viya4. As organizations operate on a global scale, their data and programs may need to handle diverse languages and character sets. UTF-8 allows for scalability and flexibility in handling diverse linguistic and cultural data.

Example of a problem that can occur when the migration steps are skipped: if you read SAS 9 data sets that were created with any other encoding, the log may contain this message:

ERROR: Some character data was lost during transcoding in the data set ZHOLD.CARS. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding

This error usually means that there is not enough space in one or more character columns in the data set's observation buffer to convert the data to UTF-8.
If this error occurs, you can use the character variable padding (CVP) LIBNAME engine to create an in-memory copy of the data that has larger character columns. The CVP engine adds space to the character columns. By default, the column length is multiplied by 1.5. Use the CVP option CVPMULT= to control the amount of padding.

Some characters require more space in UTF-8 than in Latin1. The “€” takes 3 bytes in UTF-8, while the Norwegian “æ”, “ø” and “å” and accented characters such as “ó” (used in Spanish and other languages) take 2 bytes each.
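
To see how many bytes these characters occupy, a minimal sketch like the following can be run in a UTF-8 session (for example SAS Studio on SAS Viya). It assumes the program source itself is saved as UTF-8; the variable names ch, bytes and chars are made up for the illustration.

```sas
/* Sketch: compare byte length and character length in a UTF-8 session */
data _null_;
  length ch $ 4;                 /* room for up to 4 bytes per character */
  do ch = "€", "æ", "ø", "å", "ó";
    bytes = lengthn(ch);         /* lengthn counts bytes                 */
    chars = klength(ch);         /* klength counts characters            */
    put ch= bytes= chars=;
  end;
run;
```

Whenever bytes is larger than chars, the character needs more room in UTF-8 than the single byte it occupies in a single-byte encoding, and the column has to be expanded accordingly.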

Using the wrong encoding settings may cause some text to be stored, displayed and interpreted incorrectly.
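
Before moving any data, it can help to confirm which encoding the SAS session and a given data set actually use. A minimal sketch, with sashelp.class used only as a stand-in for one of your own data sets:

```sas
/* Show the encoding of the current SAS session in the log */
proc options option=encoding;
run;

/* The same value is available in the automatic macro variable SYSENCODING */
%put NOTE: Session encoding is &sysencoding;

/* The Encoding attribute in the proc contents output shows */
/* the encoding the data set was created with               */
proc contents data=sashelp.class;
run;
```

If the session encoding and the data set encoding differ, SAS has to transcode the character data when it reads the data set, and that is where the length problems described here show up.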

Examples of Latin1 to UTF-8 conversion code:

To convert the non-ASCII euro sign “€” stored in a SAS 9.4 data set, you need to expand the variable that holds the character, here charvar1 in the example below. We use cvpmultiplier=5 to allow this expansion.

SAS Code example:
* adjust the PATH to something that fits your running environment.

/* SAS 9.4 libname for storing a data set with a non-ASCII character */
libname sas9in "PATH*\sas9data" ;

/* Create a SAS 9.4 data set with a non-ASCII character */
data sas9in.mydata;
  charvar1 = "€";
run;

/* Clear all libnames, including the one that holds the data set, */
/* so that we can reassign it using the cvp engine and options    */
libname _all_ clear;

/* Assign the libname that holds the non-ASCII character data     */
/* using the cvp engine with the inencoding and cvp options.      */
/* The cvp engine creates a read-only libname.                    */
/* inencoding tells SAS which encoding the data was created with. */
/* cvpmultiplier multiplies the length of each character variable */
/* by 5 - as charvar1 is 1 byte, it becomes 5 bytes.              */
libname sas9data cvp "PATH*\sas9data" inencoding=latin1 cvpmultiplier=5;

/* Assign the UTF-8 output libname that we copy data to, using    */
/* the outencoding option to specify the encoding of the result   */
/* data sets                                                      */
libname utf8data "PATH*\utf8data" outencoding=utf8 ;

/* Convert the SAS 9.4 data set to UTF-8 using proc copy with the */
/* noclone option, which prevents the input data set attributes   */
/* (such as encoding) from being copied to the new data set       */
proc copy noclone in=sas9data out=utf8data;
  select mydata;
run;

/* Compare the source data set (read through cvp) with its UTF-8 copy */
proc compare data=sas9data.MYDATA compare=utf8data.MYDATA;
run;

Explanation in details of what happens:

Step 1: Define the SAS 9.4 Data Library

libname sas9in "PATH*\sas9data" ;

This statement creates a SAS library reference named sas9in that points to the SAS 9.4 data directory located at the specified path. This allows SAS to access the data sets stored in the SAS 9.4 library.

Step 2: Create a SAS 9.4 Data Set with a Non-ASCII Character

data sas9in.mydata;
  charvar1 = "€";
run;

This section creates a SAS 9.4 data set named mydata within the sas9in library. It defines a character variable charvar1 and assigns it a non-ASCII character, the euro symbol (€). In Latin1, charvar1 has a length of 1 byte.

Step 3: Clear the Libname

libname _all_ clear;

The libname _all_ clear; statement removes all existing SAS library references, including the sas9in library. This is necessary to avoid conflicts when re-creating the library with specific encoding options.

Step 4: Reassign the Libname with Encoding Options

libname sas9data cvp "PATH*\sas9data" inencoding=latin1 cvpmultiplier=5;

This statement assigns the sas9data library reference to the same directory, this time using the CVP (character variable padding) engine. The inencoding=latin1 option specifies that the data within the library is encoded in Latin1. The cvpmultiplier=5 option tells the CVP engine to multiply the length of every character variable by 5, which leaves room for the extra bytes that UTF-8 needs.

Step 5: Create the UTF-8 Output Libname

libname utf8data "PATH*\utf8data" outencoding=utf8 ;

This statement creates a SAS library reference named utf8data that points to the UTF-8 data directory located at the specified path. It also sets the outencoding option to utf8, indicating that any data written to this library should be encoded in UTF-8 format.

Step 6: Convert Data Set Encoding

proc copy noclone in=sas9data out=utf8data;
  select mydata;
run;

This section uses proc copy to copy the mydata data set from the sas9data library (Latin1 encoding) to the utf8data library (UTF-8 encoding). The noclone option tells proc copy not to clone the attributes of the input data set, such as its encoding and data representation, so the copy takes the UTF-8 encoding specified on the output library. The select statement specifies that only the mydata data set should be copied.

Step 7: Compare Data Sets

Properties for the input data set, using the cvpmultiplier option:

Variable charvar1 has a length of 5 bytes when read through the CVP libname.

The output inherits the length defined by the input libname, which leaves room for the special character “€”.

Properties for the output data set using UTF-8:

Variable charvar1 has a length of 5 bytes in its UTF-8 representation.
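
The lengths above can also be checked in code. A sketch, assuming the sas9data and utf8data librefs from the earlier steps are still assigned:

```sas
/* List name, type and length of the columns in both copies of mydata.  */
/* Libref and member names are stored in upper case in the dictionary.  */
proc sql;
  select libname, memname, name, type, length
    from dictionary.columns
    where libname in ('SAS9DATA', 'UTF8DATA')
      and memname = 'MYDATA';
quit;
```

Both rows should report the same length for charvar1 if the copy inherited the padded length from the CVP libname.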

Opening the utf8data.mydata data set in SAS Studio on SAS Viya4 shows the correct “€” character in the charvar1 variable.

If we try to open the sas9data.mydata data set in SAS Studio on SAS Viya4 without the required migration steps, by just copying the data set into SAS Viya4, we will not be able to open it and SAS Studio gives an error message.

Running proc compare:

We run proc compare to make sure that the two data sets are exactly equal.

proc compare data=sas9data.MYDATA compare=utf8data.MYDATA;
run;

This statement compares the original mydata data set in the sas9data library (Latin1 encoding) with the converted mydata data set in the utf8data library (UTF-8 encoding). It checks for any discrepancies between the data sets to ensure that the conversion process was successful.

[Screenshot: output from Proc Compare]

If we do not honor the difference between LATIN1 and UTF-8 encoding, we will get “garbage” in the data when moving SAS data sets from SAS 9.4 to SAS Viya, or from SAS Viya to a database.

“Garbage” in character variable(s):

Running in SAS 9.4 Foundation client and Latin1 encoding:
This SAS code runs a single data step. Because it uses data _null_, no data set is created; the step only performs some character manipulation and writes the results to the log.

data _null_ ;
    str1= "€123" ;
    s1=substr(str1,1,1) ; sl1=length(s1); l1=length(str1) ;
   put s1= / sl1= / l1= / ;
run ;

The output will look like this:
 s1=€, sl1=1, l1=4

So, the final output indicates that the substring `s1` is "€" with a length of 1 character, and the original string `str1` has a length of 4 characters.

Running in SAS Viya SAS Studio client and UTF-8 encoding:
The same data step, which again creates no data set and only writes to the log, gives a different result here than in SAS 9.4 with Latin1 encoding:

data _null_ ;
    str1= "€123" ;
    s1=substr(str1,1,1) ; sl1=length(s1); l1=length(str1) ;
   put s1= / sl1= / l1= / ;
run ;

The output will look like this:
s1=�, sl1=1, l1=6

Here the session encoding is UTF-8, and the code processes a string containing the euro symbol (€) followed by the characters "123". The substr and length functions work on bytes, not on characters, which leads to unexpected output. Let's break it down:

s1=�: The output for s1 is "�" instead of "€". In UTF-8 the euro symbol is stored in 3 bytes, and substr(str1,1,1) returns only the first of them. A single byte of a multi-byte character is not a valid character, so it is displayed as �.
sl1=1: The output for sl1 is 1, because s1 holds exactly 1 byte.
l1=6: The output for l1 is 6, because length counts bytes: 3 bytes for "€" plus 3 bytes for "123". The string still has only 4 characters.

In the previous code, run in Latin1 where every character is a single byte, the euro symbol "€" was processed correctly, resulting in s1=€, sl1=1, l1=4.

In this code, run in UTF-8, the byte-based functions split the euro symbol, resulting in s1=�, sl1=1, l1=6.

The key difference is that one character is no longer always one byte. This highlights the importance of consistent and accurate character encoding, especially when working with multilingual or special characters. The same applies to all non-ASCII characters when migrating to UTF-8.

How to fix discrepancy of character functions?

Example:
The name Calderón consists of 8 characters and contains the ó, which is encoded in UTF-8 as C3B3 (2 bytes).

The SAS length function reports the length of Calderón as 9, which is the number of bytes and not the number of characters.

The SAS klength function reports the length as 8 (characters), which is right.

Illustrating what happens with a code example:

data _null_ ;
  str1= "Calderón" ;
  s1=substr(str1,7,1) ; sl1=length(s1); l1=length(str1) ;
  s2=substr(str1,7,2) ; sl2=length(s2); l2=length(str1) ;
  put s1= / sl1= / l1= / ;
  put s2= / sl2= / l2= / ;

  s3=ksubstr(str1,7,1) ; sl3=klength(s3); l3=klength(str1) ;
  put s3= / sl3= / l3= / ;
run ;

Output: 
s1=�,sl1=1,l1=9 (Garbage in s1 string)
s2=ó,sl2=2,l2=9 (Correct representation using 2 bytes as length for s2)
s3=ó,sl3=1,l3=8 (Correct representation using K functions)

data _null_ ;
  str1= "Calderón" ;
  s1=substr(str1,1,8) ; sl1=length(s1); l1=length(str1) ;
  put s1= / sl1= / l1= / ;
  s3=ksubstr(str1,1,8) ; sl3=klength(s3); l3=klength(str1) ;
  put s3= / sl3= / l3= / ;
run ;

Output:
s1=Calderó, sl1=8, l1=9 (Truncated s1 string)
s3=Calderón, sl3=8, l3=8 (Correct representation using K functions)

data _null_ ;
  str1= "Calderón" ;
  s1=substr(str1,1,9) ; sl1=length(s1); l1=length(str1) ;
  put s1= / sl1= / l1= / ;
  s3=ksubstr(str1,1,8) ; sl3=klength(s3); l3=klength(str1) ;
  put s3= / sl3= / l3= / ;
run ;

Output:
s1=Calderón, sl1=9, l1=9 (Correct s1 string - using 9 length which is "impossible" to anticipate)
s3=Calderón, sl3=8, l3=8 (Correct representation using K functions)

In each example, the result in s3 is right because it uses the K functions.

The minimum size of the character variable in bytes should be the number returned by the length function in the UTF-8 session. If it is smaller, "Calderón" will be truncated to "Calderó" when moved to UTF-8 and SAS Viya.
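
One way to find out how much room a variable needs is to compare byte and character counts for every value. The following is a minimal sketch for a UTF-8 session; the data set work.names and the variable name are hypothetical and only serve as test data.

```sas
/* Hypothetical test data, created in a UTF-8 session */
data work.names;
  length name $ 20;
  name = "Calderón"; output;
  name = "Smith";    output;
run;

/* Count values with multi-byte characters and find the longest value */
data _null_;
  set work.names end=last;
  retain max_bytes 0;
  bytes = lengthn(name);                  /* bytes, trailing blanks excluded */
  chars = klength(name);                  /* characters                      */
  if bytes > chars then n_multibyte + 1;  /* value has multi-byte characters */
  max_bytes = max(max_bytes, bytes);
  if last then put n_multibyte= max_bytes=;
run;
```

Here max_bytes is the smallest length in bytes that holds every value of name without truncation. For data coming from SAS 9.4 you would read it through a CVP libname while scanning, so that nothing is cut off before it is measured.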

Before replacing all the original SAS string-handling functions with K functions, examine your SAS program. If the string function processes data that contains only single-byte characters, K functions are not necessary.

For example, strings containing XML tags do not require the use of K functions. Knowing the character data that is in your SAS programs and how it is processed can save unnecessary updates to your SAS code. The processing of binary data is not supported by the string-handling K functions, which expect strings to match the current session encoding. UTF-8 is the only SAS session encoding supported by SAS Viya. String functions are assigned I18N levels depending on whether the functions can process MBCS or SBCS data.

We have toolsets that can scan your datasets to identify if there are non-ascii characters in use and how much space the character variable needs to be expanded with. Please contact us for more information.

Summary:

This example shows that it is important to use UTF-8 encoding when migrating SAS data sets from SAS 9.4 to SAS Viya. UTF-8 encoding can represent a wider range of characters than Latin1 encoding, and it is the default encoding for SAS Viya. Therefore, using UTF-8 encoding will help to ensure that your data is migrated correctly and that it is compatible with SAS Viya.

More information can be found here:
https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/viyadatamig/titlepage.htm

Juletip #7 - Migrate SAS 9.4 Latin1 to SAS Viya4 UTF-8 encoding
