Data Encoding Impacts in SAS Viya

Optimizing your Datasets as you move towards SAS Viya: The Impact of UTF-8 Encoding

When moving datasets to SAS Viya, one crucial consideration is character encoding, as it directly impacts how your data is interpreted, stored, and accessed. SAS Viya uses UTF-8 encoding by default, while SAS 9 might use other encoding types (such as LATIN1, WLATIN1) depending on your system's location or configuration. This shift to UTF-8 in Viya can affect data migration in terms of table sizes, data truncation, and the retention of indexes. Here is what you need to know about UTF-8 encoding, its impact, and how to manage it effectively in SAS Viya.

What is UTF-8, and Why Does It Matter?

UTF-8 is a universal character encoding that translates human-readable characters—including numbers and symbols—into a binary format that computers understand. Over time, multiple encoding methods emerged, creating a need for a standardized format to support seamless data exchanges across platforms and regions. UTF-8, which stands for Unicode Transformation Format 8-bit, became that standard, enabling compatibility and a wider range of characters than older encoding methods.

For SAS 9 users, data may still be encoded in other formats, like LATIN1 or WLATIN1. Moving these datasets to SAS Viya, which operates with UTF-8, may lead to changes in how tables and data structures are interpreted. For example, an indexed table in LATIN1 encoding could appear without indexes when accessed in Viya. Understanding how your data is encoded before migration can save you from surprises and ensure data integrity.
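To make the size impact concrete, the following is a minimal sketch (not part of the original workflow) that compares the byte length and the character length of a string containing accented letters. The sample text and variable names are illustrative assumptions. In a UTF-8 session each of these accented letters occupies two bytes, so the byte count would be expected to exceed the character count, whereas in a LATIN1 session the two counts would match.

```sas
/* Sketch: compare byte length and character length of one string.  */
/* The result depends on the encoding of the session that runs it.  */
data _null_;
	length txt $ 40;
	txt = 'Crème brûlée';        /* illustrative sample value */
	bytes = length(txt);         /* length in bytes, trailing blanks ignored */
	chars = klength(txt);        /* length in characters */
	put "Session encoding: &sysencoding";
	put bytes= chars=;
run;
```

A column sized for the character count in a single-byte encoding can therefore be too short for the same value in UTF-8, which is where truncation comes from.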

The Importance of Identifying Data Encoding

There are a couple of ways to find out the encoding of the datasets in your SAS libraries. One way is to query SASHELP.VTABLE, a view that describes the characteristics of the datasets available in the current SAS session. You can use Proc Freq to summarize the encoding column, which shows how many different encoding types are present in the session and how many datasets use each one.

Example: Proc Freq with SASHELP.VTABLE

proc freq data=sashelp.vtable;
	/* One-way frequency of encodings; counts are also saved to WORK.FreqCount */
	tables encoding / out=FreqCount;
	title 'Summary of Encoding Types';
run;

You can also use Proc Contents to see the encoding of a specific dataset as well as other important metadata. Either method identifies the encoding type so that you can decide how to move the dataset to SAS Viya. SASHELP.VTABLE gives a summary view, whereas Proc Contents is specific to a single dataset.
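As a minimal sketch of these checks, the code below first displays the encoding of the current session, then the metadata of a single dataset, and finally the encoding of every dataset in one library. The libref MYLIB and the dataset name MYTABLE are placeholders, not names from this article.

```sas
/* Display the encoding of the current SAS session */
proc options option=encoding;
run;

/* Show metadata, including the encoding, for one dataset.   */
/* MYLIB.MYTABLE is a placeholder; substitute your own names. */
proc contents data=mylib.mytable;
run;

/* List the encoding of each dataset in a single library */
proc sql;
	select memname, encoding
	from dictionary.tables
	where libname = 'MYLIB' and memtype = 'DATA';
quit;
```

Comparing the session encoding with the dataset encoding shows at a glance whether transcoding will take place when the data is read.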

Suppose you are moving a LATIN1-encoded SAS dataset with indexes to SAS Viya. The first step is to confirm the encoding format by using the Proc Contents procedure, which reveals essential metadata about the dataset, including encoding type. If your SAS Compute Context is not configured to match the dataset's encoding, SAS Viya may overlook indexes or other details—critical for data accessibility and performance.

Example: A LATIN1 Encoded Dataset in SAS Viya

Uploading a LATIN1 encoded SAS dataset to a mounted drive accessible by SAS Viya might seem straightforward. However, upon running Proc Contents, you may find discrepancies in the metadata. For instance, the index, usually shown at the bottom of the variable list, might not appear, indicating the need to convert the encoding to UTF-8.

Converting Data to UTF-8: The Solution

To resolve encoding mismatches, convert your data to UTF-8 using the Proc Migrate procedure. Proc Migrate not only updates the data to UTF-8 encoding but also retains essential elements like indexes and catalogs. Here is how it works:

Run Proc Migrate on the SAS library containing the dataset.
Execute Proc Contents again to verify the conversion.
You should now see the dataset in UTF-8 encoding with indexes intact.

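/* Assumes librefs LAT2 (source library) and LAT1 (target library) are already assigned */
proc migrate in=lat2 out=lat1;
run;

/* Verify the encoding and the indexes of the migrated dataset */
proc contents data=lat1.carsind;
run;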

Key Takeaways for a Smooth Migration

When preparing data for SAS Viya, follow these steps to ensure a smooth migration:

**Understand Data Encoding**: Use Proc Contents to check the current encoding and identify if there is a need for conversion.
**Choose the Right Conversion Method**: Proc Migrate is the preferred tool for migrating datasets to UTF-8 as it preserves indexes, catalogs, and other essential metadata.
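**Avoid Data Truncation**: A dataset that is not migrated is written in UTF-8 when it is saved in Viya after being accessed, and values can be truncated if a character column is too short for the multi-byte representation. The CVP engine can expand character column lengths during this process to prevent truncation.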
**Ensure Index Integrity**: Moving data without using Proc Migrate might strip indexes and catalogs, resulting in incomplete metadata. Converting data with Proc Migrate maintains these elements for efficient querying and analysis.
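The CVP padding mentioned above can be sketched as follows, assuming the code runs in a UTF-8 session. The paths and the librefs SRC and TGT are placeholders, and the multiplier of 2.5 is only a starting point that may need to be raised for data with many multi-byte characters.

```sas
/* Read the source library through the CVP engine so that      */
/* character column lengths are expanded when the data is read. */
libname src cvp '/path/to/latin1/data' cvpmultiplier=2.5;

/* Create new datasets in this library with UTF-8 encoding */
libname tgt '/path/to/utf8/data' outencoding='UTF-8';

/* NOCLONE lets the copies take the output library's encoding */
/* instead of inheriting the attributes of the source dataset. */
proc copy in=src out=tgt noclone;
	select carsind;
run;

/* Check encoding, column lengths, and indexes of the copy */
proc contents data=tgt.carsind;
run;
```

Running Proc Contents on the result is a simple way to confirm that the encoding, the expanded lengths, and any indexes came through as expected.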

Transitioning to UTF-8 encoding in SAS Viya can be challenging, but there are options that smooth the process. By understanding your data's encoding and leveraging tools like Proc Migrate, you can ensure that your datasets are accurately represented, complete with indexes, and ready for efficient use in SAS Viya.

Additional resources regarding moving datasets to SAS Viya:

SAS Help Center: Migrating Data to UTF-8 Encoding

SAS Help Center: Migrating with Direct Access and No Incompatible Catalogs

The SAS Encoding Journey: A Byte at a Time

ronan · ‎11-22-2024

@LouGalway_sas Thanks for sharing these important considerations, which are especially relevant from a European perspective, since accented letters (à la carte dishes, née Smith maiden name, piña colada cocktail recipe, etc.) and diacritics are ubiquitous in non-English languages (German umlaut, French cedilla, Spanish tilde, Romanian ă/î, to name only a few: every European written language might use such modified Latin letters, some very commonly).

SAS treats accented letters differently depending on whether you run an ASCII/ISO encoding session or a UTF-8 session. The following page explains it all: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/nlsref/p1pca7vwjjwucin178l8qddjn0gi.htm

Therefore, migrating data sets from ASCII charsets (Latin1, Latin9, WLATIN1), in addition to "affect[ing] data migration in terms of table sizes, data truncation, and the retention of indexes", also directly affects plain SAS code: baseline syntax must be modified to take into account the presence of accented characters in CHAR variables, by removing accents altogether (if such alterations are acceptable) or by replacing string functions with their k-counterparts.

This is very challenging indeed! Perhaps extending the native toolbox of SAS might help the transition towards UTF-8: providing an extended attribute in the descriptor portion of the data set to store a binary flag (1 = at least one CHAR variable with multi-byte values, 0 otherwise), making k-functions unnecessary (generalizing the VARCHAR type to the V9 engine might help greatly in this regard), providing NLS procedures to detect in bulk the need for so-called internationalization (proc WhereIsMyMultiBytesVar data= ...), etc.

Massimo_Fabris · ‎03-05-2025

Hi @ronan, you are right. The need to use k-functions is often overlooked.

I also stress that using proc migrate is fine but does not solve the potential truncation problem.

From current SAS Documentation (2025.02):
"When you migrate a data set to an encoding where the characters are represented by more bytes, truncation might occur if the column length does not accommodate the larger character size. For example, a character might be represented in Wlatin1 encoding as one byte but in UTF-8 as two bytes. The best solution is to expand the column length with the CVP engine and PROC COPY before you migrate. (PROC MIGRATE does not currently support the CVP engine.) The CVPMULTIPLIER=2.5 value is usually sufficient to avoid truncation. If your data contains Asian characters, CVPMULTIPLIER=4 is recommended. "

The macro %COPY_TO_NEW_ENCODING performs a precise conversion by lengthening only the character variables that need it, but it is very heavy to run.

The %COPY_TO_NEW_ENCODING macro creates a new version of a data set with a specified encoding.

If the data set contains character variables whose values need larger lengths when transcoded to the specified encoding, then the DATA step creates a data set with the proper lengths.

If the %COPY_TO_NEW_ENCODING macro is not used, the copy might fail because truncation of non-blanks is not allowed.
https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/nlsref/p1g1d26os4w0von1cdfh827foo3r.htm

ronan · ‎03-07-2025

Hi @Massimo_Fabris ,

Thanks for completing my post and sharing these important conversion tools ... and their limitations, based on practice. I agree with you: this tooling, though readily available, is far from perfect. I think we are missing important built-in features, like procedures, engines or variable types, to properly make the journey to UTF-8. Contrary to some official guidance, I do not recommend that my (French) customers embrace UTF-8 unequivocally, but rather selectively, even though UTF-8 is the default (SPRE, Compute) or mandatory (CAS, Parquet) in Viya. As we can see in the example below, the CAS engine running a DATA step treats CHAR and VARCHAR variables differently as regards length units: bytes for CHAR and characters for VARCHAR, so that VARCHAR behaves like CHAR does in SBCS encoding sessions:

Index CHAR and VARCHAR Character Strings

The latest stable build hints that this is a favored direction: Critical Change: Width for VARCHAR Formats

Extending the VARCHAR type to the V9 engine might then provide an elegant solution to the UTF-8 issue, perhaps.

Massimo_Fabris · ‎03-07-2025

Hi @ronan,
thank you for posting this very interesting documentation about CHAR and VARCHAR.

I need to study this new stuff!
