Data Encoding Impacts in SAS Viya


Optimizing your Datasets as you move towards SAS Viya: The Impact of UTF-8 Encoding

When moving datasets to SAS Viya, one crucial consideration is character encoding, because it determines how your data is interpreted, stored, and accessed. SAS Viya uses UTF-8 encoding by default, while SAS 9 sessions often use other encodings (such as LATIN1 or WLATIN1), depending on the system's locale and configuration. This shift to UTF-8 can affect data migration in three ways: table sizes, data truncation, and the retention of indexes. Here is what you need to know about UTF-8 encoding, its impact, and how to manage it effectively in SAS Viya.

What is UTF-8, and Why Does It Matter?

UTF-8 is a character encoding for Unicode: it maps human-readable characters, including digits and symbols, to the bytes that computers store. Over time, many region-specific encodings emerged, creating a need for a standardized format to support seamless data exchange across platforms and regions. UTF-8, which stands for Unicode Transformation Format 8-bit, became that standard and covers a far wider range of characters than older encodings. It is a variable-width encoding: ASCII characters take one byte, while accented letters and other non-ASCII characters take two to four bytes. That difference is why a value that fits a character variable in LATIN1 can need more room in UTF-8.
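As a rough illustration of the byte-versus-character distinction, the sketch below compares LENGTH (which counts bytes) with KLENGTH (which counts characters) for one sample value. The sample text and variable names are made up for illustration, and the result depends on the session encoding and on the program file being read in that encoding. In a UTF-8 session the two counts would be expected to differ by one, since é is stored as two bytes there.

```sas
/* Sketch only: sample value and variable names are illustrative. */
data _null_;
   length s $ 12;
   s = 'née';               /* text with one accented letter */
   bytes = length(s);       /* LENGTH counts bytes */
   chars = klength(s);      /* KLENGTH counts characters */
   put bytes= chars=;
   put s= $hex12.;          /* first six stored bytes, in hexadecimal */
run;
```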

For SAS 9 users, data may still be encoded in other formats, like LATIN1 or WLATIN1. Moving these datasets to SAS Viya, which operates with UTF-8, may lead to changes in how tables and data structures are interpreted. For example, an indexed table in LATIN1 encoding could appear without indexes when accessed in Viya. Understanding how your data is encoded before migration can save you from surprises and ensure data integrity.

The Importance of Identifying Data Encoding

There are a couple of ways to find the encoding of the datasets in your SAS libraries. One is the SASHELP.VTABLE view, which lists characteristics of every dataset in the libraries assigned in the current SAS session. You can use Proc Freq to summarize its ENCODING column and see which encodings are present and how many datasets use each.

Example: Proc Freq with SASHELP.VTABLE

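/* Count the datasets in the session by encoding. */
proc freq data=sashelp.vtable;
	table encoding / out=FreqCount;   /* counts are also saved to WORK.FreqCount */
	title 'Summary of Encoding Types';
run;
title;   /* clear the title */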

You can also use Proc Contents to see the encoding of a specific dataset, as well as other important metadata. Either method tells you the encoding so you can decide how to move the dataset to SAS Viya. SASHELP.VTABLE gives a summary view across libraries, while Proc Contents is specific to a single dataset.

Suppose you are moving a LATIN1-encoded SAS dataset with indexes to SAS Viya. The first step is to confirm the encoding format by using the Proc Contents procedure, which reveals essential metadata about the dataset, including encoding type. If your SAS Compute Context is not configured to match the dataset's encoding, SAS Viya may overlook indexes or other details—critical for data accessibility and performance.
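A minimal sketch of that first check might look like the following. It assumes a libref named MYLIB (hypothetical) is already assigned to the folder holding the dataset; the rest uses the SYSENCODING automatic macro variable, the ENCODING system option, and the DICTIONARY.TABLES view that sits behind SASHELP.VTABLE.

```sas
/* Encoding of the current compute session. */
%put NOTE: Session encoding is &sysencoding;

proc options option=encoding;
run;

/* Encoding and index type for each dataset in one library.
   MYLIB is a hypothetical libref; LIBNAME values in
   DICTIONARY.TABLES are stored in uppercase. */
proc sql;
   select memname, encoding, indxtype
      from dictionary.tables
      where libname = 'MYLIB' and memtype = 'DATA';
quit;
```

If the dataset encoding listed here differs from the session encoding, that is the mismatch to resolve before relying on the table in Viya.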

Example: A LATIN1 Encoded Dataset in SAS Viya

Uploading a LATIN1 encoded SAS dataset to a mounted drive accessible by SAS Viya might seem straightforward. However, upon running Proc Contents, you may find discrepancies in the metadata. For instance, the index, usually shown at the bottom of the variable list, might not appear, indicating the need to convert the encoding to UTF-8.

Converting Data to UTF-8: The Solution

To resolve encoding mismatches, convert your data to UTF-8 using the Proc Migrate procedure. Proc Migrate not only updates the data to UTF-8 encoding but also retains essential elements like indexes and catalogs. Here is how it works:

1. Run Proc Migrate on the SAS library containing the dataset.
2. Execute Proc Contents again to verify the conversion.
3. You should now see the dataset in UTF-8 encoding with indexes intact.

/* lat2 = source library (original encoding), lat1 = target library */
proc migrate in=lat2 out=lat1;
run;

proc contents data=lat1.carsind;
run;
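To confirm index retention across the whole target library rather than one table at a time, a query along these lines could be used. It is a sketch that assumes the LAT1 libref from the example above is assigned, and it reads the DICTIONARY.INDEXES view.

```sas
/* List every index found in the target library after migration. */
proc sql;
   select memname, indxname, name, unique
      from dictionary.indexes
      where libname = 'LAT1';
quit;
```

An empty result for a table that was indexed in the source library suggests the index did not survive the move.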

Key Takeaways for a Smooth Migration

When preparing data for SAS Viya, follow these steps to ensure a smooth migration:

**Understand Data Encoding**: Use Proc Contents to check the current encoding and identify if there is a need for conversion.
**Choose the Right Conversion Method**: Proc Migrate is the preferred tool for migrating datasets to UTF-8 as it preserves indexes, catalogs, and other essential metadata.
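**Avoid Data Truncation**: If a dataset is not migrated first, saving it in Viya after opening it writes it in UTF-8, and character values that need more bytes than the variable length allows can be truncated. The CVP engine can pad character variable lengths as the data is read to prevent truncation.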
**Ensure Index Integrity**: Moving data without using Proc Migrate might strip indexes and catalogs, resulting in incomplete metadata. Converting data with Proc Migrate maintains these elements for efficient querying and analysis.
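As a sketch of the CVP padding mentioned above, the library below is assigned through the CVP engine so that character variable lengths are expanded as the data is read. The path, the SRC libref, and the multiplier value are assumptions for illustration; CARSIND is the dataset name used earlier in this article.

```sas
/* Hypothetical path: the folder that holds the LATIN1 datasets. */
libname src cvp '/mnt/data/source_latin1' cvpmultiplier=2;

/* Read through CVP, character lengths are reported already expanded. */
proc contents data=src.carsind;
run;
```

A CVP libref is read-only by default, so the expanded data has to be written to a separate output library.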

Transitioning to UTF-8 encoding in SAS Viya can be challenging, but there are tools that smooth the process. By understanding your data’s encoding and using tools like Proc Migrate, you can ensure that your datasets are accurately represented, complete with indexes, and ready for efficient use in SAS Viya.

Additional resources regarding moving datasets to SAS Viya:

SAS Help Center: Migrating Data to UTF-8 Encoding

SAS Help Center: Migrating with Direct Access and No Incompatible Catalogs

The SAS Encoding Journey: A Byte at a Time

ronan · ‎11-22-2024

@LouGalway_sas Thanks for sharing these important considerations, which are especially relevant from a European perspective, since accented letters (à la carte dishes, née Smith maiden name, piña colada cocktail recipe, etc.) and diacritics are ubiquitous in non-English languages (German umlaut, French cedilla, Spanish tilde, Romanian ă/î, to name only a few: every European written language may use such modified Latin letters, some very commonly).

SAS treats accented letters differently depending on whether you run an ASCII/ISO encoding session or a UTF-8 session. The following page explains it all: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/nlsref/p1pca7vwjjwucin178l8qddjn0gi.htm

Therefore, migrating data sets from ASCII charsets (Latin1, Latin9, WLATIN1), in addition to "affect[ing] data migration in terms of table sizes, data truncation, and the retention of indexes", also directly affects plain SAS code: baseline syntax must be modified to account for accented characters in CHAR variables, either by removing accents altogether (if such alterations are acceptable) or by replacing string functions with their K-counterparts.

This is very challenging indeed! Perhaps extending the native SAS toolbox would help the transition to UTF-8: providing an extended attribute in the descriptor portion of the data set to store a binary flag 1/0 (1 = at least one CHAR variable with multi-byte length), sparing K-functions altogether (generalizing the VARCHAR type to the V9 engine might help greatly in this regard), providing NLS procs to detect in bulk the need for so-called internationalization (proc WhereIsMyMultiBytesVar data= ...), etc.
