Optimizing your Datasets as you move towards SAS Viya: The Impact of UTF-8 Encoding
When moving datasets to SAS Viya, one crucial consideration is character encoding, as it directly impacts how your data is interpreted, stored, and accessed. SAS Viya uses UTF-8 encoding by default, while SAS 9 might use other encoding types (such as LATIN1, WLATIN1) depending on your system's location or configuration. This shift to UTF-8 in Viya can affect data migration in terms of table sizes, data truncation, and the retention of indexes. Here is what you need to know about UTF-8 encoding, its impact, and how to manage it effectively in SAS Viya.
What is UTF-8, and Why Does It Matter?
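UTF-8 is a universal character encoding that translates human-readable characters, including numbers and symbols, into a binary format that computers understand. Over time, multiple encoding methods emerged, creating a need for a standardized format to support seamless data exchanges across platforms and regions. UTF-8, which stands for Unicode Transformation Format 8-bit, became that standard, enabling compatibility and a wider range of characters than older encoding methods. It is a variable-width encoding: plain ASCII characters take one byte, while accented letters and other symbols take two to four bytes, which is why the same text can need more storage in UTF-8 than in a single-byte encoding such as LATIN1.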
For SAS 9 users, data may still be encoded in other formats, like LATIN1 or WLATIN1. Moving these datasets to SAS Viya, which operates with UTF-8, may lead to changes in how tables and data structures are interpreted. For example, an indexed table in LATIN1 encoding could appear without indexes when accessed in Viya. Understanding how your data is encoded before migration can save you from surprises and ensure data integrity.
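To make the size and truncation risk concrete, here is a minimal sketch. It assumes the code runs in a UTF-8 session and that the program file itself is saved as UTF-8; the variable names are illustrative only. LENGTH counts bytes while KLENGTH counts characters, so the two values differ as soon as a string contains an accented letter.

```sas
/* Assumes a UTF-8 session; in a LATIN1 session both counts would match. */
data _null_;
   length drink $ 20;          /* $ 20 reserves 20 bytes, not 20 characters */
   drink = 'piña colada';
   bytes = length(drink);      /* byte count, excluding trailing blanks */
   chars = klength(drink);     /* character count */
   put bytes= chars=;
run;
```

A character variable that was sized exactly for its longest LATIN1 value has no spare bytes for this growth, which is where truncation can come from.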
The Importance of Identifying Data Encoding
There are a couple of ways to find out the encoding of the datasets in your SAS libraries. One is the SASHELP.VTABLE view, which lists characteristics of every dataset in the libraries assigned in the current SAS session, including an ENCODING column. You can use Proc Freq to summarize that column and see how many different encoding types are present in the session.
Example: Proc Freq with SASHELP.VTABLE
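proc freq data=sashelp.vtable;
   title 'Summary of Encoding Types';
   /* One-way frequency of the ENCODING column; the counts are also saved to WORK.FreqCount */
   tables encoding / out=FreqCount;
run;
title; /* clear the title so it does not carry over to later output */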
You can also use Proc Contents to see the encoding of a specific dataset, as well as other important metadata. Either method identifies the encoding type so that you can decide how to move the dataset to SAS Viya. SASHELP.VTABLE gives a summary view, while Proc Contents is specific to a single dataset.
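The single-dataset check and a per-library listing could look like the sketch below. SASHELP.CLASS is used only as a stand-in that exists in every SAS installation; substitute your own libref and dataset name.

```sas
/* Single dataset: the Encoding attribute appears in the header section */
proc contents data=sashelp.class;
run;

/* One library: list each dataset with its encoding.              */
/* DICTIONARY.TABLES is the table behind the SASHELP.VTABLE view. */
proc sql;
   select memname, encoding
      from dictionary.tables
      where libname = 'SASHELP' and memtype = 'DATA';
quit;
```

Note that librefs are stored in uppercase in the dictionary tables, so the WHERE clause compares against 'SASHELP' rather than 'sashelp'.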
Suppose you are moving a LATIN1-encoded SAS dataset with indexes to SAS Viya. The first step is to confirm the encoding with Proc Contents, which reveals essential metadata about the dataset, including the encoding type. If your SAS Compute Context is not configured to match the dataset's encoding, SAS Viya may overlook indexes or other details that are critical for data accessibility and performance.
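To compare a dataset's encoding with the session you are running in, you can check the session encoding first. A minimal sketch, using the automatic macro variable SYSENCODING and the ENCODING system option:

```sas
/* Write the session encoding to the log, for example SYSENCODING=utf-8 */
%put &=sysencoding;

/* Show the value of the ENCODING system option in the log */
proc options option=encoding;
run;
```

If the value reported here differs from the encoding that Proc Contents shows for the dataset, the dataset is a candidate for conversion.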
Example: A LATIN1 Encoded Dataset in SAS Viya
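Uploading a LATIN1-encoded SAS dataset to a mounted drive accessible by SAS Viya might seem straightforward. However, upon running Proc Contents, you may find discrepancies in the metadata. For instance, the index, usually shown at the bottom of the variable list, might not appear. This typically happens because SAS reads a dataset whose encoding differs from the session encoding through Cross-Environment Data Access (CEDA), which does not support indexes. A missing index is therefore a sign that the dataset needs to be converted to UTF-8.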
Converting Data to UTF-8: The Solution
To resolve encoding mismatches, convert your data to UTF-8 using Proc Migrate. Proc Migrate not only updates the data to UTF-8 encoding but also retains essential elements like indexes and catalogs. Here is how it works:
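/* lat2 and lat1 are librefs that must already be assigned with LIBNAME statements: */
/* lat2 is the source library (the LATIN1 data in this example), lat1 the target.   */
proc migrate in=lat2 out=lat1;
run;
/* Check the migrated dataset: the encoding and the index should both be listed. */
proc contents data=lat1.carsind;
run;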
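Beyond reading the Proc Contents output, one way to confirm that indexes came across is to query DICTIONARY.INDEXES for the target library. This is a sketch that assumes the lat1 libref from the example above is still assigned:

```sas
/* List every indexed column in the migrated library */
proc sql;
   select memname, name, indxname, idxusage
      from dictionary.indexes
      where libname = 'LAT1';
quit;
```

An empty result for a dataset that was indexed in the source library suggests the index was not carried over and needs to be rebuilt.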
Key Takeaways for a Smooth Migration
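When preparing data for SAS Viya, first identify how each dataset is encoded: SASHELP.VTABLE gives a session-wide summary, and Proc Contents gives the detail for a single dataset. Convert anything that is not UTF-8 with Proc Migrate, then run Proc Contents again to confirm that the encoding changed and the indexes are still there.
Transitioning to UTF-8 encoding in SAS Viya can be challenging, but these tools smooth the process. By understanding your data's encoding before you migrate, you help ensure that your datasets are accurately represented, complete with indexes, and ready for efficient use in SAS Viya.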
Additional resources regarding moving datasets to SAS Viya:
SAS Help Center: Migrating Data to UTF-8 Encoding
SAS Help Center: Migrating with Direct Access and No Incompatible Catalogs
The SAS Encoding Journey: A Byte at a Time
@LouGalway_sas Thanks for sharing these important considerations. From a Europe-based perspective they are especially relevant, since accented letters (à la carte dishes, née Smith maiden name, piña colada cocktail recipe, etc.) and diacritics are ubiquitous in non-English languages (German umlauts, French cedillas, Spanish tildes, Romanian ă/î, to name only a few). Every European written language may use such modified Latin letters, some very commonly.
SAS treats accented letters differently depending on whether you run an ASCII/ISO-encoded session or a UTF-8 session. The following page explains it all: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/nlsref/p1pca7vwjjwucin178l8qddjn0gi.htm
Therefore, migrating data sets from ASCII/ISO charsets (Latin1, Latin9, WLATIN1), in addition to "affect[ing] data migration in terms of table sizes, data truncation, and the retention of indexes", also directly affects plain SAS code: baseline syntax must be modified to take into account the presence of accented characters in CHAR variables, either by removing accents altogether (if such alterations are acceptable) or by replacing string functions with their K-function counterparts.
This is very challenging indeed! Perhaps extending the native SAS toolbox would help the transition towards UTF-8: providing an extended attribute in the descriptor portion of the data set that stores a binary 1/0 flag (1 = at least one CHAR variable with multibyte content), making the K functions unnecessary (generalizing the VARCHAR type to the V9 engine might help greatly in this regard), providing NLS procs to detect in bulk the need for so-called internationalization (proc WhereIsMyMultiBytesVar data= ...), etc.