Optimizing your Datasets as you move towards SAS Viya: The Impact of UTF-8 Encoding
When moving datasets to SAS Viya, one crucial consideration is character encoding, as it directly impacts how your data is interpreted, stored, and accessed. SAS Viya uses UTF-8 encoding by default, while SAS 9 might use other encoding types (such as LATIN1, WLATIN1) depending on your system's location or configuration. This shift to UTF-8 in Viya can affect data migration in terms of table sizes, data truncation, and the retention of indexes. Here is what you need to know about UTF-8 encoding, its impact, and how to manage it effectively in SAS Viya.
What is UTF-8, and Why Does It Matter?
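UTF-8 is a universal character encoding that translates human-readable characters, including numbers and symbols, into a binary format that computers understand. Over time, multiple encoding methods emerged, creating a need for a standardized format to support seamless data exchange across platforms and regions. UTF-8, which stands for Unicode Transformation Format 8-bit, became that standard, offering compatibility and a far wider range of characters than older encodings. It is also a variable-width encoding: plain ASCII characters take one byte, while accented and other non-ASCII characters take two to four bytes. This is why character data can grow in size, or be truncated, when it moves from a single-byte encoding such as LATIN1.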
For SAS 9 users, data may still be encoded in other formats, like LATIN1 or WLATIN1. Moving these datasets to SAS Viya, which operates with UTF-8, may lead to changes in how tables and data structures are interpreted. For example, an indexed table in LATIN1 encoding could appear without indexes when accessed in Viya. Understanding how your data is encoded before migration can save you from surprises and ensure data integrity.
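To make the size difference concrete, here is a minimal sketch that compares the byte length and the character count of one string. It assumes the session encoding is UTF-8 and that the program file itself is saved as UTF-8; the variable names are purely illustrative.

```sas
/* Sketch: byte length versus character count for the same value.   */
/* Assumption: the session encoding is UTF-8 and this program is     */
/* saved as UTF-8. Variable names are illustrative only.             */
data _null_;
   length name $ 20;
   name  = 'Müller';
   bytes = length(name);    /* LENGTH counts bytes, ignoring trailing blanks */
   chars = klength(name);   /* KLENGTH counts characters                     */
   put name= bytes= chars=;
run;
```

In LATIN1 the letter ü occupies one byte, while in UTF-8 it occupies two, so under the assumptions above the byte count should come out one higher than the character count. A column sized exactly for the LATIN1 value would therefore be one byte too short for the same value in UTF-8.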
The Importance of Identifying Data Encoding
There are a couple of ways to find the encoding of the datasets in your SAS libraries. One way is to query SASHELP.VTABLE, a view that lists various characteristics of the datasets in the libraries assigned in the current SAS session. You can use Proc Freq to summarize the encoding column, which shows how many different encoding types are present and how many datasets use each one.
Example: Proc Freq with SASHELP.VTABLE
/* Count how many datasets use each encoding across the assigned libraries */
proc freq data=sashelp.vtable;
table encoding / out=FreqCount outexpect sparse;
title 'Summary of Encoding Types';
run;
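Once the summary shows that non-UTF-8 datasets exist, it can help to list exactly which ones they are. The following is a sketch only: MYLIB is a hypothetical libref, and the filter assumes the encoding value contains the text "utf-8" for UTF-8 datasets, which is worth confirming against your own Proc Freq output.

```sas
/* Sketch: list datasets in one library that are not UTF-8 encoded.  */
/* MYLIB is a hypothetical libref; replace it with your own.         */
/* Assumption: UTF-8 datasets carry "utf-8" in the encoding value.   */
proc sql;
   select libname, memname, encoding
      from dictionary.tables
      where libname = 'MYLIB'
        and memtype = 'DATA'
        and index(lowcase(encoding), 'utf-8') = 0;
quit;
```

DICTIONARY.TABLES is the table behind SASHELP.VTABLE, so both expose the same encoding column. Librefs are stored in uppercase there, which is why the filter value is written as 'MYLIB'.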
You can also use Proc Contents to see the encoding of a specific dataset, as well as other important metadata. Either method identifies the encoding type so that you can decide how to move the dataset to SAS Viya. SASHELP.VTABLE gives a summary view across libraries, while Proc Contents is specific to a single dataset.
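A minimal sketch of the single-dataset check follows, together with a way to display the encoding of the current session for comparison. MYLIB.MYDATA is a hypothetical dataset name used only for illustration.

```sas
/* Sketch: compare the session encoding with a dataset's encoding.   */
/* MYLIB.MYDATA is a hypothetical dataset name.                      */

/* Write the current session encoding to the log */
proc options option=encoding;
run;

/* Show dataset attributes, including its encoding and any indexes */
proc contents data=mylib.mydata;
run;
```

If the encoding reported by Proc Contents differs from the session encoding, the dataset is a candidate for conversion before it is used in SAS Viya.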
Suppose you are moving a LATIN1-encoded SAS dataset with indexes to SAS Viya. The first step is to confirm the encoding by running Proc Contents, which reveals essential metadata about the dataset, including the encoding type. If your SAS Compute Context is not configured to match the dataset's encoding, SAS Viya may overlook indexes or other details that are critical for data accessibility and performance.
Example: A LATIN1 Encoded Dataset in SAS Viya
Uploading a LATIN1 encoded SAS dataset to a mounted drive accessible by SAS Viya might seem straightforward. However, upon running Proc Contents, you may find discrepancies in the metadata. For instance, the index, usually shown at the bottom of the variable list, might not appear, indicating the need to convert the encoding to UTF-8.
Converting Data to UTF-8: The Solution
To resolve encoding mismatches, convert your data to UTF-8 using Proc Migrate. Proc Migrate not only updates the data to UTF-8 encoding but also retains essential elements like indexes and catalogs. Here is how it works:
Run Proc Migrate on the SAS library containing the dataset.
Execute Proc Contents again to verify the conversion.
You should now see the dataset in UTF-8 encoding with indexes intact.
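/* Migrate every member of the source library (lat2) into the target library (lat1) */
proc migrate in=lat2 out=lat1;
run;
/* Confirm the encoding and check that the index is listed */
proc contents data=lat1.carsind;
run;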
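As a complement to reading the Proc Contents output, the indexes can also be listed with a query. This is a sketch that reuses the LAT1 libref and CARSIND dataset from the example above and assumes the migration has already been run.

```sas
/* Sketch: list the indexes defined on the migrated dataset.         */
/* Reuses the LAT1 libref and CARSIND dataset from the example.      */
/* One row is returned per indexed column.                           */
proc sql;
   select libname, memname, indxname, name, idxusage
      from dictionary.indexes
      where libname = 'LAT1'
        and memname = 'CARSIND';
quit;
```

If the query returns no rows, no index is defined on the dataset, which would indicate that the index did not come across.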
Key Takeaways for a Smooth Migration
When preparing data for SAS Viya, follow these steps to ensure a smooth migration:
**Understand Data Encoding**: Use Proc Contents to check the current encoding and identify if there is a need for conversion.
**Choose the Right Conversion Method**: Proc Migrate is the preferred tool for migrating datasets to UTF-8 as it preserves indexes, catalogs, and other essential metadata.
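**Avoid Data Truncation**: If a dataset is not migrated, any data saved in Viya after it is accessed will default to UTF-8. Because a character that occupies one byte in LATIN1 can need two or more bytes in UTF-8, values that fill a fixed-length character column may be truncated. The CVP engine can pad character variable lengths during this process to prevent truncation.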
**Ensure Index Integrity**: Moving data without using Proc Migrate might strip indexes and catalogs, resulting in incomplete metadata. Converting data with Proc Migrate maintains these elements for efficient querying and analysis.
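The CVP padding mentioned under Avoid Data Truncation can be sketched as follows. The librefs and paths are hypothetical, and the multiplier of 2 is an assumption chosen because LATIN1 characters need at most two bytes in UTF-8; other source encodings may need a larger value.

```sas
/* Sketch: read a LATIN1 library through the CVP engine so that      */
/* character variable lengths are expanded before the data is        */
/* written out in the UTF-8 session encoding.                        */
/* SRC, TGT, and both paths are hypothetical.                        */
libname src cvp '/data/sas9/latin1' cvpmultiplier=2;  /* read-only source */
libname tgt '/data/viya/utf8';                        /* UTF-8 target     */

data tgt.carsind;
   set src.carsind;
run;
```

Note that a DATA step copy like this one does not carry indexes over to the new dataset, so they would need to be re-created afterward. That is one reason the article favors Proc Migrate when indexes matter.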
Transitioning to UTF-8 encoding in SAS Viya can be challenging, but there are options that make the process smoother. By understanding your data's encoding and using tools like Proc Migrate, you can ensure that your datasets are accurately represented, complete with indexes, and ready for efficient use in SAS Viya.
Additional resources regarding moving datasets to SAS Viya:
SAS Help Center: Migrating Data to UTF-8 Encoding
SAS Help Center: Migrating with Direct Access and No Incompatible Catalogs
The SAS Encoding Journey: A Byte at a Time