The Wild World of Character Encoding, and How it Inconvenienced Me

In this post you’re going to learn way too much about character encoding. You’ll see how it pertains to SAS, what it even is, and much more. All of this was inspired by an interesting change I found at the tail end of creating the Implementing SAS Grid 9.4M9 workshop (coming soon!!), which we’ll also go over if we have time (we do).

Character Encoding: What is it?

Character encoding is a very simple concept. It’s the way in which we transcribe characters (abc, 123, *&%, (づ｡◕‿‿◕｡)づ, etc.) into formats a computer can understand. Basically, computers only understand 0/1 or on/off, and we’d like them to know about things like the letter “A”, so we have to “encode” that information in some way in memory.
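To make that concrete, here is a minimal sketch in Python (used purely for illustration; the helper name to_bits is made up for this example). Every character maps to a number, and that number is what gets stored as bits.

```python
def to_bits(text: str, encoding: str = "ascii") -> list[str]:
    """Return each encoded byte of `text` as an 8-character bit string."""
    return [format(byte, "08b") for byte in text.encode(encoding)]

# "A" is character number 65, which is stored as the bit pattern 01000001
print(ord("A"))        # 65
print(to_bits("A"))    # ['01000001']
print(to_bits("abc"))  # ['01100001', '01100010', '01100011']
```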

How Do We “Encode” These “Characters”?

Short answer: text encoding standards.

Here’s a short list of ones you might’ve heard of:

ASCII
Latin-1
UTF-8

Here’s a longer list of ones you probably haven’t heard of:

KOI8-R
IBM EBCDIC
UTF-32
MacRoman
Shift JIS
Big5

ASCII is a classic, quite literally, and it’s the easiest to show in a chart because it fits neatly onto one. This is unlike the more complicated text encoding standards we’ll get to in a moment.


As you can see, each character on the keyboard has a 7-bit value associated with it to allow the computer to comprehend alphanumerics (and beyond). ASCII uses every possible 7-bit pattern: 128 values in all, control characters included. It’s simple, it’s elegant, and it’s unprepared for the realities of this cruel world. Add other Latin-based languages into the mix and we get pesky accented letters and a whole host of other complications. To remedy this, we got Latin-1! Latin-1 character encoding is 1 byte (8 bits) long; this single extra bit over ASCII allows us to double the number of encodable characters (from 128 in ASCII to 256 in Latin-1).
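A quick way to see the 7-bit versus 8-bit difference is to try encoding an accented letter both ways. This is only a sketch in Python; the helper name fits is hypothetical.

```python
# 7 bits give 2**7 = 128 values; 8 bits give 2**8 = 256
ascii_size = 2 ** 7
latin1_size = 2 ** 8

def fits(text: str, encoding: str) -> bool:
    """True if every character in `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

word = "Espa\u00f1a"  # "España", with n-tilde (U+00F1)

print(ascii_size, latin1_size)      # 128 256
print(fits("Espana", "ascii"))      # True
print(fits(word, "ascii"))          # False: n-tilde is outside the 7-bit range
print(fits(word, "latin-1"))        # True
print("\u00f1".encode("latin-1"))   # b'\xf1', a single byte with the 8th bit set
```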

Here’s another chart, this time just showing a handful of the new possible characters in Latin-1:

As we got more expressive, brought in more languages, and looked for universal standardization, UTF-8 became the answer (and it’s still the most widely used text encoding standard to this day!). UTF-8 is what’s called a “variable-width encoding”, which means it can represent all of ASCII in a single byte each (those same 7 bits with a leading 0), but can scale up to as much as 32 bits (4 bytes) depending on what it has to represent.
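Here is a small Python sketch of that scaling, counting how many bytes UTF-8 needs for a few sample characters. The helper name utf8_width is made up for this example.

```python
def utf8_width(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))

# A, N-with-tilde, euro sign, grinning-face emoji
for ch in ["A", "\u00d1", "\u20ac", "\U0001f600"]:
    print(f"U+{ord(ch):04X} -> {utf8_width(ch)} byte(s)")
# U+0041 -> 1 byte(s)
# U+00D1 -> 2 byte(s)
# U+20AC -> 3 byte(s)
# U+1F600 -> 4 byte(s)

# Plain ASCII text is byte-for-byte identical in UTF-8
print("SAS Grid".encode("ascii") == "SAS Grid".encode("utf-8"))  # True
```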

As an example: this “☝️” emoji (code point U+261D) is expressed as “11100010 10011000 10011101” in UTF-8. UTF-8 scales up to 24 bits in this case to accommodate the strange new character. We’ve got backwards compatibility here with ASCII, but Latin-1 is not so lucky. Latin-1 uses the 8th (highest) bit of the byte to express 128 new characters, but UTF-8 needs that bit to mark the bytes of a multi-byte sequence. A special character like the accented Ñ can be expressed by both text encoding standards, but the encoded bytes look different. This means moving a document from Latin-1 to UTF-8 and vice versa can be messy if you’ve got special characters.
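The mismatch is easy to reproduce. The Python sketch below encodes Ñ both ways and then reads each result with the wrong encoding; it only illustrates the general problem and is not anything SAS itself runs.

```python
n_tilde = "\u00d1"  # N with tilde

latin1_bytes = n_tilde.encode("latin-1")
utf8_bytes = n_tilde.encode("utf-8")
print(latin1_bytes)  # b'\xd1'      (one byte)
print(utf8_bytes)    # b'\xc3\x91'  (two bytes)

# Reading the UTF-8 bytes as Latin-1 yields two wrong characters instead of one
print(len(utf8_bytes.decode("latin-1")))  # 2

# Reading the Latin-1 byte as strict UTF-8 fails outright
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print("cannot decode:", err.reason)

# The pointing-up emoji (U+261D) from the paragraph above, as UTF-8 bits
bits = " ".join(format(b, "08b") for b in "\u261d".encode("utf-8"))
print(bits)  # 11100010 10011000 10011101
```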

What About Those Other Methods of Text Encoding You Listed?

Don’t worry about them. There are a ton of different standards that have existed throughout modern history. Some are for other languages like KOI8-R (Russian). Some are for specific platforms and were developed independently like MacRoman (for old Mac machines) and IBM EBCDIC (IBM’s mainframe encoding, which grew out of punch-card codes rather than ASCII). I want to focus on Latin-1 and UTF-8 because they see a lot of use and they segue nicely into my next section...

How Does This Concern SAS 9.4 Grid?

I've been creating a workshop for Implementing SAS Grid Manager on SAS 9.4M8 and, now that it's out, SAS 9.4M9. Going from SAS 9.4M8 to SAS 9.4M9 in the Implementing SAS Grid Manager workshop exercises was easy! Or it would’ve been if not for a single hitch in the process. This particular obstacle led me down a road of questions which turned into this very post, and I believe I’ve found my answer as to what’s happening.

Originally, I believed the error I got was due to a bug, but I’m now convinced that it’s merely a change. I’ll walk you through my process and show what I ran into along the way.

First, choosing the encoding method for SAS Grid to use:

During initial deployment you’re given the option to “Configure as a Unicode server” every time you run through the deployment steps within a given tier (In this environment we have 3 tiers: Metadata/Compute Tier, Middle Tier, and Client Tier). It’s important all your deployed tiers have the same option selected, and historically we’ve left this option unchecked as our grid workshop tries to follow the defaults wherever possible for the sake of simplicity.

Leaving the box unchecked gives you Latin-1 text encoding for your SAS sessions, though you wouldn’t necessarily know it by looking at the paragraph above the checkbox. Latin-1 is ordinarily sufficient for the purposes of running SAS code in English, and it has the added benefit of utilizing less memory. If I had to guess, this small benefit is likely why it’s unchecked by default. SAS is all about performance, and that’d include reading and encoding data as well as optimizing the storage of data. It could also be a holdover from legacy systems. SAS along with other software companies in the United States and other Latin-language-based nations would’ve begun standardizing on Latin-1 for their text encoding around the 1990s. Like the deployment wizard says: you can potentially miss out on the ability to share SAS data with others who use default encodings. Nowadays you should almost always standardize to UTF-8, so any new deployers should be checking this box almost universally... but I didn’t.

Next, the problem:

The deployment proceeded as normal. The post-deployment proceeded as normal. Even most of the client section of the exercises proceeded as normal. There were a few improvements in M9, which I got to take advantage of, but no hitches or issues. Unfortunately, once I hit the tests for our Data Integration Studio exercise I finally ran into a problem. This is what it looks like when you run our project in SAS 9.4M8:

Lots of satisfying green “completed successfully” checkmarks.

Here’s what happens when we run the same project in SAS 9.4M9:

Okay! Not a problem. We can just do as the error says and check our encoding= and locale= SAS system options. This is what we get for both in SAS 9.4M8 and in SAS 9.4M9:

So... it’s not that. We get ENCODING=LATIN1 and LOCALE=EN_US as expected (considering the values we entered in the initial deployment).

The chase for answers:

I spent a couple of hours tagging colleagues, reviewing system options, and browsing support tickets. In particular, I noted this interesting post linked on one support ticket about a man named Oliver who lost his last name. It wasn’t looking good for me, as I really didn’t want to have to comb through all the project files, the dataset, and the metadata to find a “weird” non-Latin-1 character that must be causing all this headache. Eventually though, it had to be done.

The quest for the weird character:

I pulled all the project files out onto the Windows client and decided to use a Windows PowerShell script instead of checking everything by hand. This is what the script looked like:

# folder containing the XML files
$folderPath = 'C:\Users\student\Documents\Package1'

# get all XML files in the folder
$files = Get-ChildItem -Path $folderPath -Filter *.xml

# make a found flag
$found = $false

# loop through each file
foreach ($file in $files) {
    # read entire file as a single string
    $content = [System.IO.File]::ReadAllText($file.FullName)

    # check every character's numeric value against the threshold
    for ($i = 0; $i -lt $content.Length; $i++) {
        $char = $content[$i]
        $code = [int][char]$char

        if ($code -gt 255) {
            $found = $true
            Write-Host ("weird char in {0} at position {1}: '{2}' (code {3})" -f $file.Name, $i, $char, $code)
        }
    }
}

# success message if no weird characters are found
if (-not $found) {
    Write-Host "no weird characters found"
}

I ran it once to check for characters with values greater than 255 (beyond Latin-1's capabilities) and once to check for characters with values greater than 127 (which could pose a conversion risk from Latin-1 to UTF-8).
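The two runs differ only in the threshold. As a rough Python equivalent (not the script I actually ran; the function name find_chars_above and the sample string are made up), the same check with the limit as a parameter looks like this. One caveat: Python reports whole code points, while the PowerShell loop sees UTF-16 code units, so characters above U+FFFF would be reported differently.

```python
def find_chars_above(text: str, limit: int) -> list[tuple[int, str, int]]:
    """Return (position, character, code point) for each character above `limit`."""
    return [(i, ch, ord(ch)) for i, ch in enumerate(text) if ord(ch) > limit]

sample = "caf\u00e9 \u2122"  # "cafe" with e-acute (233), a space, then the trademark sign (8482)

# Beyond Latin-1: only the trademark sign is flagged
print([(pos, code) for pos, _, code in find_chars_above(sample, 255)])
# [(5, 8482)]

# Beyond ASCII: the accented e is flagged as well
print([(pos, code) for pos, _, code in find_chars_above(sample, 127)])
# [(3, 233), (5, 8482)]
```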

For both runs I got... drumroll please.......

weird char in PackageMetaData at position 5: '�' (code 65533)

weird char in PackageMetaData at position 6: '�' (code 65533)

weird char in PackageMetaData at position 8: 'ߑ' (code 2001)

weird char in PackageMetaData at position 9: '�' (code 65533)

weird char in PackageMetaData at position 13: '�' (code 65533)

weird char in PackageMetaData at position 17: '�' (code 65533)

weird char in PackageMetaData at position 22: '�' (code 65533)

So there’s our culprit! Here’s what PackageMetaData looks like in the first 22 columns:

Despite how it looks, plenty of whitespace characters occupy everything from the “A” all the way to the trademark symbol, making the “T” in “This” the 23rd column of the file.
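A note on that output: code 65533 is U+FFFD, the Unicode replacement character. A decoder substitutes it for any bytes it can't interpret in the encoding it assumes, so those lines suggest the file contains bytes that aren't valid UTF-8 (which I believe is what ReadAllText assumes when there's no byte order mark), rather than literal '�' characters. Here is a small Python sketch of the effect using made-up bytes, not the real contents of PackageMetaData:

```python
# Hypothetical bytes for illustration only, NOT the actual file contents
raw = b"\x80\x81\xdf\x91\xff"

# errors="replace" swaps each undecodable byte for U+FFFD instead of raising
decoded = raw.decode("utf-8", errors="replace")
print([ord(ch) for ch in decoded])  # [65533, 65533, 2001, 65533]
```

In this sketch the bytes DF 91 happen to form a valid two-byte UTF-8 sequence (code 2001), while the stray bytes around them each become 65533.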

Where SAS 9.4M8 might’ve silently tried to replace the abnormal characters, SAS 9.4M9 tells us explicitly that there is an error in need of correcting. I do wish it told me where the error was coming from... but that metadata info isn’t something that’s readily available to normal users from the open project window, so I can understand the hesitation.

(When I searched the SAS documentation, I could not find any references relating to a change about the stricter enforcement of text encoding discrepancies. I'm just making an assumption based on what I see)

This test project “Drive_Extract_EmpInfo” was at one point probably made in an environment using UTF-8 encoding. We can guess this from the garbled mess of characters we discovered, but we can also use a more readable clue:

One of the first things I noticed when snooping around the extracted project files was the presence of this UTF-8 encoding specification in the header of each XML file. I don’t think this by itself is enough for SAS to throw an error, but it does support my theory of the project being created in a UTF-8 environment. Those garbled characters could’ve appeared because of copy paste issues or they may have been intentionally placed there. One can hope the original author had a good reason and wasn’t out to maliciously slight me from years in the past, but it’s always a possibility.
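If you'd rather check that header across many files than eyeball each one, a sketch like the following pulls the declared encoding out of an XML declaration. It's written in Python for illustration; the function name declared_encoding and the sample header strings are assumptions, not copied from the project files.

```python
import re
from typing import Optional

def declared_encoding(first_line: str) -> Optional[str]:
    """Return the encoding named in an XML declaration, or None if absent."""
    match = re.search(r'<\?xml[^>]*\bencoding=["\']([^"\']+)["\']', first_line)
    return match.group(1) if match else None

print(declared_encoding('<?xml version="1.0" encoding="UTF-8"?>'))  # UTF-8
print(declared_encoding('<?xml version="1.0"?>'))                   # None
```

Keep in mind the declaration only states what the file claims to be; it doesn't guarantee the bytes inside actually match.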

Regardless -- I’ll be checking this box from now on! (╯°□°）╯︵ ┻━┻

Related Links

A transcoding story (or, How Oliver S. Füßling lost his last name and comes to find it again) - SAS ...

The SAS® Encoding Journey: A Byte at a Time <-- very cool read if you want more in-depth text encoding information

Knowledge Article View - Customer Support

SASKiwi · ‎10-11-2025

Thanks so much for the interesting read!

I have an interesting encoding story too. Where I work we are in the middle of a major SAS 9.4 upgrade project (M2 to M8, both running on Windows Server). The M2 encoding is Latin1, but during the M8 design phase it was recommended we use UTF-8 to future proof us for handling multiple languages. So far so good. It wasn't apparent to me until our M8 installation was complete and we started to migrate data from our old platforms that changing the encoding default means you have to rewrite all SAS datasets using SAS functionality instead of just using much faster OS copy functionality. That means migrating SAS data, instead of taking a few days to copy, now takes weeks (we are talking multiple platforms and many terabytes here). Yes, you can do it with the likes of PROC DATASETS, but it sure is slow! This is not the journey I anticipated with a simple decision to use a different encoding standard.

AndersS · ‎10-21-2025

Hi! Very good and interesting.
The next step is to understand a sentence like ”Å i åa ä e ö”
- Please note the definite article "a".

This is a local Swedish language. "And in the river there is an island".
I quoted this many years ago to show SAS Institute developers that
"Yes! We do use National characters a lot".
The support for National characters has been VERY WELL implemented by SAS Institute. /Br Anders Sköllermo - Skollermo in English