PaigeMiller
Diamond | Level 26

In a different thread, @RichardDeVen proposed that I launch SAS from the SAS 9.4 (Unicode Support) icon instead of the SAS 9.4 (English) icon, which I have always used.


What are the advantages and disadvantages of using each?

--
Paige Miller
9 REPLIES
Tom
Super User

The difference is what value is set for the system option ENCODING.

By default, the "English" Start menu shortcut uses a single-byte encoding, WLATIN1, and the "Unicode Support" shortcut uses a multibyte encoding, UTF-8.
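To confirm which encoding a given session is actually running with, something like the following should work. It is a minimal sketch: it relies only on the ENCODING system option mentioned above and on the SYSENCODING automatic macro variable.

```sas
/* Print the current value of the ENCODING system option to the log */
proc options option=encoding;
run;

/* The same value is also available in an automatic macro variable */
%put NOTE: Session encoding is &sysencoding;
```

ENCODING is set when SAS starts, so to change it you launch a different shortcut (or configuration) rather than issuing an OPTIONS statement mid-session.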

 

With a single-byte encoding, a string can represent only 256 possible characters. With a multibyte encoding such as UTF-8, a string can represent any Unicode character. If you read data from a file or remote database that contains characters outside 7-bit ASCII, a single-byte encoding might not be able to represent some of them.

So if you EVER have to deal with strings that have characters beyond the old 7-bit ASCII codes, you are better off using "Unicode Support". The trade-off is that your string-handling code can no longer treat one byte of a string as one character. For string manipulation you might need the K... series of functions: KLENGTH(), KSUBSTR(), KTRANSLATE(), etc.
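As a rough illustration of the byte-versus-character distinction, the sketch below compares LENGTH() and SUBSTR() with their K counterparts. The variable names and the sample value are made up for illustration, and it assumes the program file is saved in the same encoding as the session.

```sas
data _null_;
  length name $20 first2b first2c $8;
  name = 'Müller';                 /* sample value, illustration only */

  bytes = length(name);            /* counts bytes */
  chars = klength(name);           /* counts characters */

  first2b = substr(name, 1, 2);    /* first two bytes */
  first2c = ksubstr(name, 1, 2);   /* first two characters */

  put bytes= chars=;
  put first2b= $hex8. first2c= $hex8.;   /* hex view avoids printing a broken character */
run;
```

In a UTF-8 session the ü occupies two bytes, so BYTES should come out one larger than CHARS, and SUBSTR() should cut the ü in half while KSUBSTR() keeps it whole. In a WLATIN1 session the two pairs of results should match.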

ChrisHemedinger
Community Manager

Unicode (UTF-8) is the encoding supported by most applications we use these days. It can handle virtually any character data (national characters, emojis, etc.). I've been running SAS exclusively with ENCODING=UTF8 for years. It reduces the cases of the infamous "data cannot be transcoded" error that you encounter when trying to use data created in one encoding within a SAS session that has an incompatible encoding. UTF-8 is essential for dealing with data coming from the internet or REST APIs hosted in software-as-a-service offerings.

 

SAS Viya runs using UTF-8 by default.

 

Because characters outside ASCII require more than 1 byte each in UTF-8, the data sets can be larger -- so that's a downside. If you know you will only ever deal with one encoding (usually WLATIN1 on Windows or LATIN9 on Unix) then maybe you can delay the transition.

SAS For Dummies 3rd Edition! Check out the new edition, covering SAS 9.4, SAS Viya, and all of the modern ways to use SAS!
PaigeMiller
Diamond | Level 26

@ChrisHemedinger wrote:

Because characters outside ASCII require more than 1 byte each in UTF-8, the data sets can be larger -- so that's a downside. If you know you will only ever deal with one encoding (usually WLATIN1 on Windows or LATIN9 on Unix) then maybe you can delay the transition.

I think this is a big factor ... our data sets are very large and we have turned on compression for all data sets. Since I rarely (almost never, except for occasional graphics output) use characters beyond the 256 available in a single-byte encoding, I think I'll stick with SAS 9.4 (English) for now.

 

Am I correct in thinking that numeric variables in SAS data sets would use the same amount of space regardless of the choice of SAS 9.4 (English) vs SAS 9.4 (Unicode Support)?

--
Paige Miller
ChrisHemedinger
Community Manager

Correct for numeric -- byte length is the same. It's also the same for most of the characters you likely use, the first 128 ASCII chars. Worth testing for your data sets to see if the size really changes much.

 

Also, despite the extensive notes on K* functions for dealing with characters here, I find I rarely need them. Most operations work fine without needing to change your code to accommodate UTF-8. Main thing is byte-length for character variables -- you must allocate enough space in the event these need more.

 

See this paper for the details, quoted here.

 

UTF-8 is a multibyte encoding that represents all of the characters available in Unicode. UTF-8 is backward compatible with ASCII characters, which include the letters of the English alphabet, digits, and symbols frequently used in punctuation or SAS syntax. The 128 characters that make up the ASCII character set are each represented as one byte in UTF-8.


Therefore, when the ASCII characters in your data are converted to UTF-8, the size of those characters does not change. All of the other characters available in UTF-8 require 2, 3, or 4 bytes in memory. This includes many characters that are represented with a single byte of memory in the SBCS character encodings. For more information about the encodings that are supported by SAS, see the section “Encoding for NLS” in the SAS® 9.4 National Language Support (NLS): Reference Guide.

 

PaigeMiller
Diamond | Level 26

I just tested a large database extract; the size of the resulting SAS data set was identical regardless of SAS 9.4 (English) or SAS 9.4 (Unicode Support).

--
Paige Miller
Tom
Super User

@PaigeMiller wrote:

I just tested a large database extract; the size of the resulting SAS data set was identical regardless of SAS 9.4 (English) or SAS 9.4 (Unicode Support).


SAS will not automatically change the storage length for you.  You need to know your data and adjust as needed.

 

The issue is that representing non-ASCII characters takes more bytes. A variable that is defined as 8 bytes long can hold 8 characters with a single-byte encoding, but with UTF-8 encoding 8 bytes might only be long enough to store 2 characters. If you never use accented characters or special symbols like Microsoft "stupid" quotes then nothing needs to change. But if you have a lot of accented characters that require two or more bytes in UTF-8 and only one byte in LATIN1, then you might need to make your character variables longer than they currently are.

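data test;
  length sbc $256 utf8 $1024 ;
  /* every character of the single-byte collating sequence */
  sbc=collate(0,256);
  /* transcode the string from LATIN1 to UTF-8 */
  utf8=kcvt(sbc,'latin1','utf-8');
  byte1=length(sbc);    /* bytes in the single-byte string */
  byte2=length(utf8);   /* bytes in the UTF-8 string */
  char2=klength(utf8);  /* characters in the UTF-8 string */
  put (byte: char:) (=);
run;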

byte1=256 byte2=401 char2=256
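The same effect on a single variable can be sketched as follows. The variable names and sample text are hypothetical, and the step assumes the program file is saved in the session encoding.

```sas
data _null_;
  length short $8 long $32;

  /* eight accented characters: 8 bytes in LATIN1, 16 bytes in UTF-8 */
  long  = 'éééééééé';

  /* $8 means 8 bytes, not 8 characters; extra bytes are dropped silently */
  short = long;

  n_long  = klength(long);
  n_short = klength(short);
  put n_long= n_short=;
run;
```

In a UTF-8 session this should report n_long=8 and n_short=4, because only four two-byte characters fit in the 8-byte variable. In a WLATIN1 session both counts should be 8.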

 

PaigeMiller
Diamond | Level 26

As far as I know, my databases don't have any accented characters, funny mathematical symbols or unusual quotes. So those are not a worry.

 

I have one additional question ... I did a test, to see if a data set created by the SAS 9.4 (Unicode Support) could be read by someone using SAS 9.4 (English) (remember, my databases probably don't contain any multi-byte characters), and I was able to verify that the database could be read and used by SAS 9.4 (English). But one test doesn't prove anything, and could there be something else in my case that might cause problems reading the SAS data sets created by SAS 9.4 (Unicode support) using SAS 9.4 (English)?

--
Paige Miller
ChrisHemedinger
Community Manager

It's better if everyone in the organization uses the same encoding. Yes, CEDA (cross-environment data access) does ensure that a SAS session can read and process data that was created in a different encoding. But CEDA is slower for processing, and if you need to update the data it will need to be rewritten or else handled using encoding-aware code. It's easier to not have to think about that.
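One way to see which encoding a data set was written in, and to make a copy in the current session's encoding so that later steps do not go through CEDA, might look like this. The libref, path, and data set names are hypothetical placeholders.

```sas
/* Hypothetical library path and data set name, for illustration only */
libname mylib 'C:\data';

/* The attributes section of the output includes an "Encoding" row */
proc contents data=mylib.sales;
run;

/* A newly created data set is written in the session encoding */
data work.sales_native;
  set mylib.sales;
run;
```

When CEDA is used, SAS normally writes a note about cross-environment data access to the log. If the copy goes from a single-byte encoding to UTF-8 and the data contain non-ASCII characters, check the character variable lengths first, since transcoding does not lengthen them.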

 

I think that UTF-8 makes things easier for the long run as eventually you will encounter situations where you need it. It's the default approach for SAS Viya and for pretty much all new SAS configurations that we're involved in establishing. The legacy encodings are necessary and important for compatibility with other systems, but I always recommend UTF-8 if you're not restricted by those.

FreelanceReinh
Jade | Level 19

@Tom wrote:
(...) accented characters that require two or more bytes in UTF-8 and only one byte in LATIN1 ...

This is a very important point. European users familiar with their national accented characters (which they know have character codes between 128 and 255 in LATIN1) might not expect that switching to "Unicode Support" would cause such problems: their Äs and Ös are now treated (as multi-byte characters) as if they were related to emojis or ancient Egyptian hieroglyphs. (Okay, there are a few similarities: Ü 😊 ...)
