In a different thread, @RichardDeVen proposed I launch SAS from the SAS 9.4 (Unicode Support) icon, instead of what I have always used which is the SAS 9.4 (English) icon.
What are the advantages and disadvantages of using each?
The difference is what value is set for the system option ENCODING.
By default, the "English" Start menu shortcut uses a single-byte encoding, WLATIN1, and the "Unicode Support" shortcut uses a multi-byte encoding, UTF-8.
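To see which encoding a given shortcut actually started, you can query the session itself. This is a minimal sketch. It uses only the ENCODING system option named above and the automatic macro variable SYSENCODING, and the value reported depends on how your installation is configured.

```sas
/* Report the ENCODING system option for the current session. */
proc options option=encoding;
run;

/* The same information is available as an automatic macro variable. */
%put NOTE: Session encoding is &sysencoding;
```

ENCODING is set when SAS starts, so switching between the two means launching a new session from the other shortcut (or with a different configuration file) rather than changing the option mid-session.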
With a single-byte encoding, strings can represent only 256 possible characters. With a multi-byte encoding, strings can represent thousands of different characters. If you try to read data from a file or remote database that contains characters outside 7-bit ASCII, a single-byte encoding might not be able to represent some of them.
So if you EVER have to deal with strings that have characters beyond the old 7-bit ASCII codes, you are better off using "Unicode Support". The trade-off is that your string-handling code can no longer treat one byte of a string as one character of the string. So for string manipulations you might need to use the K... series of functions: KLENGTH(), KSUBSTR(), KTRANSLATE(), etc.
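The byte-versus-character distinction can be illustrated with a small DATA step. This is a sketch that assumes a UTF-8 session and a program file saved as UTF-8. The variable names and the sample word are made up for illustration.

```sas
/* Assumes ENCODING=UTF-8 and that this program text is itself UTF-8. */
data _null_;
  length word $20;
  word = 'Äpfel';                    /* 'Ä' occupies 2 bytes in UTF-8 */
  bytes = length(word);              /* LENGTH counts bytes */
  chars = klength(word);             /* KLENGTH counts characters */
  first_char = ksubstr(word, 1, 1);  /* whole first character, 'Ä' */
  first_byte = substr(word, 1, 1);   /* only the first byte of 'Ä' */
  put bytes= chars= first_char=;
run;
```

Under those assumptions the log should show bytes=6 and chars=5. SUBSTR returns only half of the two-byte character, which is the kind of result the K functions are meant to avoid.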
Unicode (UTF-8) is the encoding supported by most applications we use these days. It can handle virtually any character data (national characters, emojis, etc.). I've been running SAS exclusively with ENCODING=UTF8 for years. It reduces the cases of the infamous "data cannot be transcoded" error that you encounter when trying to use data created in one encoding within a SAS session that has an incompatible encoding. UTF-8 is essential for dealing with data coming from the internet or REST APIs hosted in software-as-a-service offerings.
SAS Viya runs using UTF-8 by default.
Because UTF-8 requires more than 1 byte for each non-ASCII character, data sets can be larger -- so that's a downside. If you know you will only ever deal with one encoding (usually WLATIN1 on Windows or LATIN9 on Unix) then maybe you can delay the transition.
@ChrisHemedinger wrote:
Because UTF-8 requires more than 1 byte for each non-ASCII character, data sets can be larger -- so that's a downside. If you know you will only ever deal with one encoding (usually WLATIN1 on Windows or LATIN9 on Unix) then maybe you can delay the transition.
I think this is a big factor ... our data sets are very large and we have turned on compression for all data sets. Since I rarely (almost never, except for occasional graphics output) use Unicode characters beyond the 256 available in a single-byte encoding, I think I'll stick with SAS 9.4 English for now.
Am I correct in thinking that numeric variables in SAS data sets would use the same amount of space regardless of the choice of SAS 9.4 (English) vs SAS 9.4 (Unicode Support)?
Correct for numeric -- byte length is the same. It's also the same for most of the characters you likely use, the first 128 ASCII chars. Worth testing for your data sets to see if the size really changes much.
Also, despite the extensive notes on K* functions for dealing with characters here, I find I rarely need them. Most operations work fine without needing to change your code to accommodate UTF-8. Main thing is byte-length for character variables -- you must allocate enough space in the event these need more.
See this paper for the details, quoted here.
UTF-8 is a multibyte encoding that represents all of the characters available in Unicode. UTF-8 is backward compatible with ASCII characters, which include the letters of the English alphabet, digits, and symbols frequently used in punctuation or SAS syntax. The 128 characters that make up the ASCII character set are each represented as one byte in UTF-8.
Therefore, when the ASCII characters in your data are converted to UTF-8, the size of those characters does not change. All of the other characters available in UTF-8 require 2, 3, or 4 bytes in memory. This includes many characters that are represented with a single byte of memory in the SBCS character encodings. For more information about the encodings that are supported by SAS, see the section “Encoding for NLS” in the SAS® 9.4 National Language Support (NLS): Reference Guide.
I just tested a large database extract, the size of the resulting SAS data set was identical regardless of SAS 9.4 (English) or SAS 9.4 (Unicode Support).
@PaigeMiller wrote:
I just tested a large database extract, the size of the resulting SAS data set was identical regardless of SAS 9.4 (English) or SAS 9.4 (Unicode Support).
SAS will not automatically change the storage length for you. You need to know your data and adjust as needed.
The issue is that representing non-ASCII characters will take more bytes. So a variable that is defined as 8 bytes long can hold 8 characters with a single-byte encoding. But with UTF-8 encoding, 8 bytes might only be long enough to store 2 characters. If you never use accented characters or special symbols like Microsoft "stupid" quotes then nothing needs to change. But if you have a lot of accented characters that require two or more bytes in UTF-8 and only one byte in LATIN1 then you might need to make your character variables longer than they currently are.
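    data test;
      length sbc $256 utf8 $1024;
      sbc = collate(0, 256);                 /* the single-byte collating sequence */
      utf8 = kcvt(sbc, 'latin1', 'utf-8');   /* transcode it to UTF-8 */
      byte1 = length(sbc);                   /* bytes in the single-byte string */
      byte2 = length(utf8);                  /* bytes in the UTF-8 string */
      char2 = klength(utf8);                 /* characters in the UTF-8 string */
      put (byte: char:) (=);
    run;

    byte1=256 byte2=401 char2=256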
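The same KCVT approach can be turned on real data to estimate how long a character variable would need to be under UTF-8. This is a rough sketch meant to run in the single-byte (WLATIN1) session. SASHELP.CLASS and its NAME variable are stand-ins for your own data set and variable, and the $200 working length is an arbitrary assumption.

```sas
/* Run in a WLATIN1 session; SASHELP.CLASS and NAME are placeholders. */
data _null_;
  set sashelp.class end=last;
  length as_utf8 $200;                         /* assumed large enough for the test */
  retain max_bytes 0;
  as_utf8 = kcvt(name, 'wlatin1', 'utf-8');    /* value as it would be stored in UTF-8 */
  max_bytes = max(max_bytes, length(as_utf8)); /* LENGTH counts bytes */
  if last then put 'Bytes needed for NAME in UTF-8: ' max_bytes;
run;
```

If the reported number is no larger than the variable's current length, that variable should not need to grow. If it is larger, the variable needs a longer LENGTH before the data moves to a UTF-8 session.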
As far as I know, my databases don't have any accented characters, funny mathematical symbols or unusual quotes. So those are not a worry.
I have one additional question ... I did a test, to see if a data set created by the SAS 9.4 (Unicode Support) could be read by someone using SAS 9.4 (English) (remember, my databases probably don't contain any multi-byte characters), and I was able to verify that the database could be read and used by SAS 9.4 (English). But one test doesn't prove anything, and could there be something else in my case that might cause problems reading the SAS data sets created by SAS 9.4 (Unicode support) using SAS 9.4 (English)?
It's better if everyone in the organization uses the same encoding. Yes, CEDA (cross-environment data access) does ensure that a SAS session can read and process data that was created in a different encoding. But CEDA is slower for processing, and if you need to update the data it will need to be rewritten or else handled using encoding-aware code. It's easier to not have to think about that.
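To check which encoding a data set was written in, and to write a copy in a specific encoding, something like the following can be used. This is a sketch only. The libref MYLIB and the data set names are hypothetical, and the ENCODING= data set option shown here is one way to control the output encoding, not necessarily what the posters above use.

```sas
/* MYLIB and SALES are hypothetical names; substitute your own. */

/* PROC CONTENTS lists the Encoding attribute of the data set. */
proc contents data=mylib.sales;
run;

/* Write a copy in WLATIN1 for colleagues who run SAS 9.4 (English). */
data mylib.sales_wlatin1(encoding=wlatin1);
  set mylib.sales;
run;
```

If the data contains characters that WLATIN1 cannot represent, the copy step can fail with a transcoding error. That is one more argument for agreeing on a single encoding across the organization.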
I think that UTF-8 makes things easier for the long run as eventually you will encounter situations where you need it. It's the default approach for SAS Viya and for pretty much all new SAS configurations that we're involved in establishing. The legacy encodings are necessary and important for compatibility with other systems, but I always recommend UTF-8 if you're not restricted by those.
@Tom wrote:
(...) accented characters that require two or more bytes in UTF-8 and only one byte in LATIN1 ...
This is a very important point. European users familiar with their national accented characters (which they know have ASCII codes between 128 and 255 -- in LATIN1) might not expect that switching to "Unicode Support" would cause such problems: Their Äs and Ös are now treated (as multi-byte characters) as if they were related to emojis or ancient Egyptian hieroglyphs. (Okay, there are a few similarities: Ü 😊 ...)