Solaris | Encoding

MariaD · Posted 07-15-2020 11:12 AM

Hi folks,

We are using a new SAS environment on Solaris. Our users have some folders and tables named with Latin characters (for example: "ó").

In PuTTY we verified that the folder exists with the expected name: "/usuarios/nuevo_usuario/Histórico"

In SAS, using SAS EG 7.15, we define a libname with that path. According to the log, the libname was correctly assigned to it.

26         libname test '/usuarios/nuevo_usuario/Histórico';
NOTE: Libref TEST was successfully assigned as follows: 
      Engine:        V9 
      Physical Name: /usuarios/nuevo_usuario/Histórico

28         data test.sample;
29           set sashelp.class;
30         run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set TEST.SAMPLE has 19 observations and 5 variables.
NOTE: Compressing data set TEST.SAMPLE increased size by 100.00 percent. 
      Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

We execute a simple data step to create a table. According to the log, again, everything is fine. But when we check the server using PuTTY, we see that the table was created inside a different path: /usuarios/nuevo_usuario/HistÃ³rico

As you can see, the "ó" character was converted from Latin to UTF. Is there any way to prevent this? Is there a SAS or Solaris configuration setting for it?

Regards,

LeonidBatkhan · Posted 07-15-2020 03:14 PM

Hi MariaD,

Check your sasv9.cfg file (located somewhere under ...SASHome/SASFoundation/9.4/nls/en) and make sure it sets the following encoding:

-ENCODING wlatin1

Hope this helps.

➤ Leonid's SAS blog

MariaD · Posted 07-15-2020 03:44 PM

Thanks @LeonidBatkhan, I'll try it and let you know. Do I need to restart the SAS services after the change? I believe it's not necessary.

Kurt_Bremser · Posted 07-16-2020 08:27 AM

As @ChrisNZ noted, my native language is German, and we do have the umlauts (Ä,Ö,Ü) and the so-called "sharp s" ß. Before those letters even appeared on mechanical devices (like IBM typewriters with German heads), it was common to translate those to ae, oe, ue and ss, respectively. When I started in IT, no computer knew those letters, so the translation was even used in official documents, if those were printed by computers.

Nowadays, users are tempted to use these characters in filenames or elsewhere in a technical context.

But see:

The problem of native-language characters was approached in stages. First, some bytes in the ASCII table were used for different purposes in different parts of the world (e.g., square brackets were replaced by Ä and Ü in certain German codepages, as square brackets are rarely used in German). This was done to keep using only 7 bits per character (look into the manuals of older terminals to understand this).

Next, the "upper half" (most significant bit = 1) of the 8-bit table was used, as found in the codepages of UNIX systems and Windows, which accommodated most Western European orthography.

But with the need to support completely different writing systems (Cyrillic, Arabic, Hebrew, Korean, Japanese, Chinese), a much larger table of characters/symbols was implemented with the UTF system. The problem now is that UTF characters need "entry points" from the single-byte character table so that they can be identified as UTF, but the bit sequences of those "entry points" are already used in certain codepages for displayable characters. If a data stream is not explicitly marked as UTF, lots of systems mistake the two-, three- or four-byte sequence of a UTF character for the usual characters associated with each of those bytes, with hilarious (or not so hilarious) results.
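A minimal sketch of that misreading, written in Python rather than SAS so it can be run anywhere. It only assumes the folder name from this thread; it shows what happens when UTF-8 bytes are displayed by something that expects a single-byte Latin-1 codepage, and says nothing about which component on the Solaris server is actually doing it.

```python
# Sketch: a UTF-8 byte sequence read as if it were a single-byte codepage.
original = "Histórico"

utf8_bytes = original.encode("utf-8")    # 'ó' becomes two bytes: C3 B3
misread = utf8_bytes.decode("latin-1")   # each byte is shown as its own Latin-1 character

print(utf8_bytes.hex(" "))
print(misread)

# Undoing the wrong step recovers the name, as long as no bytes were lost on the way.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired)
```

The two characters that replace "ó" are simply the Latin-1 glyphs for the bytes C3 and B3, which is why the garbled name is always one character longer per accented letter.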

While desktop systems have learned to deal with this, there are areas in computing where UTF use can have serious consequences. Consider the fact that operating systems (including Windows!) have a hard limit of ~200 characters (bytes) for path names. A fully Chinese filename would only need 50 or so UTF characters/symbols to exceed that.
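To make the characters-versus-bytes point concrete, here is a small Python sketch. The 200-byte limit is just the rough figure from the post, used as an assumption; real limits differ by operating system and filesystem, so treat the constant as a placeholder.

```python
# Sketch: byte length of a name grows faster than its character count in UTF-8.
ASSUMED_LIMIT_BYTES = 200  # rough figure quoted in the post, not a real OS constant


def utf8_length(name: str) -> int:
    """Number of bytes the name occupies when stored as UTF-8."""
    return len(name.encode("utf-8"))


latin_name = "a" * 70    # 70 ASCII letters, one byte each
cjk_name = "文" * 70      # 70 CJK characters, three bytes each in UTF-8

for label, name in (("ASCII", latin_name), ("CJK", cjk_name)):
    size = utf8_length(name)
    fits = size <= ASSUMED_LIMIT_BYTES
    print(f"{label}: {len(name)} characters, {size} bytes, fits: {fits}")
```

Both names have the same number of characters, but only the ASCII one stays under the assumed byte limit.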

There are other problems: as I experienced, there are now (at least) two single quotes available: the standard one, as also used in code, and a UTF one. That caused the "fun" with my MP3 files. Text writing software will auto-replace characters (Word, in German, will auto-replace double quotes with the "leading" and "trailing" quotes used in German, or it will use "beautified" (slightly curly) quotes; copy/pasting code with those will inevitably cause syntax errors).
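One way to catch pasted "beautified" quotes before they reach a compiler or a SAS session is to scan the text for them. The Python sketch below is only an illustration: the set of characters is a small, assumed selection (English and German typographic quotes), and the pasted statement and its path are made up.

```python
# Sketch: find and straighten typographic quotes in pasted code.
TYPOGRAPHIC_QUOTES = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201a": "'",   # single low-9 quotation mark (German opening quote)
    "\u201c": '"',   # left double quotation mark (German closing quote)
    "\u201d": '"',   # right double quotation mark
    "\u201e": '"',   # double low-9 quotation mark (German opening quote)
}
_TRANSLATION = str.maketrans(TYPOGRAPHIC_QUOTES)


def find_typographic_quotes(code: str) -> list:
    """Return (position, character) pairs for every typographic quote found."""
    return [(i, ch) for i, ch in enumerate(code) if ch in TYPOGRAPHIC_QUOTES]


def straighten_quotes(code: str) -> str:
    """Replace typographic quotes with their plain ASCII counterparts."""
    return code.translate(_TRANSLATION)


# Hypothetical statement as a word processor might have "improved" it.
pasted = "libname test \u201e/data/archive\u201c;"
print(find_typographic_quotes(pasted))
print(straighten_quotes(pasted))
```

Blind replacement also changes quotes inside string literals and comments, so reporting the positions first and fixing them by hand is often the safer choice.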

Another serious issue that happened to me: I used the letter "Ä" in a password, to follow the "at least one special character" rule. One computer sent the character from the Wlatin codepage, the other the UTF equivalent. Consequence: a locked user.

Bottom line: on the technical side (names in code, filenames, server names, user IDs), stay with the basic English-derived lettering system, and follow my previous rules; especially, avoid the use of blanks, and replace those with underscores to separate words in file or variable names. Put fancy text into labels. A label might display "funny" on a remote system, but it won't crash your program, or cause files to be un-openable.
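As a rough illustration of that rule, the Python sketch below turns a display name into an ASCII-only name with underscores. The function name `to_ascii_name` and the `german` switch are hypothetical; the German mapping follows the ae/oe/ue/ss convention mentioned earlier, and everything else simply drops the accent.

```python
# Sketch: derive an ASCII-only technical name from a display name.
import unicodedata

# Traditional German transliteration (ae, oe, ue, ss).
GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}


def to_ascii_name(name: str, german: bool = False) -> str:
    """Strip accents, drop anything non-ASCII, and join words with underscores."""
    if german:
        name = "".join(GERMAN.get(ch, ch) for ch in name)
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Characters with no ASCII base letter are dropped silently here.
    ascii_only = without_marks.encode("ascii", "ignore").decode("ascii")
    return "_".join(ascii_only.split())


print(to_ascii_name("Histórico"))
print(to_ascii_name("Año nuevo 2020"))
print(to_ascii_name("Größe Übersicht", german=True))
```

The original, accented text can then go into a label, while the sanitized name is used for the folder, file, or variable.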

If you follow these rules, somebody from the other side of the world will always be able to help you with technical issues over a remote access connection.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ChrisNZ · Posted 07-16-2020 06:18 PM

@Kurt_Bremser is so much more dedicated than I am!

My "IT is still a mess" looks lazy now, compared to his detailed comments! 🙂

Bottom line: Different systems use different character representations, and until all systems use the same one (UTF-8 seems to be becoming the new standard, but Windows is very late to the party), this issue is unavoidable.

One system writes something using encoding A, another system expects encoding B, and wheels come to a stop.

Maybe in 20 years all systems and all legacy data will be UTF-8 (here is wishful thinking), and we'll breathe easier.

Two thoughts:

1. As far as I know, Solaris's default encoding is UTF-8. Why do you have latin folder names? I wonder if Putty is confusing things here. What do you use it for?

2. Run proc options group=languagecontrol; run; to see what encoding is set for the SAS session.

High-Performance SAS Coding - Third Edition

MariaD · Posted 07-17-2020 01:28 PM

Hi @ChrisNZ ,

Here are the results of PROC OPTIONS:

Group=LANGUAGECONTROL
 DATESTYLE=MDY     Specifies the sequence of month, day, and year when ANYDTDTE, ANYDTDTM, or ANYDTTME informat data is ambiguous.
 DFLANG=ENGLISH    Specifies the language for international date informats and formats.
 DSCAS             Runs the DATA step on the CAS server.
 EXTENDOBSCOUNTER=YES
                   Specifies whether to extend the maximum number of observations in a new SAS data file.
 LOCALEDATA=SASLOCALE
                   Specifies the location of the locale database.
 NOLOGLANGCHG      Disables changing the language of the SAS log when the LOCALE= option is changed.
 NOLOGLANGENG      Write SAS log messages based on the values of the LOGLANGCHG, LSWLANG=, and LOCALE= options when SAS started.
 LSWLANG=LOCALE    Specifies the language for SAS log and ODS messages when the LOCALE= option is set after SAS starts.
 MAPEBCDICTOASCII= Specifies the transcoding table that is used to convert characters from ASCII to EBCDIC and EBCDIC to ASCII.
 NONLDECSEPARATOR  Disables formatting of numeric output using the decimal separator for the locale.
 NOODSLANGCHG      Disables changing the language of the SAS message text in ODS output when the LOCALE option is set after start 
                   up.
 PAPERSIZE=LETTER  Specifies the paper size to use for printing.
 RSASIOTRANSERROR  Displays a transcoding error when illegal values are read from a remote application.
 TIMEZONE="GMT-03:00"
                   Specifies a time zone.
 TRANTAB=(lat1wlt1,wlt1lat1,lat1_ucs,lat1_lcs,lat1_ccl,,,)
                   Specifies the translation table catalog entries.
 URLENCODING=SESSION
                   Specifies whether the argument to the URLENCODE function and to the URLDECODE function is interpreted using the 
                   SAS session encoding or UTF-8 encoding.
 NODBCS            Disables double-byte character sets.
 DBCSLANG=NONE     Specifies a double-byte character set language.
 DBCSTYPE=NONE     Specifies the encoding method that is used for a double-byte character set.
2                                                          The SAS System                              17:48 Thursday, July 16, 2020

 FSDBTYPE=DEFAULT  Specifies a full-screen double-byte character set (DBCS) encoding method.
 FSIMM=            Specifies input method modules (IMMs) for full-screen double-byte character sets (DBCS).
 FSIMMOPT=         Specifies options for input method modules (IMMs) that are used with a full-screen double-byte character set 
                   (DBCS).
 ENCODING=LATIN1   Specifies the default character-set encoding for the SAS session.
 LOCALE=EN_US      Specifies a set of attributes in a SAS session that reflect the language, local conventions, and culture for a 
                   geographical region.
 NONLSCOMPATMODE   Encodes data using the SAS session encoding.

Thanks,

MariaD · Posted 07-17-2020 01:33 PM

One additional comment: the PROC OPTIONS results for the Linux environment (where everything works fine) are exactly the same.

Regards,

ChrisNZ · Posted 07-17-2020 08:20 PM

Here is my best guess, a stab in the dark really.

Findings & facts so far (you haven't replied to some questions):

- Solaris should use UTF-8

- Putty should use UTF-8

- The folder name is displayed correctly in Putty, which makes sense as they use the same encoding: The folder name uses UTF-8 too.

- SAS uses wlatin1 for its data

- Because SAS is aware of the underlying OS, it reads the folder name correctly, using UTF-8 (1)

- A new folder is created by SAS (please confirm you have 2 folders) when you run SAS code (2)

- This new folder has a name that is not interpreted as UTF-8 by Solaris and Putty for some reason. We don't know why (3).

=> There are still many gaps in our establishing the facts.

One logical answer would be to set SAS to use UTF-8 so everything is coherent.

UTF-8 seems to be the new emerging standard anyway, so moving there is desirable.

Can you try that to see if it fixes this one issue?

Note that some data might need translating. SAS can do that on the fly using CEDA.

Notes:

(1) The folder name is displayed correctly:

26         libname test '/usuarios/nuevo_usuario/Histórico';
NOTE: Libref TEST was successfully assigned as follows: 
      Engine:        V9 
      Physical Name: /usuarios/nuevo_usuario/Histórico

(2) I don't see why a new folder should ever be created when running a data step.

The creation may be linked to the LIBNAME statement instead, provided option DLCREATEDIR is turned on.

Please confirm when the folder is created.

(3) You could look at the folder name using a hex editor to see each individual byte, and see why a seemingly legal UTF-8 byte sequence is not combined.
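If no hex editor is at hand, a few lines of Python on the server can show the raw bytes of each directory entry. This is a sketch under assumptions: the helper names are made up, the classification is only a heuristic, and which of the cases applies on the Solaris server is exactly what remains to be checked.

```python
# Sketch: dump the raw bytes of directory-entry names and guess their encoding.
import os


def describe_name(raw: bytes) -> str:
    """Return a hex dump of the name plus a heuristic guess at how it was encoded."""
    dump = raw.hex(" ")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        shown = raw.decode("latin-1")
        return f"{dump} | not valid UTF-8, probably a single-byte codepage: {shown}"
    try:
        # If the decoded text maps back to bytes that are again valid UTF-8,
        # the name was most likely encoded to UTF-8 twice.
        inner = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return f"{dump} | valid UTF-8: {text}"
    if inner != text:
        return f"{dump} | looks double-encoded UTF-8: {text} <- {inner}"
    return f"{dump} | plain ASCII: {text}"


def describe_directory(path: bytes) -> None:
    """Print one line per entry; a bytes path makes os.listdir return raw bytes."""
    for entry in os.listdir(path):
        print(describe_name(entry))


# The three byte sequences that could plausibly sit behind the names in this thread.
for candidate in (b"Hist\xf3rico", b"Hist\xc3\xb3rico", b"Hist\xc3\x83\xc2\xb3rico"):
    print(describe_name(candidate))
```

Calling `describe_directory` with the parent folder as a bytes path would print the same kind of line for the real entries, without relying on how the terminal chooses to render them.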

MariaD · Posted 07-20-2020 08:48 AM

Hi @ChrisNZ ,

Please find my comments below:

- Solaris should use UTF-8 -- It's UTF 8

- Putty should use UTF-8 -- It's UTF 8

- SAS uses wlatin1 for its data -- Yes, our previous environment (Linux) was already set to WLATIN1

- A new folder is created by SAS (please confirm you have 2 folders) when you run SAS code (2) -- No, the folder is not created when I run the libname statement. One of the folders, the correct one, was created using mkdir through PuTTY. The other one was created through SAS EG under Server --> Files --> New Folder. That last one appears as "HistÃ³rico".

- This new folder has a name that is not interpreted as UTF-8 by Solaris and Putty for some reason. We don't know why (3). -- In fact, SAS does not interpret the correct name, "Histórico", for folders or for SAS datasets. On Solaris or PuTTY, we can navigate through the folders without any problem.

Can you try that to see if it fixes this one issue? -- I'll run a test. But in our previous environment (Linux) we have exactly the same configuration and it works fine there.

Note that some data might need translating. SAS can do that on the fly using CEDA. -- Yes, we use the CPORT procedure to translate from Linux to Solaris.

Here is an example to show what is going on.

Case 1:

a) I created a folder called "Histórico" through PuTTY.

[Screenshot: Screen Shot 2020-07-20 at 09.37.51.png]

b) I try to assign a libname TEST using the folder "Histórico" in SAS EG.

25         GOPTIONS ACCESSIBLE;
26         libname TEST '/usaurios/sasdemo/Histórico';
NOTE: Library TEST does not exist.

Case 2

a) I deleted the folder created previously. I created a folder called "Histórico" again, but this time through SAS EG (Server --> Files --> New Folder). If I navigate to the folder using PuTTY, the folder now appears as "HistÃ³rico".

[Screenshot: Screen Shot 2020-07-20 at 09.36.36.png]

b) I run the same libname statement and now the libname is assigned.

25         GOPTIONS ACCESSIBLE;
26         libname TEST '/usuarios/sasdemo/Histórico';
NOTE: Libref TEST was successfully assigned as follows: 
      Engine:        V9 
      Physical Name: /export/home/sasdemo/Histórico

Regards,

Kurt_Bremser · Posted 07-20-2020 10:02 AM

What happens if you use the DCREATE() function?

Is the directory accessible through the LIBNAME statement?

How is it displayed in putty?

From your description, I now have the suspicion that EG and SAS use UTF-8 encoding when sending the name to the filesystem, but putty recognizes the two-byte UTF sequence as two separate WLATIN characters.

Do you have legacy directories that were created on the Linux system and then either mounted to your Solaris server, or copied through NFS? If yes, how do those display in EG and putty?

How did you determine that the Solaris server is configured to use UTF?

MariaD · Posted 07-20-2020 05:51 PM

Hi @Kurt_Bremser ,

I created a new folder, called Histórico, using the DCREATE function. After that, I assigned this directory to a libname. Everything looks fine in SAS EG. But if I check it on our server, the folder was created as "HistÃ³rico".

Yes, we have legacy directories that were created on the Linux system and then mounted to Solaris. In SAS EG, if I try to create a LIBNAME using one of them, an error appears because SAS EG does not recognise the folder.

Regards,

Tom · Posted 07-20-2020 05:56 PM

What encoding is your SAS session using? Check the ENCODING option.

%put %sysfunc(getoption(encoding));

What encoding is SAS Enterprise Guide using? I have no idea how to check that as I don't use that interface.

MariaD · Posted 07-20-2020 06:21 PM

Hi @Tom ,

Here are the results:

26         %put %sysfunc(getoption(encoding));
LATIN1

The result is exactly the same on our previous server, and I use the same version of SAS EG (on the same client machine) to connect to it.

Regards,

Kurt_Bremser · Posted 07-21-2020 12:05 AM

So your whole new setup seems to work with UTF (even SAS), and only the commandline via putty doesn't.

You should open a track with SAS technical support as to why a SAS session with WLATIN1 encoding creates UTF-encoded directories. I don't think any of the helpers here has access to a Solaris server to recreate the issue.

Your best bet, though, is to replace all those accented or otherwise special characters with their standard counterparts. The length to which this thread has grown (without leading to a practicable solution), and the need to involve SAS TS shows how unintelligent the use of special characters in server filesystems is, and the rules I posted earlier gain more and more weight.

ChrisNZ · Posted 07-20-2020 07:43 PM

Thank you for these detailed explanations.

From your explanations, it appears that EG fails to create the directory properly.

I'd be tempted to say: Don't! The dlcreatedir option is a better way.

More seriously, it seems that this is a faulty EG behaviour, and should be tracked with Tech Support.

One piece of information that might be useful: What happens when you don't delete the Putty-generated folder before creating it with EG? This doesn't mean much as EG might be able to recognise folder names correctly even if it fails to create folders correctly, but it is interesting nonetheless.
