DATA Step, Macro, Functions and more

Error on UTF-8 server with dataset using Wlatin1

Contributor ckx
Posts: 54

We're experimenting with a SAS server configured for UTF-8 encoding alongside our standard configuration, which uses Wlatin1. One very strange thing I came across was that a dataset created with the standard Wlatin1 encoding produced an error message when read on the UTF-8 server.

 

ERROR: Some character data was lost during transcoding in the dataset ADAM.ADLB. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.

 

Other datasets gave no such problems. I checked the datasets using proc contents. The datasets used the Wlatin1 encoding and had been compressed using the CHAR method.
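
For anyone wanting to repeat this check, a minimal sketch might look like the following. It assumes the libref ADAM is already assigned in the session, as in the error message above. In the PROC CONTENTS output the relevant rows should be "Encoding" and "Compressed".

```sas
/* Show the encoding of the current SAS session (UTF-8 on the new server). */
proc options option=encoding;
run;
%put NOTE: Session encoding is &sysencoding;

/* Show the data set header; look at the Encoding and Compressed rows. */
proc contents data=adam.adlb;
run;
```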

 

The problem could be solved by using the dataset option "encoding=any" or "encoding=utf8". But I would really like to know why the error only occurred in one dataset and not in others. Is this just something random? The dataset in question contained lab data and had symbols such as mu (the Wlatin1 version 'B5'x as opposed to ^{unicode 03BC}). Could that explain it?
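
As a sketch, the workaround can be applied as a data set option when reading the table. WORK.ADLB_RAW is a hypothetical output name. As far as I understand it, ENCODING=ANY tells SAS not to transcode at all, so it suppresses the error rather than converting the data: a Wlatin1 byte such as 'B5'x is passed through unchanged.

```sas
/* Assumption: libref ADAM is assigned; WORK.ADLB_RAW is a hypothetical name. */
data work.adlb_raw;
    /* ENCODING=ANY reads the stored bytes as-is, without transcoding. */
    set adam.adlb (encoding=any);
run;
```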

 

 

Super User
Posts: 9,687

Re: Error on UTF-8 server with dataset using Wlatin1

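Try the following code:

/* The CVP engine pads character column lengths so transcoded values fit. */
libname have cvp '/folders/myfolders/';

/* NOCLONE lets the copy take the encoding of the current (UTF-8) session. */
proc copy in=have out=work noclone;
    select ADLB;
run;
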
Contributor ckx
Posts: 54

Re: Error on UTF-8 server with dataset using Wlatin1

Hi Ksharp,

 

That's another workaround, but what I'm looking for is an explanation. Why is a dataset with Wlatin1 encoding a problem on a UTF-8 server? And why does this problem occur for one dataset but not for others?

Super User
Posts: 9,687

Re: Error on UTF-8 server with dataset using Wlatin1

As you pointed out before, that table must contain some characters that aren't compatible with your SAS session.

Different character sets contain different characters.
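
One way to find which values are involved is to scan the character variables for bytes above 127, since those are the non-ASCII ones. The following is only a sketch: the output data set WORK.NONASCII and the variables VARNAME, OBSNUM, VALUE, I and J are hypothetical names, and it assumes ADLB does not already contain variables with those names. ENCODING=ANY is used so that the stored bytes are read without transcoding.

```sas
/* Assumption: libref ADAM is assigned and ADLB has at least one character variable. */
data work.nonascii (keep=varname obsnum value);
    set adam.adlb (encoding=any);
    /* Declared before the new variables, so it covers only ADLB's own columns. */
    array cvars {*} _character_;
    length varname $32 value $200;
    obsnum = _n_;
    do i = 1 to dim(cvars);
        do j = 1 to lengthn(cvars{i});
            /* RANK gives the byte value; anything above 127 is non-ASCII. */
            if rank(substr(cvars{i}, j, 1)) > 127 then do;
                varname = vname(cvars{i});
                value = cvars{i};
                output;
                leave; /* one output row per variable per observation */
            end;
        end;
    end;
run;
```

A dataset that yields no rows here contains only ASCII text, which would explain why it reads without any transcoding message.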

Contributor ckx
Posts: 54

Re: Error on UTF-8 server with dataset using Wlatin1

Here's a follow-up to my post with an explanation of why some datasets with Wlatin1 encoding can produce an error message when read from a SAS configuration using the UTF-8 encoding.

 

First, some background. ASCII is a well-known encoding standard but, strictly speaking, it is restricted to the first 128 characters. "Wlatin1", referred to on Wikipedia as "Windows-1252", was devised by Microsoft and extends the ASCII encoding to 256 characters. Windows-1252 is itself derived from ISO/IEC 8859-1, except for the range 128 to 159 (hex 80 to 9F). And that's the key to the problem I encountered.

 

You see, the first 256 Unicode code points are the same as ISO 8859-1, so every ISO 8859-1 character maps directly onto Unicode and therefore onto UTF-8. That means most of Wlatin1 maps directly as well. If you look at the chart of character codes halfway down the page at https://en.wikipedia.org/wiki/Windows-1252, you can see the differences highlighted with a green outline. Characters such as the euro symbol "€", dagger "†" and trademark "™" occupy positions in Windows-1252 that ISO 8859-1 assigns to control codes, so they do not carry over to UTF-8 in the same way.

 

To test this, I created a simple permanent dataset with a single character.

 

data adam.char;
    thischar="€";
run;

I didn't have any trouble using 'thischar="µ";'. The UTF-8 configuration reported that transcoding had taken place but there was no error message. With the euro sign, however, I got the error message reported in my original post. The euro sign isn't a valid ISO 8859-1 character and it therefore can't be read by the UTF-8 configuration without further encoding instructions.

 

I hope this is clear enough. The bottom line is that the Wlatin1 encoding can be read using a UTF-8 configuration, except for characters in the range 128-159.

SAS Employee
Posts: 8

Re: Error on UTF-8 server with dataset using Wlatin1

SAS UTF-8 supports all of the characters that are available in the other encodings that SAS supports. If the characters are not ASCII, they must be transcoded to the UTF-8 representation. When you see a transcoding error in UTF-8, it usually means that one or more of the character columns in the original data set is not wide enough to hold all of the bytes needed for the UTF-8 version of the string.
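
To see the width difference, here is a small sketch that compares byte and character counts. It assumes it is run in a UTF-8 session and that the program text itself is stored as UTF-8. All variable names are made up for the illustration. In UTF-8 the euro sign takes three bytes (E2 82 AC) and µ takes two (C2 B5), whereas each is a single byte in Wlatin1, so a column sized for the Wlatin1 byte count can be too narrow after transcoding.

```sas
/* Assumes a UTF-8 session and UTF-8 program text. */
data _null_;
    euro  = "€";
    micro = "µ";
    /* LENGTH counts bytes; KLENGTH counts characters. */
    euro_bytes  = length(euro);
    euro_chars  = klength(euro);
    micro_bytes = length(micro);
    micro_chars = klength(micro);
    put euro_bytes= euro_chars= micro_bytes= micro_chars=;
run;
```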

 

You can use the SAS CVP engine in your UTF-8 session to pad the character columns in the data set. CVP is read-only, so you still need to copy the data set. Also, as you found, you will want the new data set to use the new SAS session attributes, including ENCODING. PROC COPY with the NOCLONE option is one way to achieve that. Another is to use the PROC DATASETS MODIFY statement with the OVERRIDE option.
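
A rough sketch of the CVP plus PROC COPY route follows. The librefs SRC and TGT and both paths are hypothetical placeholders. CVPMULTIPLIER=3 is chosen here as an assumption, on the grounds that no Wlatin1 character needs more than three bytes in UTF-8; a smaller multiplier may be enough for your data.

```sas
/* Hypothetical paths; replace with the real library locations. */
libname src cvp '/path/to/wlatin1/data' cvpmultiplier=3;
libname tgt '/path/to/utf8/data';

/* NOCLONE writes the copy with the session (UTF-8) encoding. */
proc copy in=src out=tgt noclone;
    select adlb;
run;

/* Confirm the new encoding and the widened character lengths. */
proc contents data=tgt.adlb;
run;
```

The padded lengths stay in the copied data set, so it may be worth reviewing them afterwards if column widths matter downstream.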

 

The National Language Reference Guide for 9.4 has some sample code showing how to use CVP. See the section "Avoiding Character Data Truncation by Using the CVP Engine" in the "Transcoding in NLS" chapter of the NLS Concepts section.
