Solved: Re: Importing text file with Unicode U+0099

Autotelic · Posted 11-02-2017 01:02 PM

I'm on SAS Enterprise Guide 7.11 HF5 (7.100.1.2856) (64-bit), SAS 9.4.

When importing a CSV file, I get the warning "WARNING: A character that could not be transcoded has been replaced in record 1726." for some rows (the record number at the end changes according to the row being read).

I was able to determine that the character causing the problem is the Unicode Character U+0099.
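For anyone wanting to locate such characters outside SAS, here is a minimal Python sketch. It assumes the target encoding is Windows-1252 (`cp1252` in Python), and the function name `find_untranscodable` is made up for illustration. It reports the record, column, and code point of every character that has no representation in the target encoding.

```python
import io


def find_untranscodable(stream, target_encoding="cp1252"):
    """Yield (record_number, column, code_point) for each character in the
    text stream that cannot be encoded in target_encoding."""
    for record_number, line in enumerate(stream, start=1):
        for column, ch in enumerate(line, start=1):
            try:
                ch.encode(target_encoding)
            except UnicodeEncodeError:
                yield record_number, column, f"U+{ord(ch):04X}"


# In-memory demo. For a real file, pass something like
# open(path, encoding="utf-16", newline="") instead (path is hypothetical).
sample = io.StringIO("id,name\n1,Acme\u0099\n2,Plain\n")
hits = list(find_untranscodable(sample))
```

Here `hits` contains one entry, for record 2, because the header line counts as record 1.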

According to Notepad++, the file is encoded in "UCS-2 LE BOM", and I'm importing the file into SAS using the encoding "UTF-16".

Can't SAS read this character? Am I not using the correct encoding option in SAS?

Am I doing something wrong? If I'm not, how can I get SAS to ignore this warning?

Tom · Posted 11-02-2017 01:58 PM

Connect to a SAS server process that is using UTF-8 as the encoding.

Do you have a SAS administrator to help you with this?

Autotelic · Posted 11-02-2017 02:49 PM

I'm on a local server.
Alas, I do not have access to an administrator.

Tom · Posted 11-02-2017 03:47 PM

Maybe @ChrisHemedinger can tell us how to change the ENCODING setting that the LOCAL server uses?

David_McNamara · Posted 11-02-2017 04:02 PM

Hi @Autotelic,

Could you tell me what encoding your local server is running with?
You can find that by going to the Servers list in Enterprise Guide, making sure that you are connected to your local server, and then right-clicking the local server node and opening the Properties dialog for the server. On the Software tab in the Properties dialog, you will see "SAS Session Encoding". That value is the server's encoding.

Thanks, David.

Autotelic · Posted 11-03-2017 05:49 AM

Hi, David.

It's wlatin1.

David_McNamara · Posted 11-03-2017 08:46 AM

Thanks @Autotelic.

OK, so here is what's happening... in order for us to be able to import the file into SAS, it has to be in an encoding that the SAS System will understand. If the SAS System encounters any characters that are not within its current encoding then it will throw a Transcoding Error and your job will stop. It kind of treats those as serious errors.

In order to prevent that from happening, the Import Data task reads the file in whatever encoding it's been told the file is in (either through information in the file or by you, the user, specifying the encoding to use) and checks to see if each character in the file maps directly to a matching character in the server's encoding. If there is no matching character, then it replaces it with a space character and puts that message you are seeing in the log.
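To make that behaviour concrete, here is a rough Python sketch of the same idea. It is not how the Import Data task is implemented, only an imitation of what is described above. It assumes Python's `cp1252` codec as a stand-in for WLATIN1, and `transcode_with_space` is a hypothetical helper name.

```python
def transcode_with_space(text, target_encoding="cp1252"):
    """Encode text character by character; any character with no match in
    target_encoding becomes a single space. Returns (bytes, replaced_count)."""
    out = bytearray()
    replaced = 0
    for ch in text:
        try:
            out += ch.encode(target_encoding)
        except UnicodeEncodeError:
            out += b" "      # no matching character: substitute a space
            replaced += 1    # this is where a warning would be logged
    return bytes(out), replaced


data, replaced = transcode_with_space("Acme\u0099 Ltd")
```

Characters that do exist in the target encoding, such as "é", pass through unchanged; only U+0099 is swapped for a space in this example.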

I've looked up the Unicode character U+0099 in a document on the unicode.org website and have found that it is simply described as a control code - it is not even named, as most of the recognized Unicode characters are. Normally, we'd be able to look up a character read from the UTF-16 LE file, find out its purpose, and then match it with a similarly named character in the server's encoding (WLATIN1 in your case), but we can't do that here. So I'm pretty certain that there will be no one-to-one match for that 'control code' character in WLATIN1 (or probably any other encoding for that matter).
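These properties can be checked with Python's standard `unicodedata` module, again assuming `cp1252` as the Python equivalent of WLATIN1. The last line shows something worth knowing: the single byte 0x99 in Windows-1252 is the trade mark sign. So one possible explanation, and it is only a guess about this particular file, is that a trade mark sign was mis-converted to U+0099 somewhere upstream.

```python
import unicodedata

ch = "\u0099"
category = unicodedata.category(ch)   # general category of the code point
name = unicodedata.name(ch, None)     # None when the code point has no name

# Can the code point U+0099 be written in Windows-1252?
try:
    ch.encode("cp1252")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# The *byte* 0x99 in Windows-1252 is a different thing entirely.
byte_99_as_cp1252 = b"\x99".decode("cp1252")
byte_99_name = unicodedata.name(byte_99_as_cp1252)
```

If the trade mark sign was intended, replacing U+0099 with U+2122 in the source file before importing would preserve it, since U+2122 does have a match in Windows-1252.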

So I'm pretty sure that the Import Data task was doing the right thing in removing what would otherwise have been a transcoding error from the file.

I hope this explanation helps.

David.

Autotelic · Posted 11-03-2017 08:56 AM

Thanks, David. I understood everything.
Is there a way to suppress the warning for this specific character only?

David_McNamara · Posted 11-03-2017 09:04 AM

Unfortunately, no, there isn't a way to suppress the message. Because we are changing your data, we want to make sure that you are aware that it has been done, so we always produce that message.
The only thing I could suggest would be to use Notepad++ to search for the character in your data file and change it to a space before importing the file with EG.
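If the file is too large to edit by hand, or the cleanup has to be repeated, the same search-and-replace could be scripted. Below is a minimal Python sketch, assuming the file is UTF-16 with a BOM as Notepad++ reports. The helper name `replace_char_in_file` and the file names are hypothetical.

```python
import os
import tempfile


def replace_char_in_file(src, dst, bad="\u0099", replacement=" ",
                         encoding="utf-16"):
    """Copy src to dst, replacing every occurrence of bad.
    Returns the number of replacements made."""
    # newline="" keeps the original line endings untouched.
    with open(src, encoding=encoding, newline="") as f:
        text = f.read()
    count = text.count(bad)
    # The "utf-16" codec writes a BOM in the machine's native byte order
    # (little-endian on typical Windows PCs).
    with open(dst, "w", encoding=encoding, newline="") as f:
        f.write(text.replace(bad, replacement))
    return count


# Self-contained demo on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.csv")
    dst = os.path.join(tmp, "out.csv")
    with open(src, "w", encoding="utf-16", newline="") as f:
        f.write("id,name\r\n1,Acme\u0099\r\n")
    replaced = replace_char_in_file(src, dst)
    with open(dst, encoding="utf-16", newline="") as f:
        cleaned = f.read()
```

Writing to a separate output file leaves the original untouched, so the import can be pointed at the cleaned copy and the source kept for reference.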
