Re: strange character appear after conversion from sas dataset to csv

HeatherNewton · Posted 08-21-2023 09:07 AM

I converted a SAS dataset to CSV using encoding utf8, and it looks fine when shown in Notepad. When I try to load it into DB2, strangely the first character of each resulting CSV is a strange character (U+FEFF). Could you let me know why this happens?

Tom · Posted 08-21-2023 09:24 AM

That is the BYTE ORDER MARK or BOM.

You should check the DB2 commands you are using to see how you can get them to ignore the BOM.

Otherwise just don't include the BOM when writing the file, as explained in this question from a few years ago.

https://communities.sas.com/t5/SAS-Programming/Write-a-file-in-UTF-8-without-BOM/td-p/561069

HeatherNewton · Posted 08-21-2023 10:29 AM

Does it only happen at the first character, as one additional character? Can I remove it so that the rest of the file is the same as the original?

Tom · Posted 08-21-2023 12:05 PM

The BOM is at the start of the file. Normal code should recognize it and IGNORE it (and also use it to decide how to interpret the rest of the file).
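To make the point above concrete, here is a minimal Python sketch (not SAS or DB2 code) of what a BOM-aware reader does differently from a naive one. The sample CSV content is made up for illustration; the `utf-8` and `utf-8-sig` codecs are standard Python.

```python
import codecs

# The UTF-8 BOM is the code point U+FEFF encoded as three bytes.
BOM = codecs.BOM_UTF8  # b'\xef\xbb\xbf'

# Hypothetical CSV content written with a leading BOM.
raw = BOM + "id,name\n1,Zoë\n".encode("utf-8")

# A plain UTF-8 decode keeps U+FEFF as the first character,
# which is what a loader that does not know about BOMs sees.
plain = raw.decode("utf-8")

# The "utf-8-sig" codec recognizes the BOM and drops it.
aware = raw.decode("utf-8-sig")

print(repr(plain[0]))   # '\ufeff'
print(repr(aware[0]))   # 'i'
```

Only the first three bytes differ; everything after the BOM is identical in both decodes.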

If you set the SAS system option NOBOMFILE before you create the CSV file, then the BOM will not be written to the file at all. That should allow your confused DB2 load program to load the data.

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/nlsref/n0ovum2tekkxadn1jel1gkpj5wx8.htm
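If regenerating the files from SAS is not possible, the BOM can also be removed from an existing file, since it is only the first three bytes. The following is a rough sketch in Python rather than SAS; the function name `strip_utf8_bom` is made up for this example, and it reads the whole file into memory, so it assumes the CSV files are not huge.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path):
    """Remove a leading UTF-8 BOM from the file in place.

    Returns True if a BOM was found and removed, False otherwise.
    """
    p = Path(path)
    data = p.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        # Rewrite the file without the first three bytes; the rest is untouched.
        p.write_bytes(data[len(codecs.BOM_UTF8):])
        return True
    return False
```

Calling it on a file that has no BOM leaves the file unchanged, so it should be safe to run over a whole folder of CSV files. Keep a backup copy before rewriting files in place.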

HeatherNewton · Posted 08-21-2023 12:27 PM

I can’t convert again, so I must change the DB2 command or remove them.

Something worries me: before, I used encoding ‘any’ to convert SAS datasets to CSV,

but some columns could not be shown in DB2 due to a character conversion problem, so I used utf8. Now I am worried that some data that originally showed fine will not show because I am not using encoding ‘any’. What kind of encoding would be a problem? I remember seeing different data formats in some SAS programs, but I would guess most should work with utf8?

Tom · Posted 08-21-2023 12:41 PM

You will need to discuss with your database administrator how DB2 is configured to handle UTF-8 encoded text (or if it even can).

The original ASCII encoding used only 7 bits, so there were only 128 possible characters that could be represented. Of those, the first 32 and the last 1 were used for non-printable control characters.

As people started using computers for more than just crunching numbers, they first expanded to using all 8 bits of a byte to represent characters. So now you had another 128 characters that could be encoded. But which characters should be added? Some encodings, like WLATIN1, add characters from "western" languages, like French and Spanish. Others used the extra characters for mathematical symbols, etc.

The UTF-8 coding scheme uses multiple bytes for some characters. The original 128 7-bit ASCII characters are the same, but for the other characters it uses 2, 3 or even 4 bytes to store them. This allows for more than a million characters to be represented. But it makes dealing with character strings more complex.
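A quick way to see the variable byte lengths is to encode a few characters and count the bytes. This is an illustrative Python sketch; the sample characters are arbitrary picks, one for each length.

```python
# One sample character for each UTF-8 length (1 to 4 bytes).
samples = ["A", "é", "€", "😀"]

# Map each character to the number of bytes UTF-8 needs for it.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} {ch!r}: {n} byte(s)")
# U+0041 'A': 1 byte(s)
# U+00E9 'é': 2 byte(s)
# U+20AC '€': 3 byte(s)
# U+1F600 '😀': 4 byte(s)
```

This is also why a column defined by byte length can overflow after conversion to UTF-8: the character count stays the same, but the byte count can grow.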

Conversion from any particular single byte encoding, such as WLATIN1, to UTF-8 is simple. But trying to convert from UTF-8 to some other single byte encoding might fail if none of the 256 codes in that single byte encoding represents a character that is in the UTF-8 string.
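The asymmetry can be sketched in Python. This assumes that Python's `cp1252` codec is a reasonable stand-in for SAS WLATIN1 (Windows-1252); the helper name `fits_single_byte` and the sample strings are made up for this example.

```python
def fits_single_byte(text, encoding="cp1252"):
    """Return True if every character in text exists in the single byte encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Single byte -> UTF-8: each character in the string has a UTF-8 form.
western = "café"
as_cp1252 = western.encode("cp1252")                   # 4 bytes, one per character
as_utf8 = as_cp1252.decode("cp1252").encode("utf-8")   # 5 bytes, 'é' takes two

# UTF-8 -> single byte: only works if the characters exist in the target.
print(fits_single_byte("café"))     # True
print(fits_single_byte("日本語"))    # False
```

A check like this, run against the target database code page, is one way to find out in advance which values would be lost or replaced during a load.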
