HeatherNewton
Quartz | Level 8

I converted a SAS dataset to CSV using encoding utf8, and it looks fine when opened in Notepad. But when I try to load it into DB2, the first character of each resulting CSV is a strange character, u[feff]'. Could you let me know why this happens?

Tom
Super User

That is the BYTE ORDER MARK or BOM.

You should check the DB2 commands you are using to see how you can get them to ignore the BOM.

 

Otherwise, just don't include the BOM when writing the file, as explained in this question from a few years ago:

https://communities.sas.com/t5/SAS-Programming/Write-a-file-in-UTF-8-without-BOM/td-p/561069

 

 

HeatherNewton
Quartz | Level 8

Does it only appear as one additional character at the start of the file, so that I can remove it and the rest of the file stays the same as the original?

Tom
Super User

The BOM is at the start of the file.  Normal code should recognize it and IGNORE it (and also use it to decide how to interpret the rest of the file).
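
To make this concrete, here is a minimal sketch in Python (used only for illustration; it is not part of the SAS or DB2 tooling) of what the BOM looks like as bytes and how a BOM-aware decoder drops it. The sample CSV content is made up.

```python
import codecs

# Made-up sample: a UTF-8 CSV as a BOM-writing program would produce it.
raw = codecs.BOM_UTF8 + "name,city\nAnna,Zürich\n".encode("utf-8")

# The UTF-8 BOM is the three bytes EF BB BF at the very start of the file.
has_bom = raw.startswith(codecs.BOM_UTF8)

# A plain UTF-8 decoder keeps the BOM as the character U+FEFF ...
plain = raw.decode("utf-8")
# ... while the BOM-aware "utf-8-sig" codec recognizes and skips it.
aware = raw.decode("utf-8-sig")

print(has_bom)                # True
print(repr(plain[0]))         # '\ufeff'
print(aware.split("\n")[0])   # name,city
```

Apart from that one leading character, the two decoded strings are identical, so the BOM affects only the very start of the file.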

 

If you set the SAS system option NOBOMFILE before you create the CSV file, then the BOM will not be written to the file at all.  That should allow your confused DB2 load program to load the data.

 

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/nlsref/n0ovum2tekkxadn1jel1gkpj5wx8.htm

HeatherNewton
Quartz | Level 8

I can't convert the datasets again, so I must either change the DB2 command or remove the BOMs from the files.
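
For files that already exist, one option is to strip the BOM outside of SAS. The following is a minimal Python sketch, not a DB2 or SAS feature; the function name `strip_utf8_bom` is hypothetical. It removes only the three leading BOM bytes and copies the rest of the file byte for byte.

```python
import codecs
import os
import shutil
import tempfile


def strip_utf8_bom(path):
    """Remove a leading UTF-8 BOM from the file at `path`, if present.

    Returns True if a BOM was removed, False if the file was left alone.
    """
    with open(path, "rb") as src:
        if src.read(len(codecs.BOM_UTF8)) != codecs.BOM_UTF8:
            return False  # no BOM, nothing to do
        # Copy everything after the BOM into a temporary file in the same folder.
        folder = os.path.dirname(os.path.abspath(path))
        fd, tmp_path = tempfile.mkstemp(dir=folder)
        with os.fdopen(fd, "wb") as dst:
            shutil.copyfileobj(src, dst)
    os.replace(tmp_path, path)  # swap the cleaned copy into place
    return True
```

Calling `strip_utf8_bom("export.csv")` (a placeholder file name) on each CSV would leave the data unchanged apart from the missing BOM. It is worth trying it on a copy of one file first.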

 

Something worries me. Before, I used encoding 'any' to convert SAS datasets to CSV,

but some columns could not be shown in DB2 because of a character conversion problem, so I switched to utf8. Now I am worried that some data that originally showed fine will no longer show, because I am not using encoding 'any'. What kind of encoding would be a problem? I remember seeing different data formats in some SAS programs, but I would guess most should work with utf8?

Tom
Super User

You will need to discuss with your database administrator how DB2 is configured to handle UTF-8 encoded text (or if it even can).

 

The original ASCII encoding used only 7 bits, so there were only 128 possible characters that could be represented.  Of those, the first 32 (codes 0 to 31) and the last one (code 127) were used for non-printable control characters.
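
The split between control and printable characters in the 7-bit range can be checked with a short Python sketch, which relies on Python's own definition of "printable" in `str.isprintable()`:

```python
# Code points 0-127 are the 7-bit ASCII range.
non_printable = [c for c in range(128) if not chr(c).isprintable()]

print(len(non_printable))                      # 33
print(non_printable[:3], non_printable[-2:])   # [0, 1, 2] [31, 127]
```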

 

As people started using computers for more than just programming numbers, they first expanded to using all 8 bits of a byte to represent characters.  So now you had another 128 characters that could be encoded.  But which characters should be added?  Some encodings, like WLATIN1, add characters from "western" languages, like French and Spanish.  Others used the extra characters for mathematical symbols, etc.
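
A small Python sketch shows why this matters: the same byte means a different character under each 8-bit encoding. It assumes that SAS's WLATIN1 corresponds to Python's `cp1252` codec; the other two code pages are picked only for contrast.

```python
byte = b"\xe9"  # one byte with the high bit set, outside 7-bit ASCII

as_cp1252 = byte.decode("cp1252")  # Windows Western European
as_cp437 = byte.decode("cp437")    # original IBM PC code page
as_cp1251 = byte.decode("cp1251")  # Windows Cyrillic

print(as_cp1252, as_cp437, as_cp1251)  # é Θ й
```

If a program guesses the wrong encoding for a file, characters above code 127 come out wrong while plain ASCII text still looks fine.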

 

The UTF-8 encoding scheme uses multiple bytes for some characters.  The original 128 7-bit ASCII characters are stored unchanged as single bytes, but other characters take 2, 3, or even 4 bytes.  This allows every Unicode character to be represented, not just 256.  But it makes dealing with character strings more complex, because the number of bytes no longer equals the number of characters.
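
As an illustration, this Python sketch encodes a few sample characters (chosen arbitrarily) and prints how many bytes each takes in UTF-8:

```python
samples = ["A", "é", "€", "😀"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]

# Character count and byte count differ once non-ASCII text is involved.
word = "Zürich"
print(len(word), len(word.encode("utf-8")))  # 6 7
```

The last line is the practical consequence for fixed-width columns: a field sized for 6 bytes cannot hold this 6-character value in UTF-8.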

 

Conversion from any particular single-byte encoding, such as WLATIN1, to UTF-8 is simple.  But converting from UTF-8 to a single-byte encoding can fail if none of the 256 codes in that encoding represents a character that appears in the UTF-8 string.
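
The asymmetry can be sketched in Python, again assuming `cp1252` as a stand-in for WLATIN1 and using made-up sample text:

```python
# Single-byte -> UTF-8: every cp1252 character has a Unicode code point.
legacy_bytes = "café".encode("cp1252")
utf8_bytes = legacy_bytes.decode("cp1252").encode("utf-8")

# UTF-8 -> single-byte: fails when a character has no slot in the target.
text = "café 日本"
try:
    text.encode("cp1252")
    failed = False
except UnicodeEncodeError:
    failed = True

# A lossy fallback replaces each unmappable character with "?".
lossy = text.encode("cp1252", errors="replace")

print(utf8_bytes)   # b'caf\xc3\xa9'
print(failed)       # True
print(lossy)        # b'caf\xe9 ??'
```

So data exported as UTF-8 keeps everything the single-byte source could hold; the risk lies on the receiving side if the target column or database uses a narrower encoding.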
