AndrewZ
Quartz | Level 8

In a SAS Unicode session, data from Snowflake is coming across as iso-8859-1, even though the SAS data set is marked as Unicode (UTF-8).

 

I connect using

libname mysf odbc
complete="DRIVER={SnowflakeDSIIDriver};SERVER=...";

I can use PROC DATASETS with CORRECTENCODING to change the data set's encoding to iso-8859-1, and then compatible characters (like Spanish and German) look okay. However, incompatible characters (like Japanese) are always mojibaked or missing.
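
For reference, this is roughly the PROC DATASETS step I mean (the WORK member name "have" is just a placeholder for a table pulled down from Snowflake):

proc datasets lib=work nolist;
   /* CORRECTENCODING only rewrites the encoding attribute; the stored bytes are not transcoded */
   modify have / correctencoding='latin1';
quit;

Because this only relabels the data set, SAS afterwards reads the bytes as latin1, which is why latin1-compatible characters come out right and everything else stays broken.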

 

 

I am on SAS Unicode 32-bit and Windows 10 Enterprise.

 

NOTE: SAS (r) Proprietary Software 9.4 (TS1M7 MBCS3170)
NOTE: This session is executing on the W32_10PRO platform.

 

Snowflake 32-bit ODBC driver: 2.25.04.00

 

I would rather not pay for a Snowflake connector license just for this one feature. 

11 REPLIES
LinusH
Tourmaline | Level 20

I'm a bit confused, since Snowflake claims that the data is always transferred as UTF-8 by their ODBC driver.

https://community.snowflake.com/s/article/Does-the-Snowflake-ODBC-Driver-support-non-Latin-character...

 

What is the actual ENCODING setting in your SAS session?

Make sure that you download the data to a clean location to avoid an encoding mismatch.

Data never sleeps
AndrewZ
Quartz | Level 8

@LinusH 

"proc options option=encoding; run;" tells me my encoding is UTF-8, and when I check the properties of the SAS data set under WORK (created by querying Snowflake), the encoding is UTF-8 too.

 
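For reference, the code equivalent of those checks is something like this (the WORK member name is just a placeholder for the data set created by the Snowflake query):

proc options option=encoding;   /* session encoding */
run;
proc contents data=work.have;   /* the Encoding attribute of the data set appears in the output */
run;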

What do you mean by a clean location? I am creating a v9 SAS data set in WORK.

 

LinusH
Tourmaline | Level 20
Yes, that is as clean as it gets.
Following @Tom here, SAS Support could be your next step.
Another thing to try is to open this data from another application on your computer, like Excel, and compare.
Data never sleeps
Tom
Super User

Sounds like the data in the Snowflake database is not UTF-8?

Try a small query with some characters that you know are non-7bit ASCII characters and print the actual hex codes.

data test;
  set mysf.have(encoding=any);
  put (_character_) (=$hex. /);
run;

 

AndrewZ
Quartz | Level 8

I created a Unicode table in Snowflake via DBeaver with this code

create or replace TABLE zzz_unicode ( language_code int, text varchar(100) );
insert into zzz_unicode (language_code, text) values
(1, 'Ich kann Glasssplitter essen, es tut mir nicht weh'),
(2, 'Je peux manger du verre, ça ne me fait pas mal'),
(3, 'Posso mangiare vetro, non mi fa male'),
(4, 'Eu posso comer vidro, não me faz mal'),
(5, 'Puedo comer vidrio, no me hace daño'),
(6, 'Я могу есть битое стекло, оно мне не вредит'),
(7, 'ฉันสามารถกินแก้วแตกได้ มันไม่ทำให้ฉันเจ็บปวด'),
(8, '私は割れたガラスを食べることができます、それは私を傷つけません'),
(9, 'እኔ የተሰነጠቀ ብረት መብላት እችላለሁ፣ አይጎዳኝም'),
(10, 'ငါ ብስጭት መብላት እችላለሁ, ጎጂ አይደለም');
insert into zzz_unicode (language_code, text) values
(11, 'Я могу есть битое стекло, оно мне не вредит'),
(12, '私は割れたガラスを食べることができます、それは私を傷つけません'),
(13, '我可以吃碎玻璃,它不会伤害我');

select *
from zzz_unicode
;

This shows how the table looks in DBeaver (looks good) vs SAS PROC PRINT (looks bad).

[Attached screenshot "I can eat broken glass.png": the table rendered correctly in DBeaver vs. mojibake in SAS PROC PRINT]

 

Here is the SAS log for the HEX test you suggested. The character 1A is SUB, and 20 is space, so we see the mojibake in HEX too. Also, it doesn't like encoding=any.

 

348
349 data test;
350 set jet.zzz_unicode(encoding=any);
--------
76
WARNING 76-63: The option ENCODING is not implemented in the ODBC engine.

351 if language_code in (6,13);
352 put (text) (=$hex. /);
353 run;

TEXT=1A201A1A1A1A201A1A1A1A201A1A1A1A1A201A1A1A1A1A1A2C201A1A1A201A1A1A201A1A201A1A1A1A1A1A20202020202020202020202020202020202020202020202020202020202
0202020202020202020202020202020202020202020202020202020
TEXT=1A1A1A1A1A1A1A1A1A1A1A1A1A1A202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202
0202020202020202020202020202020202020202020202020202020
AndrewZ
Quartz | Level 8

@Tom 

How would I change the ODBC definition for Snowflake? The URL you posted was about PostgreSQL, and in the last comment, the author (not the original poster) wrote that he solved it by adding "SET CLIENT_ENCODING TO 'UTF8'" to his connection string, but I see no such option for Snowflake.

 

@LinusH posted a link to Snowflake documentation that states: "the Snowflake ODBC Driver will always use UTF-8 encoding for any STRING/VARCHAR/VARIANT type column data." That is exactly what I need, and it cannot be changed. I also do not see any encoding-related options in the Snowflake session parameters or ODBC parameters, which is consistent with the documentation's statement that the ODBC driver will always use UTF-8: there is simply nothing to change.

 

Because Snowflake always uses UTF-8, I conclude that the problem, whether a SAS configuration issue or a SAS bug, is in SAS's ODBC integration.

Tom
Super User

Did you raise a ticket with SAS support?

I am also surprised to see that you are using a 32-bit version of SAS.

Do you have access to a 64-bit version of SAS and 64-bit ODBC drivers for Snowflake that you can test?

Tom
Super User

The other workaround is to dump the data to a text file and then read the text file using a data step.

 

Looks like Snowflake has a COPY TO command for generating text files from data.

https://docs.snowflake.com/en/sql-reference/sql/copy-into-location
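
Untested sketch, but the unload could probably be pushed through the existing ODBC libname with explicit pass-through; the stage path and file-format options below are only placeholders:

proc sql;
   connect using mysf;   /* reuse the ODBC libref from the original post */
   /* @~ is the Snowflake user stage; a named stage or S3 location would work too */
   execute (
      copy into @~/unload/zzz_unicode.csv
      from zzz_unicode
      file_format = (type = csv compression = none field_optionally_enclosed_by = '"')
      single = true
      overwrite = true
   ) by mysf;
quit;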

 

Steps would be:
1) Generate delimited text file.

2) Make the text file available to SAS. Might involve copying the file from S3 to work directory.

3) Create dataset with same structure as original table (could be as simple as using OBS=0 dataset option) and then read in the text file.

* Step 3 ;
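* Assumes the unloaded text file has already been copied to the local machine and assigned to the CSV fileref, for example: ;
filename csv 'C:\temp\zzz_unicode.csv' encoding='utf-8';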
data want;
   if 0 then set snow.mytable (obs=0);
   infile csv dsd truncover ;
   input (_all_) (+0);
run;

For large files you might find that it works much faster than trying to suck data through that tiny ODBC straw.

AndrewZ
Quartz | Level 8

OK, I opened a ticket with SAS support. Let's see how that goes.

 

The COPY TO is a creative suggestion. I already do the opposite for bulk loading data into Snowflake, so if I get desperate, I will check into that.

 

I could also write a Python program to read the data from Snowflake and pass it to SAS. Python can write SAS format via sas7bdat, but I am guessing it doesn't support Unicode. Maybe XLSX or Access would be the easiest way. This doesn't seem fun, either.

Quentin
Super User

Interesting thread, thanks.  Looking forward to hearing what SAS support says.  Also just want to say, XLSX is never the answer, and it's usually the problem. 😀

 

To the big picture, have you tried using Snowflake's ODBC driver with some other client (e.g. Python, or, OK, even Excel) to see what comes down? If other clients have the same problem, then I think you've got to point back to Snowflake support. Just because their docs say they always serve UTF-8 doesn't mean they actually do.

BASUG is hosting free webinars. Next up: Mark Keintz presenting History Carried Forward, Future Carried Back: Mixing Time Series of Differing Frequencies on May 8. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.


Discussion stats
  • 11 replies
  • 946 views
  • 1 like
  • 4 in conversation