Quentin, the iso-8859-1 is not a great workaround for me either.
If you are desperate, an ugly hack is to use a pass-through query to tell Snowflake to write to a text file: see COPY INTO location and GET. Then in SAS use a regular PROC IMPORT. I haven't tried this exactly, but I do the exact opposite to bulk load large data sets quickly from SAS to Snowflake.
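Roughly, and untested for this export direction (the stage, paths, and the SNOW libref below are placeholders), it would look something like:
proc sql;
   connect using snow as db;   /* SNOW = an existing Snowflake libref */
   /* Unload the table to a UTF-8 text file in an internal stage */
   execute (copy into @my_stage/krak/
              from mydb.myschema.krak
              file_format = (type = csv compression = none)
              header = true) by db;
   /* Download the staged file(s) to the SAS machine */
   execute (get @my_stage/krak/ file://C:\temp\krak\) by db;
   disconnect from db;
quit;
/* Read the exported UTF-8 CSV back in; the output file name pattern is illustrative */
proc import datafile="C:\temp\krak\data_0_0_0.csv"
            out=work.krak dbms=csv replace;
run;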
If you have access to Snowflake support (yourself or via IT), file a ticket with Snowflake.
A month later, I'm back to playing with this. No big insights.
I start with a value, A.Ö. Krankenhaus, stored in Snowflake.
In a UTF-8 SAS session, when I pull that value from Snowflake, it shows that the O with umlaut is encoded as "D6"x (using pictures instead of text because I want to avoid the browser re-encoding things):
The SAS dataset is tagged as UTF-8 encoded, but "D6"x is not the proper UTF-8 encoding for O with umlaut. As per https://www.ascii-code.com/character/%C3%96 it should be encoded as "C396"x. "D6"x is the ISO-8859-1 encoding. So the above does not show O with umlaut, it shows a box.
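(In case anyone wants to reproduce the check, this is roughly how I dump the raw bytes; WORK.KRAK2 and TEXT_SAMPLE are placeholder names:)
data _null_;
   set work.krak2 (obs=1);
   put text_sample= $hex40.;   /* "D6"x = iso-8859-1 for O umlaut, "C396"x = UTF-8 */
run;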
If I use your PROC DATASETS "trick" to tell SAS that the dataset is not encoded as UTF-8 but is actually encoded as iso-8859-1:
proc datasets library=work noprint;
modify krak2 / correctencoding='iso-8859-1';
quit;
Then everything works:
That shows that CEDA (Cross-Environment Data Access) was used. So I guess that since I'm in a UTF-8 SAS session and SAS now knows the data are stored as iso-8859-1, it is translating the "D6"x value (the iso-8859-1 encoding for O umlaut) into "C396"x, which is the UTF-8 encoding.
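(A small illustration of that translation, using KCVT here just to mimic what CEDA does:)
data _null_;
   length utf8_val $4;
   utf8_val = kcvt('D6'x, 'latin1', 'utf-8');   /* convert the iso-8859-1 byte to UTF-8 */
   put utf8_val $hex4.;                         /* expect C396 */
run;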
So I agree with your conclusion, SAS is somehow receiving iso-8859-1 data from the snowflake ODBC driver, even though it should be sending UTF-8.
I then used the same ODBC driver to read the data into PowerBI, and everything seems to have worked fine. Which makes me wonder if the problem might be on the SAS side.
A year ago I tested with the ODBC Test utility found in MDAC, but the Unicode was corrupt, so I concluded SAS could not do any better. Today I tested with Excel and Python both connected to an ODBC DSN, and the Unicode looked great, so I will try again with SAS support.
Thanks Andrew. I feel like if there were a problem with the ODBC driver Snowflake provides, there would be LOTS of people complaining over at Stack Overflow, etc. When I search for this issue, I don't see many complaints. And my guess is that among SAS folks reading from Snowflake, most have SAS/ACCESS to Snowflake, so they might not have been tripped up by this issue. I feel like it's probably a SAS problem. They probably won't be able to fix it any time soon, but maybe some brilliant person in tech support will come up with an interesting workaround.
What happens when the UTF-8 character in Snowflake is NOT one that can be mapped into that single byte encoding? What does that map into in SAS? What do your other methods of accessing Snowflake retrieve?
Here are screenshots from yesterday comparing SAS Unicode mode and Excel reading the same Snowflake table via the Snowflake ODBC driver. (Excel used a DSN, while in SAS Unicode mode I tested both DSN and connection string with no difference in results.) Characters such as CJK, Russian, and Thai show as boxes in SAS.
Yesterday I also tested Python with pyodbc to Snowflake via DSN, and the result was correct.
A year ago I tested the ODBC Utility in Microsoft's MDAC, but the Unicode was corrupt. A year ago, I also tested Excel, and the Unicode was corrupt. I am not sure what changed for Excel, but it worked better yesterday than a year ago.
Hi @AndrewZ
Can you try running this macro against your zzz_unicode Snowflake table?
/****************************************************************************************************
File name: db_getmaxtextcolslengths.sas
Parent App:
File Type: Macro function
Purpose: Generate a SAS data set containing the required SAS column lengths for a specified Snowflake table
Inputs: p_inLibRef
p_inDbTableName
p_inDbSchema
p_inDbCatalog
p_outDsName
Outputs: Required SAS column lengths for the specified Snowflake table
****************************************************************************************************/
%MACRO db_getMaxTextColsLengths(
p_inLibRef=
, p_inDbTableName=
, p_inDbSchema=
, p_inDbCatalog=
, p_outDsName=);
%LOCAL
l_outLibName
l_outDsName
l_dbCat
l_dbSchema
l_dbTable
l_colNamesDS
;
%let l_outLibName = WORK;
%let l_outDsName = &p_outDsName;
%if (%INDEX(&p_outDsName, %STR(.)) GT 0) %then
%do;
%let l_outLibName = %SCAN(&p_outDsName,1,%STR(.));
%let l_outDsName = %SCAN(&p_outDsName,2,%STR(.));
%end;
/* Add required single quotes */
%let l_dbCat = %str(%')&p_inDbCatalog.%str(%');
%let l_dbSchema = %str(%')&p_inDbSchema.%str(%');
%let l_dbTable = %str(%')&p_inDbTableName.%str(%');
%let l_colNamesDS = WORK._tmp_col_names;
PROC SQL buffersize=50M noprint noerrorstop exec;
connect using &p_inLibRef as db;
CREATE TABLE &l_colNamesDS AS
SELECT
snow_col_len length=100 FORMAT=$100.
,sas_col_len length=200 FORMAT=$200.
,unicode_char_cnt length=200 FORMAT=$200.
,col_name
,col_type
FROM connection to db
(
SELECT
CASE WHEN DATA_TYPE <> 'TEXT' THEN ', '||TO_CHAR(COALESCE(NUMERIC_PRECISION,DATETIME_PRECISION,9),'FM999')||' ' ELSE
', MAX(LENGTH('||COLUMN_NAME||'))' END
|| ' AS snow_V' ||TO_CHAR(ORDINAL_POSITION,'FM999')||'_len' AS snow_col_len
,CASE WHEN DATA_TYPE <> 'TEXT' THEN ', 8' ELSE
', MAX(LENGTH('||COLUMN_NAME||') + (LENGTH(REGEXP_REPLACE('||COLUMN_NAME||', ''[\\x20-\\x7F]'', ''''))* 3))' END
|| ' AS sas_V' ||TO_CHAR(ORDINAL_POSITION,'FM999')||'_len' AS sas_col_len
,CASE WHEN DATA_TYPE <> 'TEXT' THEN ', 0' ELSE
', MAX(LENGTH(REGEXP_REPLACE('||COLUMN_NAME||', ''[\\x20-\\x7F]'', '''')))' END
|| ' AS V' ||TO_CHAR(ORDINAL_POSITION,'FM999')||'_unicode_char_cnt' AS unicode_char_cnt
,COLUMN_NAME as col_name
,DATA_TYPE as col_type
FROM
"&p_inDbCatalog"."INFORMATION_SCHEMA"."COLUMNS"
WHERE TABLE_CATALOG = &l_dbCat
AND TABLE_SCHEMA = &l_dbSchema
AND TABLE_NAME = &l_dbTable
/*AND DATA_TYPE = 'TEXT'*/
ORDER BY ORDINAL_POSITION
);
disconnect from db;
QUIT;
FILENAME dyncode TEMP;
DATA _NULL_;
FILE dyncode lrecl=300;
PUT '/* Data step generated PROC SQL call */';
PUT 'PROC SQL; ';
PUT +3 "CONNECT USING &p_inLibRef AS db; ";
PUT +3 "CREATE TABLE &p_outDsName AS ";
PUT +3 'SELECT * ';
PUT +3 'FROM connection to db ( ';
PUT +6 'SELECT ';
PUT +9 "'&p_inDbCatalog..&p_inDbSchema..&p_inDbTableName' AS Full_table_name ";
DO UNTIL (eof);
SET &l_colNamesDS end=eof;
PUT +9 snow_col_len ;
PUT +9 sas_col_len ;
PUT +9 unicode_char_cnt;
END;
PUT +6 'FROM "&p_inDbCatalog"."&p_inDbSchema"."&p_inDbTableName" ); ';
PUT +3 'disconnect from db; ';
PUT 'QUIT; ';
PUT ;
PUT "DATA &p_outDsName(KEEP=Full_table_name col_name col_type snow_col_len sas_col_len max_unicode_char_cnt) ; ";
PUT +3 "if (0) then SET &l_colNamesDS (KEEP=col_name col_type); ";
PUT +3 "SET &p_outDsName; ";
PUT +3 'RETAIN Full_table_name; ';
PUT +3 'Array saslens {*} 5 sas_: ; ';
PUT +3 'Array snowLens {*} 5 snow_: ; ';
PUT +3 'Array uniLens {*} 5 v: ; ';
PUT +3 'LENGTH snow_col_len sas_col_len max_unicode_char_cnt 5; ';
PUT;
PUT +3 'do _rec = 1 by 1 until(eof); ';
PUT +6 "SET &l_colNamesDS (KEEP=col_name col_type) end=eof; ";
PUT +6 'snow_col_len = snowLens[_rec]; ';
PUT +6 'sas_col_len = sasLens[_rec]; ';
PUT +6 'max_unicode_char_cnt = uniLens[_rec]; ';
PUT +6 'OUTPUT; ';
PUT +3 'end; ';
PUT +3 'STOP; ';
PUT 'RUN; ';
RUN;
%include dyncode;
%MEND db_getMaxTextColsLengths;
%db_getMaxTextColsLengths(
p_inLibRef=<SnowFlake-Libref>
, p_inDbCatalog=<SnowFlake-Catalog>
, p_inDbSchema=<SnowFlake-Schema>
, p_inDbTableName=zzz_unicode
, p_outDsName=<LIB.>SAS-DS-NAME)
Examine/open the generated data set to see how long the SAS character variable(s) have to be to properly represent the contents. The SAS variable(s) will most definitely require larger lengths.
Hope this helps,
Ahmed
Hi @AhmedAl_Attar. Understood that encoding data in UTF-8 will require longer variables, but I think SAS should handle that gracefully in a SAS session with UTF-8 encoding.
More to the point, the issue does not seem to be that variables are being truncated. Instead, it looks like the SAS dataset, which has a dataset attribute stating it is UTF-8 encoded, has values that are actually iso-8859-1 encoded. Please see this post: https://communities.sas.com/t5/SAS-Programming/Reading-Unicode-from-Snowflake-into-SAS-via-ODBC/m-p/.... I think that post shows that the O-umlaut character is encoded incorrectly in the SAS dataset, but the values are not being truncated.
Is your session encoding UTF-8?
proc options option = encoding;
run;
Quentin, I did extensive testing on this issue in SAS and other tools. In SAS, I used SAS Unicode (utf-8) and SAS English (wlatin1).
My workaround in SAS Unicode is to run PROC DATASETS like below every time I pull in data from Snowflake, but it only gives me iso-8859-1, which seems to be a limitation of the Snowflake ODBC driver.
proc datasets library=&lib noprint;
modify &ds / correctencoding='iso-8859-1';
quit;
@AndrewZ wrote:
Quentin, I did extensive testing on this issue in SAS and other tools. In SAS, I used SAS Unicode (utf-8) and SAS English (wlatin1).
My workaround in SAS Unicode is to run PROC DATASETS like below every time I pull in data from Snowflake, but it only gives me iso-8859-1, which seems to be a limitation of the Snowflake ODBC driver.
proc datasets library=&lib noprint;
modify &ds / correctencoding='iso-8859-1';
quit;
I am not sure what your second paragraph means.
Did you look at the hexcodes in the dataset? Were they the valid UTF-8 bytes you expected?
That PROC DATASETS code will just change the metadata attribute that indicates the encoding used to create the file. Changing the metadata about the encoding of the text in the dataset will not change what is in the dataset. It just tells future users of the data what to expect to find when they look at the data.
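You can see that attribute in the dataset header with PROC CONTENTS (the dataset name below is just a placeholder):
proc contents data=work.krak2;   /* look at the Encoding field in the header */
run;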
Did you look at the hexcodes in the dataset? Were they the valid UTF-8 bytes you expected?
You mean use a hex editor on the .sas7bdat file? No. Based on my other tests (like one in the next paragraph), Snowflake was not sending UTF-8.
That PROC DATASETS code will just change the metadata attribute that indicates the encoding used to create the file. Changing the metadata about the encoding of the text in the dataset will not change what is in the dataset. It just tells future users of the data what to expect to find when they look at the data.
Have you tried this Libname option?
DBCLIENT_MAX_BYTES= LIBNAME Statement Option
One thing to keep in mind: variable lengths in Snowflake are based on character count, while in SAS they are based on byte count!
Therefore, what can be stored in Snowflake in a char/varchar(1) may require a SAS variable of length 2+ in order to correctly display the values.
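For example, something along these lines (a sketch only; the DSN name is a placeholder, and it assumes a SAS/ACCESS release where DBCLIENT_MAX_BYTES= is supported for your engine):
/* Reserve up to 4 bytes per Snowflake character when SAS sizes character variables */
libname snow odbc dsn="SnowflakeDSN" dbclient_max_bytes=4;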
Hope this helps
In Snowflake, I increased the length to varchar(1000) and, still in Snowflake, calculated len(text_sample). This screenshot of SAS shows that the maximum length of any string was 49, but in SAS the text was still corrupt.
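(The same length check can also be run from SAS via pass-through; a sketch, with SNOW as a placeholder libref:)
proc sql;
   connect using snow as db;
   select * from connection to db
      (select max(len(text_sample)) as max_len from zzz_unicode);
   disconnect from db;
quit;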