Quentin, the iso-8859-1 is not a great workaround for me either.
If you are desperate, an ugly hack is to use a pass-through query to tell Snowflake to write to a text file: see COPY INTO location and GET. Then in SAS use a regular PROC IMPORT. I haven't tried this exactly, but I do the exact opposite to bulk load large data sets quickly from SAS to Snowflake.
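Roughly, and untested for this export direction (the stage, paths, and the SNOW libref below are placeholders), it would look something like:
proc sql;
   connect using snow as db;   /* SNOW = an existing Snowflake libref */
   /* Unload the table to a UTF-8 text file in an internal stage */
   execute (copy into @my_stage/krak/
              from mydb.myschema.krak
              file_format = (type = csv compression = none)
              header = true) by db;
   /* Download the staged file(s) to the SAS machine */
   execute (get @my_stage/krak/ file://C:\temp\krak\) by db;
   disconnect from db;
quit;
/* Read the exported UTF-8 CSV back in; the output file name pattern is illustrative */
proc import datafile="C:\temp\krak\data_0_0_0.csv"
            out=work.krak dbms=csv replace;
run;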
If you have access to Snowflake support (yourself or via IT), file a ticket with Snowflake.
A month later, I'm back to playing with this. No big insights.
I start with a value, A.Ö. Krankenhaus, stored in Snowflake.
In a UTF-8 SAS session, when I pull that value from Snowflake, it shows that the O with umlaut is encoded as "D6"x (using pictures instead of text because I want to avoid the browser re-encoding things):
The SAS dataset is tagged as UTF-8 encoded, but "D6"x is not the proper UTF-8 encoding for O with umlaut. As per https://www.ascii-code.com/character/%C3%96 it should be encoded as "C396"x. "D6"x is the ISO-8859-1 encoding. So the above does not show O with umlaut, it shows a box.
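(In case anyone wants to reproduce the check, this is roughly how I dump the raw bytes; WORK.KRAK2 and TEXT_SAMPLE are placeholder names:)
data _null_;
   set work.krak2 (obs=1);
   put text_sample= $hex40.;   /* "D6"x = iso-8859-1 for O umlaut, "C396"x = UTF-8 */
run;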
If I use your PROC DATASETS "trick" to tell SAS that the dataset is not encoded as UTF-8 but is actually encoded as iso-8859-1:
proc datasets library=work noprint;
modify krak2 / correctencoding='iso-8859-1';
quit;
Then everything works:
That shows that CEDA (Cross-Environment Data Access) was used. So I guess that since I'm in a UTF-8 SAS session and SAS now knows the data are stored as iso-8859-1, it is translating the "D6"x value (the iso-8859-1 encoding for O umlaut) into "C396"x, which is the UTF-8 encoding.
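(A small illustration of that translation, using KCVT here just to mimic what CEDA does:)
data _null_;
   length utf8_val $4;
   utf8_val = kcvt('D6'x, 'latin1', 'utf-8');   /* convert the iso-8859-1 byte to UTF-8 */
   put utf8_val $hex4.;                         /* expect C396 */
run;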
So I agree with your conclusion, SAS is somehow receiving iso-8859-1 data from the snowflake ODBC driver, even though it should be sending UTF-8.
I then used the same ODBC driver to read the data into PowerBI, and everything seems to have worked fine. Which makes me wonder if the problem might be on the SAS side.
A year ago I tested with the ODBC Test utility found in MDAC, but the Unicode was corrupt, so I concluded SAS could not do any better. Today I tested with Excel and Python both connected to an ODBC DSN, and the Unicode looked great, so I will try again with SAS support.
Thanks Andrew. I feel like if there were a problem with the ODBC driver Snowflake provides, there would be LOTS of people complaining over at Stack Overflow, etc. When I search for this issue, I don't see many complaints. And my guess is that among SAS folks reading from Snowflake, most have SAS/ACCESS to Snowflake, so they might not have been tripped up by this issue. I feel like it's probably a SAS problem. They probably won't be able to fix it any time soon, but maybe some brilliant person in tech support will come up with an interesting workaround.
What happens when the UTF-8 character in Snowflake is NOT one that can be mapped into that single byte encoding? What does that map into in SAS? What do your other methods of accessing Snowflake retrieve?
Here are screenshots from yesterday comparing SAS Unicode mode and Excel reading the same Snowflake table via the Snowflake ODBC driver. (Excel used a DSN, while in SAS Unicode mode I tested both DSN and connection string with no difference in results.) Characters such as CJK, Russian, and Thai show as boxes in SAS.
Yesterday I also tested Python with pyodbc to Snowflake via DSN, and the result was correct.
A year ago I tested the ODBC Utility in Microsoft's MDAC, but the Unicode was corrupt. A year ago, I also tested Excel, and the Unicode was corrupt. I am not sure what changed for Excel, but it worked better yesterday than a year ago.
Hi @AndrewZ
Can you try running this macro against your zzz_unicode Snowflake table?
/****************************************************************************************************
File name: db_getmaxtextcolslengths.sas
Parent App:
File Type: Macro function
Purpose: Generate a SAS data set containing the required SAS column lengths for a specified Snowflake table
Inputs: p_inLibRef
p_inDbTableName
p_inDbSchema
p_inDbCatalog
p_outDsName
Outputs: Required SAS column lengths for the specified Snowflake table
****************************************************************************************************/
%MACRO db_getMaxTextColsLengths(
p_inLibRef=
, p_inDbTableName=
, p_inDbSchema=
, p_inDbCatalog=
, p_outDsName=);
%LOCAL
l_outLibName
l_outDsName
l_dbCat
l_dbSchema
l_dbTable
l_colNamesDS
;
%let l_outLibName = WORK;
%let l_outDsName = &p_outDsName;
%if (%INDEX(&p_outDsName, %STR(.)) GT 0) %then
%do;
%let l_outLibName = %SCAN(&p_outDsName,1,%STR(.));
%let l_outDsName = %SCAN(&p_outDsName,2,%STR(.));
%end;
/* Add required single quotes */
%let l_dbCat = %str(%')&p_inDbCatalog.%str(%');
%let l_dbSchema = %str(%')&p_inDbSchema.%str(%');
%let l_dbTable = %str(%')&p_inDbTableName.%str(%');
%let l_colNamesDS = WORK._tmp_col_names;
PROC SQL buffersize=50M noprint noerrorstop exec;
connect using &p_inLibRef as db;
CREATE TABLE &l_colNamesDS AS
SELECT
snow_col_len length=100 FORMAT=$100.
,sas_col_len length=200 FORMAT=$200.
,unicode_char_cnt length=200 FORMAT=$200.
,col_name
,col_type
FROM connection to db
(
SELECT
CASE WHEN DATA_TYPE <> 'TEXT' THEN ', '||TO_CHAR(COALESCE(NUMERIC_PRECISION,DATETIME_PRECISION,9),'FM999')||' ' ELSE
', MAX(LENGTH('||COLUMN_NAME||'))' END
|| ' AS snow_V' ||TO_CHAR(ORDINAL_POSITION,'FM999')||'_len' AS snow_col_len
,CASE WHEN DATA_TYPE <> 'TEXT' THEN ', 8' ELSE
', MAX(LENGTH('||COLUMN_NAME||') + (LENGTH(REGEXP_REPLACE('||COLUMN_NAME||', ''[\\x20-\\x7F]'', ''''))* 3))' END
|| ' AS sas_V' ||TO_CHAR(ORDINAL_POSITION,'FM999')||'_len' AS sas_col_len
,CASE WHEN DATA_TYPE <> 'TEXT' THEN ', 0' ELSE
', MAX(LENGTH(REGEXP_REPLACE('||COLUMN_NAME||', ''[\\x20-\\x7F]'', '''')))' END
|| ' AS V' ||TO_CHAR(ORDINAL_POSITION,'FM999')||'_unicode_char_cnt' AS unicode_char_cnt
,COLUMN_NAME as col_name
,DATA_TYPE as col_type
FROM
"&p_inDbCatalog"."INFORMATION_SCHEMA"."COLUMNS"
WHERE TABLE_CATALOG = &l_dbCat
AND TABLE_SCHEMA = &l_dbSchema
AND TABLE_NAME = &l_dbTable
/*AND DATA_TYPE = 'TEXT'*/
ORDER BY ORDINAL_POSITION
);
disconnect from db;
QUIT;
FILENAME dyncode TEMP;
DATA _NULL_;
FILE dyncode lrecl=300;
PUT '/* Data step generated PROC SQL call */';
PUT 'PROC SQL; ';
PUT +3 "CONNECT USING &p_inLibRef AS db; ";
PUT +3 "CREATE TABLE &p_outDsName AS ";
PUT +3 'SELECT * ';
PUT +3 'FROM connection to db ( ';
PUT +6 'SELECT ';
PUT +9 "'&p_inDbCatalog..&p_inDbSchema..&p_inDbTableName' AS Full_table_name ";
DO UNTIL (eof);
SET &l_colNamesDS end=eof;
PUT +9 snow_col_len ;
PUT +9 sas_col_len ;
PUT +9 unicode_char_cnt;
END;
PUT +6 'FROM "&p_inDbCatalog"."&p_inDbSchema"."&p_inDbTableName" ); ';
PUT +3 'disconnect from db; ';
PUT 'QUIT; ';
PUT ;
PUT "DATA &p_outDsName(KEEP=Full_table_name col_name col_type snow_col_len sas_col_len max_unicode_char_cnt) ; ";
PUT +3 "if (0) then SET &l_colNamesDS (KEEP=col_name col_type); ";
PUT +3 "SET &p_outDsName; ";
PUT +3 'RETAIN Full_table_name; ';
PUT +3 'Array saslens {*} 5 sas_: ; ';
PUT +3 'Array snowLens {*} 5 snow_: ; ';
PUT +3 'Array uniLens {*} 5 v: ; ';
PUT +3 'LENGTH snow_col_len sas_col_len max_unicode_char_cnt 5; ';
PUT;
PUT +3 'do _rec = 1 by 1 until(eof); ';
PUT +6 "SET &l_colNamesDS (KEEP=col_name col_type) end=eof; ";
PUT +6 'snow_col_len = snowLens[_rec]; ';
PUT +6 'sas_col_len = sasLens[_rec]; ';
PUT +6 'max_unicode_char_cnt = uniLens[_rec]; ';
PUT +6 'OUTPUT; ';
PUT +3 'end; ';
PUT +3 'STOP; ';
PUT 'RUN; ';
RUN;
%include dyncode;
%MEND db_getMaxTextColsLengths;
%db_getMaxTextColsLengths(
p_inLibRef=<SnowFlake-Libref>
, p_inDbCatalog=<SnowFlake-Catalog>
, p_inDbSchema=<SnowFlake-Schema>
, p_inDbTableName=zzz_unicode
, p_outDsName=<LIB.>SAS-DS-NAME)
Examine/open the generated data set to see how long the SAS character variable(s) have to be to properly represent the contents. The SAS variable(s) will most definitely require larger lengths.
Hope this helps,
Ahmed
Hi @AhmedAl_Attar. Understood that encoding data in UTF-8 will require longer variables, but I think SAS should handle that gracefully in a SAS session with UTF-8 encoding.
More to the point, the issue does not seem to be that variables are being truncated. Instead, it looks like the SAS dataset, which has a dataset attribute stating it is UTF-8 encoded, has values that are actually iso-8859-1 encoded. Please see this post: https://communities.sas.com/t5/SAS-Programming/Reading-Unicode-from-Snowflake-into-SAS-via-ODBC/m-p/.... I think that post shows that the O-umlaut character is encoded incorrectly in the SAS dataset, but the values are not being truncated.
Is your session encoding UTF-8?
proc options option = encoding;
run;
Quentin, I did extensive testing on this issue in SAS and other tools. In SAS, I used SAS Unicode (utf-8) and SAS English (wlatin1).
My workaround in SAS Unicode is to run PROC DATASETS like below every time I pull in data from Snowflake, but it only gives me iso-8859-1, which seems to be a limitation of the Snowflake ODBC driver.
proc datasets library=&lib noprint;
modify &ds / correctencoding='iso-8859-1';
quit;
@AndrewZ wrote:
Quentin, I did extensive testing on this issue in SAS and other tools. In SAS, I used SAS Unicode (utf-8) and SAS English (wlatin1).
My workaround in SAS Unicode is to run PROC DATASETS like below every time I pull in data from Snowflake, but it only gives me iso-8859-1, which seems to be a limitation of the Snowflake ODBC driver.
proc datasets library=&lib noprint;
modify &ds / correctencoding='iso-8859-1';
quit;
I am not sure what your second paragraph means.
Did you look at the hexcodes in the dataset? Were they the valid UTF-8 bytes you expected?
That PROC DATASETS code will just change the metadata attribute that indicates the encoding used to create the file. Changing the metadata about the encoding of the text in the dataset will not change what is in the dataset. It just tells future users of the data what to expect to find when they look at the data.
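You can see that attribute in the dataset header with PROC CONTENTS (the dataset name below is just a placeholder):
proc contents data=work.krak2;   /* look at the Encoding field in the header */
run;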
Did you look at the hexcodes in the dataset? Were they the valid UTF-8 bytes you expected?
You mean use a hex editor on the .sas7bdat file? No. Based on my other tests (like one in the next paragraph), Snowflake was not sending UTF-8.
That PROC DATASETS code will just change the metadata attribute that indicates the encoding used to create the file. Changing the metadata about the encoding of the text in the dataset will not change what is in the dataset. It just tells future users of the data what to expect to find when they look at the data.
Have you tried this Libname option?
DBCLIENT_MAX_BYTES= LIBNAME Statement Option
One thing to keep in mind: variable lengths in Snowflake are based on character count, while in SAS they are based on byte count!
Therefore, what can be stored in Snowflake in a char/varchar(1) may require a SAS variable of length 2+ in order to correctly display the values.
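For example, something along these lines (a sketch only; the DSN name is a placeholder, and it assumes a SAS/ACCESS release where DBCLIENT_MAX_BYTES= is supported for your engine):
/* Reserve up to 4 bytes per Snowflake character when SAS sizes character variables */
libname snow odbc dsn="SnowflakeDSN" dbclient_max_bytes=4;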
Hope this helps
In Snowflake, I increased the length to varchar(1000) and, still in Snowflake, calculated len(text_sample). This screenshot of SAS shows that the maximum length of any string was 49, but in SAS the text was still corrupt.
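(The same length check can also be run from SAS via pass-through; a sketch, with SNOW as a placeholder libref:)
proc sql;
   connect using snow as db;
   select * from connection to db
      (select max(len(text_sample)) as max_len from zzz_unicode);
   disconnect from db;
quit;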