Greetings! First time caller, long time listener. I was hoping you all could help me with some field length issues we're having and, as a follow-on, help determine why certain table properties won't pass on implicit SQL but do on explicit SQL.

My company uses Cloudera/Impala (ODBC version 2.5.29, ImpalaD 2.1.3-cdh5.3) to access Hadoop via SAS/ACCESS. We're currently running SAS 9.4 with the STAT 13.2 and ETS 13.2 analytical packs.

Consider a Hadoop view (VIEW_A) that has a field (FIELD_Z) defined as VARCHAR(10). I run the following code:

LIBNAME IMP SASIOIMP SERVER="server" PORT=21050 USER=&HAD_USER PASSWORD=&HAD_PW DATABASE=DEV_SL;

68   PROC SQL;
69   CREATE TABLE ORDER_HDR_IMP_1 AS
70   SELECT
71   FIELD_Z
72   FROM VIEW_A;
73   QUIT;
NOTE: Compressing data set WORK.ORDER_HDR_IMP_1 decreased size by 99.78 percent.
      Compressed is 47 pages; un-compressed would require 21144 pages.
NOTE: Table WORK.ORDER_HDR_IMP_1 created, with 148002 rows and 1 columns.
89   QUIT;
NOTE: PROCEDURE SQL used (Total process time):
      real time    11.09 seconds
      cpu time     8.48 seconds

When I check the length of FIELD_Z, it is $32767...even though the view is defined as VARCHAR(10). Now, I try the explicit SQL method:

90   PROC SQL;
91   CONNECT TO SASIOIMP (DATABASE=DEV_SL USER=&HAD_USER PASSWORD="&HAD_PW"
91 ! SERVER="server" SQL_FUNCTIONS=ALL);
92   CREATE TABLE ORDER_HDR_EXP_2 AS
93   SELECT * FROM CONNECTION TO SASIOIMP
94   (SELECT FIELD_Z
95   FROM VIEW_A);
96   QUIT;
NOTE: Compression was disabled for data set WORK.ORDER_HDR_EXP_2 because compression overhead would increase the size of the data set.
IMPALA_8: Executed: on connection 2
Prepared statement IMPALA_7
NOTE: Table WORK.ORDER_HDR_EXP_2 created, with 148002 rows and 1 columns.
114  QUIT;
NOTE: PROCEDURE SQL used (Total process time):
      real time    1.29 seconds
      cpu time     0.24 seconds

The above results in the proper length for FIELD_Z of $10. Strange.
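To frame question 3 below: the one LIBNAME-level mitigation we've come across so far (untested on our side, so treat this as a sketch based on our reading of the SAS/ACCESS docs) is capping string lengths with the DBMAX_TEXT= option:

```sas
/* Sketch of a LIBNAME-level cap on string lengths (values are our own;
   untested). DBMAX_TEXT= limits how many characters SAS reserves for
   Hadoop/Impala string columns that arrive without a usable length.
   Caveat: it applies to EVERY affected character column on the libref,
   so 10 is only safe if no string field is ever longer than that. */
LIBNAME IMP SASIOIMP SERVER="server" PORT=21050
        USER=&HAD_USER PASSWORD=&HAD_PW DATABASE=DEV_SL
        DBMAX_TEXT=10;
```

It's a blunt instrument compared to per-column metadata, which is part of why I'm asking about SASFMT.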
I can note that the explicit query returned the exact same records, but ran in roughly 90% less time and took up 99.99% less space (before compression).

We are currently not using the SASFMT metadata option, as our IT department is a little.....hesitant.....to add this metadata tag, and I have not been able to find much on the web regarding its usage. Our Hadoop install is pretty new (traditionally a DB2 shop, Teradata now as well), so we're still working out the kinks. Right now we've got a handful of tables, the largest having about 17M records and 25 columns, relatively small compared to our DB2/Teradata tables. We use ETL to load the Hadoop tables and SAS to perform analytics. However, pulling data from these tables is extremely inefficient due to these field length issues.

So, a few questions:

1. Why the difference in dataset field length between implicit SQL and explicit SQL?
2. Is anybody currently using the SASFMT metadata option? If so, any concerns/issues? Is it meeting your requirements?
3. For those not using SASFMT, how do you get around the $32767 default length for STRING fields?

Thanks in advance, I appreciate the help! Feel free to ask for any additional information and I'm happy to provide whatever I can.

Matt
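For context on question 2, here is my understanding of how SASFMT would be applied, pieced together from the SAS/ACCESS to Hadoop documentation. Treat the exact property syntax as an assumption on Impala, and note that the table name TABLE_A is hypothetical (my understanding is the property goes on a base table, not a view):

```sas
/* Sketch (unverified): attach a SASFMT table property so that implicit
   SQL can pick up a real length for FIELD_Z instead of defaulting to
   $32767. Issued through explicit pass-through so Impala runs the DDL. */
PROC SQL;
  CONNECT TO SASIOIMP (DATABASE=DEV_SL USER=&HAD_USER
                       PASSWORD="&HAD_PW" SERVER="server");
  EXECUTE (ALTER TABLE TABLE_A
           SET TBLPROPERTIES ('SASFMT:FIELD_Z'='CHAR(10)'))
    BY SASIOIMP;
  DISCONNECT FROM SASIOIMP;
QUIT;
```

If anyone has actually run with this metadata in production, that's exactly the experience I'm hoping to hear about.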