I have some data in a SAS data set that I want to upload to Hadoop (the MapR implementation, not Cloudera/Hortonworks). I am connecting to Hadoop using the ODBC engine. I don't have the Hadoop engine available to me; just ODBC.
As is, INSERT after INSERT after INSERT is being performed, one row at a time. While that might work for a few hundred or even a few thousand rows, I have about 30,000,000 rows (just one column, an ID number) that I want to upload. Doing an INSERT one row at a time will probably complete about the same time that our sun flames out and becomes a cold, dark cinder. Call me impatient.
I want to use the BULKLOAD option. I code BULKLOAD=YES on my Libname, and the Libname is assigned, no problem. However, when I actually run my code to load the data to Hadoop from SAS, I get an "ERROR: Unable to initialize bulk loader" error message in my SAS log.
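For context, the Libname is along these lines (the DSN and schema names here are placeholders, not my actual values):

```sas
/* Sketch only -- DSN and schema are placeholders */
LIBNAME OPSI_RSC ODBC DSN=MapRHive SCHEMA=opsi_rsc BULKLOAD=YES;
```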
I don't suppose anyone can offer me a bit of a hint here, can they? I'd really like to use the BULKLOAD option, but I can't seem to find any information on how to trace down why the bulk loader won't initialize.
Jim
@jimbarbour - I don't know about Hadoop and bulk loading but have you tried the INSERTBUFF = option? This definitely speeds up loading traditional RDBMSs in my experience. Try values like 1000, 10000, 20000 etc and see if it helps.
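A minimal sketch of what that might look like on the Libname (the DSN name is a placeholder):

```sas
/* INSERTBUFF can go on the LIBNAME or as a data set option; DSN is a placeholder */
LIBNAME mydb ODBC DSN=MyDSN INSERTBUFF=10000;
```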
I spoke with SAS tech support, and, as I feared, the ODBC engine does not support BULKLOAD.
One possible workaround may be to use Proc Hadoop to copy the contents of a SAS table to a file stored on HDFS and then run a Hive query such that the file can be read as a Hive table. If I get that working, I'll report back.
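A rough sketch of that idea, assuming the HADOOP procedure is licensed; the host credentials, paths, and file names below are all hypothetical:

```sas
/* Sketch only: requires PROC HADOOP to be licensed.              */
/* Credentials, local path, and HDFS path are hypothetical.       */
proc hadoop username='hdfsuser' password='xxxx' verbose;
   hdfs copyfromlocal='/local/path/hic_numbers.txt'
        out='/user/hdfsuser/hic_numbers.txt';
run;
```

The file could then be exposed to Hive with a CREATE EXTERNAL TABLE statement pointing at the HDFS location.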
Jim
> I get an "ERROR: Unable to initialize bulk loader" error message in my SAS log.
This type of error message probably calls for Tech Support to have a look.
>As is, INSERT after INSERT after INSERT is being performed,
Do you use proc append to add the data?
Yes, you're quite right; this calls for SAS tech support. I did open a track with them, and they replied that the BULKLOAD option is not supported when using the ODBC engine. One must license the Hadoop engine in order to use the BULKLOAD option.
I was using a very simple DATA step to transfer the data from SAS to Hadoop. To wit:
DATA OPSI_RSC.Hic_Numbers_Bulk_JB;
   SET WORK.MEMBERS_JB (KEEP=hic_num OBS=&Obs);
RUN;
The Libname for OPSI_RSC has the BULKLOAD=YES option coded on it.
Jim
It seems to me that the error message should be different ("unsupported option") and be delivered when the library is defined.
As it is, it is extremely misleading, to the point of a defect imho.
proc append might be faster as it does not process anything (no PDV loading etc) and does not need to read one record at a time.
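For this case (using the library and data set names from the thread), the PROC APPEND version would look something like:

```sas
/* PROC APPEND avoids DATA step row-by-row processing on the SAS side */
proc append base=OPSI_RSC.Hic_Numbers_Bulk_JB
            data=WORK.MEMBERS_JB (keep=hic_num);
run;
```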
@ChrisNZ wrote:
It seems to me that the error message should be different ("unsupported option") and be delivered when the library is defined.
As it is, it is extremely misleading, to the point of a defect imho.
@ChrisNZ, agreed. Such an error message should be a) clear and b) issued at the time of allocation. SAS seems a bit lax about such things these days. I recently got a "read access violation" error that advised me to contact SAS Tech Support. They basically just shrugged it off.
In my opinion, good software does not crash or get memory/access violations. Well written software handles internal errors just a bit more gracefully. Based on the last three or so years of experience, I am not at all impressed with how SAS is handling Hadoop. I hope SAS makes it in the brave new world of the Cloud and "Big Data." They seem a bit, well, sloppy thus far.
ERROR: An exception has been encountered.
Please contact technical support and provide them with the following traceback information:

The SAS task name is [SQL]
ERROR: Read Access Violation SQL
Exception occurred at (802742F8)

Task Traceback
Address           Frame             (DBGHELP API Version 4.0 rev 5)
00000001802742F8  0000000007F2DFE0  MapRHiveODBC64:ConfigDSN+0x258868
... [redacted for brevity]
@ChrisNZ wrote:
proc append might be faster as it does not process anything (no PDV loading etc) and does not need to read one record at a time.
I'll try it, but really this isn't an issue of how SAS does things (as it most definitely would be were I using SAS data sets). What really matters here is what SAS passes to Hadoop/Hive. I suspect it will be a series of INSERTs, but it's worth a try. I think the better workaround will be to upload a file to HDFS and then "define" it as a table via Hive. I just need to figure out the specifics of how to do that, but, in theory, that would be the best way.
@ChrisNZ, do you have occasion to work with Hadoop? We've been experimenting with an alternative to Hive, something called Presto. As Hive sits on top of Hadoop, so also does Presto, but Presto is several orders of magnitude faster based on the testing we've done. I could send you my write up of my tests if you were interested. I had not heretofore heard of Presto and was not even aware that there was an alternative to Hive. I am quite impressed with Presto so far.
Jim
Yes I used to use SAS and Hadoop.
We ended up using Hive to copy data to Hadoop (using proc hadoop), and Impala to query the data, as these were fastest for their respective tasks.
> They basically just shrugged it off.
Shame.
Not surprised that proc append issues insert statements, but we had to try.
@SASKiwi What does using option INSERTBUFF actually do? Does it not issue INSERT statements? How can it insert several rows at once?
@jimbarbour Option INSERTBUFF seems to require option DBCOMMIT to be set too. Have you done this?
From the SAS documentation, the following databases support INSERTBUFF:
Data sources: Amazon Redshift, Aster, DB2 under UNIX and PC Hosts, Google BigQuery, Greenplum, HAWQ, Impala, JDBC, Microsoft SQL Server, MySQL, Netezza, ODBC, OLE DB, Oracle, PostgreSQL, SAP HANA, SAP IQ, Snowflake, Vertica, Yellowbrick
I see Impala which we don't have, but I don't see Hive. I suspect that INSERTBUFF may not have any effect on a Hive query.
Jim
P.S. Perhaps this is a consequence of our not licensing the Hadoop specific engine (we only have ODBC), but when I tried to run Proc Hadoop, I got a message that the procedure was not found. I think I'm just out of luck in terms of SAS. If I'm going to upload lists of variables like ID numbers for joins, it looks like I'm going to have to do it outside of SAS.
ERROR: Procedure HADOOP not found.
Jim
> Perhaps this is a consequence of our not licensing the Hadoop specific engine
Yes it is.
I think your best bet is to 1) transfer flat files to HDFS, and then 2) import the files.
Probably via 1) a DATA step with FILE/PUT, or FCOPY(), and then 2) PROC SQL with explicit pass-through (EXECUTE ... BY).
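A sketch of those two steps; all paths, connection options, and table names here are hypothetical:

```sas
/* Step 1: write the IDs to a flat file (path is hypothetical) */
data _null_;
   set work.members_jb (keep=hic_num);
   file '/staging/hic_numbers.txt';
   put hic_num;
run;

/* Step 2: after moving the file to HDFS, point Hive at it via    */
/* explicit pass-through (DSN and HDFS location are placeholders) */
proc sql;
   connect to odbc (dsn=MapRHive);
   execute (
      create external table hic_numbers_ext (hic_num string)
      stored as textfile
      location '/user/sas/hic_numbers'
   ) by odbc;
   disconnect from odbc;
quit;
```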
> I see Impala which we don't have, but I don't see Hive. I suspect that INSERTBUFF may not have any effect on a Hive query.
ODBC is there, which you are using, so we could have expected an effect.
(and hopefully of interest to @SASKiwi too)
You brought up the point of DBCOMMIT. This is a very valid point. Thank you.
I've now run a series of tests. First, I parameterized the buffer and DBcommit settings on my Libname and put in a log statement to keep track of them for testing. Note the DBcommit parameter below:
NOTE:
+-------------------------------------------------+
| Lib=OPSI_RSC allocated with buffers (Buffs=YES) |
| ReadBuff=10,000                                 |
| InsertBuff=10,000                               |
| DBcommit=100,000                                |
+-------------------------------------------------+
Next, I set my Obs parameter to 100,000 and run. Below is the SASTRACE.
ODBC_7: Executed: on connection 2
Prepared statement ODBC_6
ODBC_8: Executed: on connection 2
Prepared statement ODBC_6
ODBC_9: Executed: on connection 2
Prepared statement ODBC_6
ODBC_10: Executed: on connection 2
Prepared statement ODBC_6
ODBC_11: Executed: on connection 2
Prepared statement ODBC_6
ODBC_12: Executed: on connection 2
Prepared statement ODBC_6
ODBC_13: Executed: on connection 2
Prepared statement ODBC_6
ODBC_14: Executed: on connection 2
Prepared statement ODBC_6
ODBC_15: Executed: on connection 2
Prepared statement ODBC_6
ODBC_16: Executed: on connection 2
Prepared statement ODBC_6
NOTE: There were 100000 observations read from the data set WORK.MEMBERS_JB.
NOTE: The data set OPSI_RSC.HIC_NUMBERS_BULK_BUFF_JB has 100000 observations and 1 variables.

Summary Statistics for ODBC are:
Total SQL execution seconds were:                    212.398212
Total SQL prepare seconds were:                        0.456196
Total seconds used by the ODBC ACCESS engine were    213.266036
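For anyone wanting to reproduce a trace like the one above, output of this kind comes from the SASTRACE system options, e.g.:

```sas
/* Turn on DBMS tracing; trace output is written to the SAS log */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;
```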
Note in the above SASTRACE that the "prepared statement" is executed ten times. If my DBCOMMIT is 100,000 and my Obs parameter is 100,000, shouldn't the prepared statement be executed only once? Actually, no. SAS will take the lesser of the INSERTBUFF and DBCOMMIT parameters. Thus an INSERTBUFF of 10,000 and a DBCOMMIT of 100,000 results in Map/Reduce jobs in Hadoop running with 10,000 rows each. Note that if you do not code the DBCOMMIT parameter at all, DBCOMMIT defaults to 1,000, even if INSERTBUFF=32767. To get the fastest upload speed, DBCOMMIT and INSERTBUFF must be used in conjunction with one another.
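In other words, to get full 32,767-row batches, both options have to be raised together, roughly like this (the DSN name is a placeholder):

```sas
/* Both options at their 32,767 maximum; DSN is a placeholder */
LIBNAME OPSI_RSC ODBC DSN=MapRHive INSERTBUFF=32767 DBCOMMIT=32767;
```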
When using INSERTBUFF in conjunction with DBCOMMIT, uploading to Hadoop is roughly 32,000 times faster than when coding no buffer or commit parameters. Not too shabby for a parameter change!
Now, for my putative 10,000,000 row upload, at 20 seconds per commit with INSERTBUFF and DBCOMMIT set to the max, 32767, it would take about 2 hours to run, which isn't half bad. Using INSERTBUFF in conjunction with DBCOMMIT is a very workable method of uploading to Hadoop.
@SASKiwi, you suggested INSERTBUFF, so perforce I award the solution to you, but to @ChrisNZ, I thank you for your DBCOMMIT suggestion. By changing DBCOMMIT from its default of 1,000 to its maximum of 32,767, I should be able to gain a roughly thirtyfold increase in speed over the default.
Jim
@jimbarbour - Good to hear you have a solution. I did think of DBCOMMIT too but didn't put it in my response.