I have some data in a SAS data set that I want to upload to Hadoop (the MapR implementation, not Cloudera/Hortonworks). I am connecting to Hadoop using the ODBC engine. I don't have the Hadoop engine available to me; just ODBC.
As is, INSERT after INSERT after INSERT is being performed, one row at a time. While that might work for a few hundred or even a few thousand rows, I have about 30,000,000 rows (just one column, an ID number) that I want to upload. Doing an INSERT one row at a time will probably complete about the same time that our sun flames out and becomes a cold, dark cinder. Call me impatient.
I want to use the BULKLOAD option. I code BULKLOAD=YES on my Libname, and the Libname is assigned, no problem. However, when I actually run my code to load the data to Hadoop from SAS, I get an "ERROR: Unable to initialize bulk loader" error message in my SAS log.
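For context, the Libname is along these lines (the DSN and schema names here are placeholders, not my actual values):

```sas
/* Sketch only -- DSN and schema are placeholders */
LIBNAME OPSI_RSC ODBC DSN=MapRHive SCHEMA=opsi_rsc BULKLOAD=YES;
```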
I don't suppose anyone can offer me a bit of a hint here, can they? I'd really like to use the BULKLOAD option, but I can't seem to find any information on how to trace down why the bulk loader won't initialize.
Jim
@jimbarbour - I don't know about Hadoop and bulk loading but have you tried the INSERTBUFF = option? This definitely speeds up loading traditional RDBMSs in my experience. Try values like 1000, 10000, 20000 etc and see if it helps.
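A minimal sketch of what that might look like on the Libname (the DSN name is a placeholder):

```sas
/* INSERTBUFF can go on the LIBNAME or as a data set option; DSN is a placeholder */
LIBNAME mydb ODBC DSN=MyDSN INSERTBUFF=10000;
```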
I spoke with SAS tech support, and, as I feared, the ODBC engine does not support BULKLOAD.
One possible workaround may be to use Proc Hadoop to copy the contents of a SAS table to a file stored on HDFS and then run a Hive query such that the file can be read as a Hive table. If I get that working, I'll report back.
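A rough sketch of that idea, assuming the HADOOP procedure is licensed; the host credentials, paths, and file names below are all hypothetical:

```sas
/* Sketch only: requires PROC HADOOP to be licensed.              */
/* Credentials, local path, and HDFS path are hypothetical.       */
proc hadoop username='hdfsuser' password='xxxx' verbose;
   hdfs copyfromlocal='/local/path/hic_numbers.txt'
        out='/user/hdfsuser/hic_numbers.txt';
run;
```

The file could then be exposed to Hive with a CREATE EXTERNAL TABLE statement pointing at the HDFS location.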
Jim
> I get an "ERROR: Unable to initialize bulk loader" error message in my SAS log.
This type of error message probably calls for Tech Support to have a look.
>As is, INSERT after INSERT after INSERT is being performed,
Do you use proc append to add the data?
Yes, you're quite right; this calls for SAS tech support. I did open a track with them, and they replied that the BULKLOAD option is not supported when using the ODBC engine. One must license the Hadoop engine in order to use the BULKLOAD option.
I was using a very simple DATA step to transfer the data from SAS to Hadoop. To wit:
DATA OPSI_RSC.Hic_Numbers_Bulk_JB;
   SET WORK.MEMBERS_JB (KEEP=hic_num OBS=&Obs);
RUN;
The Libname for OPSI_RSC has the BULKLOAD=YES option coded on it.
Jim
It seems to me that the error message should be different ("unsupported option") and be delivered when the library is defined.
As it is, it is extremely misleading, to the point of a defect imho.
proc append might be faster as it does not process anything (no PDV loading etc) and does not need to read one record at a time.
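For this case (using the library and data set names from the thread), the PROC APPEND version would look something like:

```sas
/* PROC APPEND avoids DATA step row-by-row processing on the SAS side */
proc append base=OPSI_RSC.Hic_Numbers_Bulk_JB
            data=WORK.MEMBERS_JB (keep=hic_num);
run;
```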
@ChrisNZ wrote:
It seems to me that the error message should be different ("unsupported option") and be delivered when the library is defined.
As it is, it is extremely misleading, to the point of a defect imho.
@ChrisNZ, agreed. Such an error message should be a) clear and b) issued at the time of allocation. SAS seems a bit lax about such things these days. I recently got a "read access violation" error that advised me to contact SAS Tech Support. They basically just shrugged it off.
In my opinion, good software does not crash or get memory/access violations. Well written software handles internal errors just a bit more gracefully. Based on the last three or so years of experience, I am not at all impressed with how SAS is handling Hadoop. I hope SAS makes it in the brave new world of the Cloud and "Big Data." They seem a bit, well, sloppy thus far.
ERROR: An exception has been encountered.
Please contact technical support and provide them with the following traceback information:

The SAS task name is [SQL]
ERROR: Read Access Violation SQL
Exception occurred at (802742F8)

Task Traceback
Address           Frame             (DBGHELP API Version 4.0 rev 5)
00000001802742F8  0000000007F2DFE0  MapRHiveODBC64:ConfigDSN+0x258868
... [redacted for brevity]
@ChrisNZ wrote:
proc append might be faster as it does not process anything (no PDV loading etc) and does not need to read one record at a time.
I'll try it, but really this isn't an issue of how SAS does things (as it most definitely would be were I using SAS data sets). What really matters here is what SAS passes to Hadoop/Hive. I suspect it will be a series of INSERTs, but it's worth a try. I think the better workaround will be to upload a file to HDFS and then "define" it as a table via Hive. I just need to figure out the specifics of how to do that, but, in theory, that would be the best way.
@ChrisNZ, do you have occasion to work with Hadoop? We've been experimenting with an alternative to Hive, something called Presto. As Hive sits on top of Hadoop, so also does Presto, but Presto is several orders of magnitude faster based on the testing we've done. I could send you my write up of my tests if you were interested. I had not heretofore heard of Presto and was not even aware that there was an alternative to Hive. I am quite impressed with Presto so far.
Jim
Yes I used to use SAS and Hadoop.
We ended up using Hive to copy data to Hadoop (using proc hadoop), and Impala to query the data, as these were fastest for their respective tasks.
> They basically just shrugged it off.
Shame.
Not surprised that proc append issues insert statements, but we had to try.
@SASKiwi What does using option INSERTBUFF actually do? Does it not issue INSERT statements? How can it insert several rows at once?
@jimbarbour Option INSERTBUFF seems to require option DBCOMMIT to be set too. Have you done this?
From the SAS documentation, the following databases support INSERTBUFF:
Data sources: Amazon Redshift, Aster, DB2 under UNIX and PC Hosts, Google BigQuery, Greenplum, HAWQ, Impala, JDBC, Microsoft SQL Server, MySQL, Netezza, ODBC, OLE DB, Oracle, PostgreSQL, SAP HANA, SAP IQ, Snowflake, Vertica, Yellowbrick
I see Impala which we don't have, but I don't see Hive. I suspect that INSERTBUFF may not have any effect on a Hive query.
Jim
P.S. Perhaps this is a consequence of our not licensing the Hadoop specific engine (we only have ODBC), but when I tried to run Proc Hadoop, I got a message that the procedure was not found. I think I'm just out of luck in terms of SAS. If I'm going to upload lists of variables like ID numbers for joins, it looks like I'm going to have to do it outside of SAS.
ERROR: Procedure HADOOP not found.
Jim
> Perhaps this is a consequence of our not licensing the Hadoop specific engine
Yes it is.
I think your best bet is to 1) transfer flat files to HDFS, and then 2) import the files.
Probably via 1) a DATA step with FILE/PUT, or FCOPY(), and then 2) PROC SQL with explicit pass-through (EXECUTE ... BY).
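A sketch of those two steps; all paths, connection options, and table names here are hypothetical:

```sas
/* Step 1: write the IDs to a flat file (path is hypothetical) */
data _null_;
   set work.members_jb (keep=hic_num);
   file '/staging/hic_numbers.txt';
   put hic_num;
run;

/* Step 2: after moving the file to HDFS, point Hive at it via    */
/* explicit pass-through (DSN and HDFS location are placeholders) */
proc sql;
   connect to odbc (dsn=MapRHive);
   execute (
      create external table hic_numbers_ext (hic_num string)
      stored as textfile
      location '/user/sas/hic_numbers'
   ) by odbc;
   disconnect from odbc;
quit;
```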
> I see Impala which we don't have, but I don't see Hive. I suspect that INSERTBUFF may not have any effect on a Hive query.
ODBC is there, which you are using, so we could have expected an effect.
(and hopefully of interest to @SASKiwi too)
You brought up the point of DBCOMMIT. This is a very valid point. Thank you.
I've now run a series of tests. First, I parameterized the buffer and DBcommit settings on my Libname and put in a log statement to keep track of them for testing. Note the DBcommit parameter below:
NOTE:
+-------------------------------------------------+
| Lib=OPSI_RSC allocated with buffers (Buffs=YES) |
| ReadBuff=10,000                                 |
| InsertBuff=10,000                               |
| DBcommit=100,000                                |
+-------------------------------------------------+
Next, I set my Obs parameter to 100,000 and run. Below is the SASTRACE.
ODBC_7: Executed: on connection 2
Prepared statement ODBC_6
ODBC_8: Executed: on connection 2
Prepared statement ODBC_6
ODBC_9: Executed: on connection 2
Prepared statement ODBC_6
ODBC_10: Executed: on connection 2
Prepared statement ODBC_6
ODBC_11: Executed: on connection 2
Prepared statement ODBC_6
ODBC_12: Executed: on connection 2
Prepared statement ODBC_6
ODBC_13: Executed: on connection 2
Prepared statement ODBC_6
ODBC_14: Executed: on connection 2
Prepared statement ODBC_6
ODBC_15: Executed: on connection 2
Prepared statement ODBC_6
ODBC_16: Executed: on connection 2
Prepared statement ODBC_6
NOTE: There were 100000 observations read from the data set WORK.MEMBERS_JB.
NOTE: The data set OPSI_RSC.HIC_NUMBERS_BULK_BUFF_JB has 100000 observations and 1 variables.

Summary Statistics for ODBC are:
Total SQL execution seconds were:                    212.398212
Total SQL prepare seconds were:                        0.456196
Total seconds used by the ODBC ACCESS engine were    213.266036
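For anyone wanting to reproduce a trace like the one above, output of this kind comes from the SASTRACE system options, e.g.:

```sas
/* Turn on DBMS tracing; trace output is written to the SAS log */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;
```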
Note in the above SASTRACE that the "prepared statement" is executed ten times. If my DBCOMMIT is 100,000 and my Obs parameter is 100,000, shouldn't the prepared statement be executed only once? Actually, no. SAS will take the lesser of the INSERTBUFF and DBCOMMIT parameters. Thus an INSERTBUFF of 10,000 and a DBCOMMIT of 100,000 results in Map/Reduce jobs in Hadoop running with 10,000 rows each. Note that if you do not code the DBCOMMIT parameter at all, DBCOMMIT defaults to 1,000, even if INSERTBUFF=32767. To get the fastest upload speed, DBCOMMIT and INSERTBUFF must be used in conjunction with one another.
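In other words, to get full 32,767-row batches, both options have to be raised together, roughly like this (the DSN name is a placeholder):

```sas
/* Both options at their 32,767 maximum; DSN is a placeholder */
LIBNAME OPSI_RSC ODBC DSN=MapRHive INSERTBUFF=32767 DBCOMMIT=32767;
```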
When using INSERTBUFF in conjunction with DBCOMMIT, uploading to Hadoop is roughly 32,000 times faster than when coding no buffer or commit parameters. Not too shabby for a parameter change!
Now, for my putative 10,000,000 row upload, at 20 seconds per commit with INSERTBUFF and DBCOMMIT set to the max, 32767, it would take about 2 hours to run, which isn't half bad. Using INSERTBUFF in conjunction with DBCOMMIT is a very workable method of uploading to Hadoop.
@SASKiwi, you suggested INSERTBUFF, so perforce I award the solution to you, but to @ChrisNZ, I thank you for your DBCOMMIT suggestion. By changing DBCOMMIT from its default of 1,000 to its maximum of 32,767, I should be able to gain a roughly thirtyfold increase in speed over the default.
Jim
@jimbarbour - Good to hear you have a solution. I did think of DBCOMMIT too but didn't put it in my response.