@ChrisNZ wrote:
You set insertbuff to 10,000. Have you tried higher values?
Yes, I have, and as I mentioned in my previous replies, having INSERTBUFF and DBCOMMIT both set to 32,767 is the fastest setting, at least in my environment. The Hadoop job takes about the same amount of time whether 1 row or 32,767 rows are inserted at a time, so setting INSERTBUFF and DBCOMMIT to their maximum values gives the best throughput. DBCOMMIT could be set to a larger value, but SAS takes the lesser of INSERTBUFF and DBCOMMIT, and since INSERTBUFF has a maximum valid value of 32,767, any DBCOMMIT setting beyond 32,767 is effectively ignored.
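For reference, here's roughly how I'm setting those options on the library (a sketch only; the DSN and schema names are placeholders, not my real ones):

/* Sketch only: the DSN and schema below are placeholders.                     */
libname hive_lib odbc dsn="MyHiveDSN" schema="my_schema"
        insertbuff=32767    /* rows buffered per INSERT; 32,767 is the maximum */
        dbcommit=32767;     /* commit interval; SAS uses the lesser of the two */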
Jim
No luck. As I suspected (see my previous reply), SAS translates Proc Append into a series of INSERT statements when it communicates with Hive (see the screen print from the Hadoop Resource Manager below). In effect, there is no difference between a Data step and a Proc Append in terms of how SAS interacts with Hive.
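For context, the test was along these lines (a sketch only; the libref and data set names are placeholders):

/* Sketch: hive_lib is an ODBC libref pointing at Hive; names are placeholders. */
proc append base=hive_lib.hic_numbers     /* target Hive table      */
            data=work.members;            /* SAS data set to upload */
run;
/* The trace shows this still reaches Hive as one INSERT per row. */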
Jim
@jimbarbour - Did you try INSERTBUFF? It is definitely an ODBC allowable option.
The way that SAS works is that it breaks each record in the SAS data set to be uploaded into a separate Map/Reduce job. That's one Hadoop Map/Reduce job per row. Say I have 10,000,000 ID numbers I want to upload, one per row. That means there will be 10,000,000 separate jobs. If each job runs in 15 - 20 seconds (which is about what they've been running), it would take about 5 years to do all the inserts. Now, let's say we can use INSERTBUFF and reduce it to 5 seconds per insert. The inserts would still take about 1.5 years. Even if we reduced it to 1 second per insert, the inserts would take more than 3 months. I will try INSERTBUFF, but I suspect that while it may help, it won't make the upload run in a practical amount of time.
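If I do test it, INSERTBUFF can be set as a data set option right on the output table; a rough sketch (libref and table names are placeholders):

/* Sketch only: libref and table names are placeholders.                        */
data hive_lib.hic_numbers_bulk (insertbuff=32767 dbcommit=32767);
   set work.members;    /* local SAS data set holding the ID numbers to upload  */
run;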
Jim
Well, I'll be danged. Not only did INSERTBUFF make a difference, it totally changed SAS's behavior.
First the time stamps. This was for a run of 250 records (infinitesimal by normal SAS standards).
Without INSERTBUFF:
NOTE: DATA statement used (Total process time):
      real time           1:25:03.04
      cpu time            0:00:01.32
With INSERTBUFF:
NOTE: DATA statement used (Total process time):
      real time           0:00:21.38
      cpu time            0:00:00.15
Basically, it took almost an hour and a half without INSERTBUFF versus only 21 seconds with INSERTBUFF, a nearly unbelievable difference!
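For anyone who wants to reproduce the traces below, SASTRACE output like this comes from system options roughly like the following (a sketch; my exact settings may have differed slightly):

/* Sketch of typical trace settings; not necessarily my exact options.           */
options sastrace=',,,d'      /* 'd' logs the SQL statements sent to the DBMS     */
        sastraceloc=saslog   /* write the trace messages to the SAS log          */
        nostsuffix;          /* drop the trailing trace details for readability  */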
The SASTRACE tells the full story. Take a close look at the following.
Without INSERTBUFF:
ODBC_5: Executed: on connection 2 CREATE TABLE Informatics_Prd.HIC_NUMBERS_BULK_JB (hic_num VARCHAR(20))
ODBC_6: Prepared: on connection 2 INSERT INTO Informatics_Prd.HIC_NUMBERS_BULK_JB (hic_num) VALUES ( ? )
ODBC_7: Executed: on connection 2 Prepared statement ODBC_6
ODBC_8: Executed: on connection 2 Prepared statement ODBC_6
ODBC_9: Executed: on connection 2 Prepared statement ODBC_6
...ODBC messages 10 - 253 redacted for brevity...
ODBC_254: Executed: on connection 2 Prepared statement ODBC_6
ODBC_255: Executed: on connection 2 Prepared statement ODBC_6
ODBC_256: Executed: on connection 2 Prepared statement ODBC_6
NOTE: There were 250 observations read from the data set WORK.MEMBERS_JB.
NOTE: The data set OPSI_RSC.HIC_NUMBERS_BULK_JB has 250 observations and 1 variables.
Summary Statistics for ODBC are:
Total SQL execution seconds were:                  5100.644141
Total SQL prepare seconds were:                    0.489638
Total seconds used by the ODBC ACCESS engine were  5101.169617
With INSERTBUFF:
ODBC_3: Executed: on connection 2 CREATE TABLE Informatics_Prd.HIC_NUMBERS_BULK_BUFF_JB (hic_num VARCHAR(20))
ODBC_4: Prepared: on connection 2 INSERT INTO Informatics_Prd.HIC_NUMBERS_BULK_BUFF_JB (hic_num) VALUES ( ? )
NOTE: There were 250 observations read from the data set WORK.MEMBERS_JB.
ODBC_5: Executed: on connection 2 Prepared statement ODBC_4
NOTE: The data set OPSI_RSC.HIC_NUMBERS_BULK_BUFF_JB has 250 observations and 1 variables.
Summary Statistics for ODBC are:
Total SQL execution seconds were:                  20.292003
Total SQL prepare seconds were:                    0.507490
Total seconds used by the ODBC ACCESS engine were  21.037232
Do you see the difference? Without INSERTBUFF, 250 separate communications are made via ODBC with Hadoop, and 250 separate Map/Reduce jobs are spawned in Hadoop. It takes a very long time.
With INSERTBUFF, message "ODBC_5" states that the statement prepared in ODBC_4 was executed, and, from the rest of the trace, we can see that it was executed just one time. Only one Map/Reduce job was spawned in Hadoop, and it finished in seconds.
Hallelujah! This is quite good. Now I need to try INSERTBUFF with larger sample sizes. As I recall, the max for INSERTBUFF is 32,767.
And, @SASKiwi, I apologize that I ever doubted you. I thought INSERTBUFF might make a difference, but I didn't think that INSERTBUFF would fundamentally change the way that SAS interacts with Hadoop. @ChrisNZ, thank you for your encouragement. I sometimes need a bit of a nudge to get over my preconceived notions.
Jim