@ChrisNZ wrote:
You set insertbuff to 10,000. Have you tried higher values?
Yes, I have, and as I mentioned in my previous replies, having INSERTBUFF and DBCOMMIT both set to 32,767 is the fastest setting, at least in my environment. The Hadoop job takes about the same amount of time whether 1 row or 32,767 rows are inserted at a time, so setting INSERTBUFF and DBCOMMIT to their maximum values gives the best throughput. DBCOMMIT could be set to a larger value, but SAS takes the lesser of INSERTBUFF and DBCOMMIT, and since INSERTBUFF has a maximum valid value of 32,767, any DBCOMMIT setting beyond 32,767 is effectively ignored.
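For reference, here's roughly how I'm setting those options on the library (a sketch only; the DSN and schema names are placeholders, not my real ones):

/* Sketch only: the DSN and schema below are placeholders.                     */
libname hive_lib odbc dsn="MyHiveDSN" schema="my_schema"
        insertbuff=32767    /* rows buffered per INSERT; 32,767 is the maximum */
        dbcommit=32767;     /* commit interval; SAS uses the lesser of the two */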
Jim
No luck. As I suspected (see my previous reply), SAS translates Proc Append into a series of INSERT statements when it communicates with Hive (see the screen print from the Hadoop Resource Manager below). In effect, there is no difference between a Data step and a Proc Append in terms of how SAS interacts with Hive.
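For context, the test was along these lines (a sketch only; the libref and data set names are placeholders):

/* Sketch: hive_lib is an ODBC libref pointing at Hive; names are placeholders. */
proc append base=hive_lib.hic_numbers     /* target Hive table      */
            data=work.members;            /* SAS data set to upload */
run;
/* The trace shows this still reaches Hive as one INSERT per row. */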
Jim
@jimbarbour - Did you try INSERTBUFF? It is definitely an ODBC allowable option.
The way that SAS works is that it breaks each record in the SAS data set to be uploaded into a separate Map/Reduce job. That's one Hadoop Map/Reduce job per row. Say I have 10,000,000 ID numbers I want to upload, one per row. That means there will be 10,000,000 separate jobs. If each job runs in 15 - 20 seconds (which is about what they've been running), it would take about 5 years to do all the inserts. Now, let's say we can use INSERTBUFF and reduce it to 5 seconds per insert. The inserts would still take about 1.5 years. Even if we reduced it to 1 second per insert, the inserts would take more than 3 months. I will try INSERTBUFF, but I suspect that while it may help, it won't make the upload run in a practical amount of time.
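If I do test it, INSERTBUFF can be set as a data set option right on the output table; a rough sketch (libref and table names are placeholders):

/* Sketch only: libref and table names are placeholders.                        */
data hive_lib.hic_numbers_bulk (insertbuff=32767 dbcommit=32767);
   set work.members;    /* local SAS data set holding the ID numbers to upload  */
run;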
Jim
Well, I'll be danged. Not only did INSERTBUFF make a difference, it totally changed SAS's behavior.
First the time stamps. This was for a run of 250 records (infinitesimal by normal SAS standards).
Without INSERTBUFF:
NOTE: DATA statement used (Total process time):
      real time           1:25:03.04
      cpu time            0:00:01.32
With INSERTBUFF:
NOTE: DATA statement used (Total process time):
      real time           0:00:21.38
      cpu time            0:00:00.15
Basically, it took almost an hour and a half without INSERTBUFF versus only 21 seconds with INSERTBUFF, a nearly unbelievable difference!
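For anyone who wants to reproduce the traces below, SASTRACE output like this comes from system options roughly like the following (a sketch; my exact settings may have differed slightly):

/* Sketch of typical trace settings; not necessarily my exact options.           */
options sastrace=',,,d'      /* 'd' logs the SQL statements sent to the DBMS     */
        sastraceloc=saslog   /* write the trace messages to the SAS log          */
        nostsuffix;          /* drop the trailing trace details for readability  */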
The SASTRACE tells the full story. Take a close look at the following.
Without INSERTBUFF:
ODBC_5: Executed: on connection 2 CREATE TABLE Informatics_Prd.HIC_NUMBERS_BULK_JB (hic_num VARCHAR(20))
ODBC_6: Prepared: on connection 2 INSERT INTO Informatics_Prd.HIC_NUMBERS_BULK_JB (hic_num) VALUES ( ? )
ODBC_7: Executed: on connection 2 Prepared statement ODBC_6
ODBC_8: Executed: on connection 2 Prepared statement ODBC_6
ODBC_9: Executed: on connection 2 Prepared statement ODBC_6
...ODBC messages 10 - 253 redacted for brevity...
ODBC_254: Executed: on connection 2 Prepared statement ODBC_6
ODBC_255: Executed: on connection 2 Prepared statement ODBC_6
ODBC_256: Executed: on connection 2 Prepared statement ODBC_6
NOTE: There were 250 observations read from the data set WORK.MEMBERS_JB.
NOTE: The data set OPSI_RSC.HIC_NUMBERS_BULK_JB has 250 observations and 1 variables.
Summary Statistics for ODBC are:
Total SQL execution seconds were:                  5100.644141
Total SQL prepare seconds were:                    0.489638
Total seconds used by the ODBC ACCESS engine were  5101.169617
With INSERTBUFF:
ODBC_3: Executed: on connection 2 CREATE TABLE Informatics_Prd.HIC_NUMBERS_BULK_BUFF_JB (hic_num VARCHAR(20))
ODBC_4: Prepared: on connection 2 INSERT INTO Informatics_Prd.HIC_NUMBERS_BULK_BUFF_JB (hic_num) VALUES ( ? )
NOTE: There were 250 observations read from the data set WORK.MEMBERS_JB.
ODBC_5: Executed: on connection 2 Prepared statement ODBC_4
NOTE: The data set OPSI_RSC.HIC_NUMBERS_BULK_BUFF_JB has 250 observations and 1 variables.
Summary Statistics for ODBC are:
Total SQL execution seconds were:                  20.292003
Total SQL prepare seconds were:                    0.507490
Total seconds used by the ODBC ACCESS engine were  21.037232
Do you see the difference? Without INSERTBUFF, 250 separate communications are made via ODBC with Hadoop, and 250 separate Map/Reduce jobs are spawned in Hadoop. It takes a very long time.
With INSERTBUFF, message "ODBC_5" states that the statement prepared in ODBC_4 was executed, and, from the rest of the trace, we can see that it was executed just one time. Only one Map/Reduce job was spawned in Hadoop, and it finished in seconds.
Hallelujah! This is quite good. Now I need to try INSERTBUFF with larger sample sizes. As I recall, the max for INSERTBUFF is 32,767.
And, @SASKiwi, I apologize that I ever doubted you. I thought INSERTBUFF might make a difference, but I didn't think that INSERTBUFF would fundamentally change the way that SAS interacts with Hadoop. @ChrisNZ, thank you for your encouragement. I sometimes need a bit of a nudge to get over my preconceived notions.
Jim