Before you can analyze your data, you first need to load it. SAS® Federation Server 4.4 introduces new bulk-loading options that can noticeably speed up your data inserts.
During normal writes, the LIBNAME engine usually has to wait for each chunk of data to be successfully written before gathering the next chunk. The BULKLOAD= option allows the SAS LIBNAME engine for SAS® Federation Server to stream data to the server without waiting for confirmation that it has been written. Once the data has been completely written, the log shows how many rows were successfully inserted, as well as how many rows generated warnings or errors if there were problems.
Streaming the data to the server allows the LIBNAME engine to avoid having to wait for confirmation of each chunk and can often improve performance. The downside is that you won’t get notification about any errors or warnings until you are done, but this isn’t an issue if you are generally expecting the jobs to succeed.
Consider a case where a DATA step or PROC SQL job is trying to insert 703 rows into a relational database such as Oracle or DB2. The user has set INSERTBUFF=100 to improve performance, but also wants to use BULKLOAD=YES to be even faster.
In the example above, 300 rows have already been inserted into the database (A). The Federation Server (B) has cached 2 blocks of rows and will insert them into the database server as quickly as possible. The FEDSVR LIBNAME engine (C) has also cached 2 blocks of rows and will send them to the Federation Server as quickly as possible, while MVA SAS (D) has the remaining 4. Because the FEDSVR LIBNAME engine and the Federation Server can buffer rows, MVA SAS and the FEDSVR LIBNAME engine can process and send rows without waiting for them to be written to the database. Once the MVA SAS job has sent all of its rows, it waits for them to be written before completing.
Adding two places where rows can be buffered means the pipeline is divided into three stages running in parallel: MVA SAS (D), the FEDSVR LIBNAME engine (C), and the Federation Server/database server (B/A). In theory the best possible performance would be roughly one third of the usual time, but in practice the gains are smaller. In most cases one of the three stages takes considerably longer than the other two, and that stage's time becomes the limiting factor; for example, if the three stages would take 6, 4, and 10 seconds on their own, the overall run takes roughly 10 seconds rather than 20. In a test environment we usually see improvements of 20% to 40% when inserting large amounts of data into relational databases, and slightly better results when inserting into BASE data sets.
Enabling bulk-loading can be as simple as adding the BULKLOAD=YES option to your LIBNAME statement:
libname output FEDSVR BULKLOAD=YES SERVER=fsserver PORT=24141
   USER=fsuser PASSWORD=xxxxxxxx DSN=BASE;

data output.example;
   set input.accounts;
   if idnum > 1000 then output;
run;
NOTE: There were 100000 observations read from the data set INPUT.ACCOUNTS.
NOTE: The data set OUTPUT.EXAMPLE has 99000 observations and 152 variables.
NOTE: DATA statement used (Total process time):
      real time           0.95 seconds
      cpu time            0.81 seconds
NOTE: BulkLoad rows added=99000
The final note lets you know that bulk-load was able to successfully insert all 99,000 rows.
In many cases the default bulk-load option values that the LIBNAME engine picks will be sufficient, but you may want to consider one or more of the following options to improve performance or change behavior. In most situations INSERTBUFF= is the only option that makes a significant performance difference; a combined example follows the option descriptions below.
BULKLOAD=<bulk-load-queue-size>
This option controls the number of blocks of rows that the bulk-loader will cache while waiting for the server to respond. The default is 4 when BULKLOAD=YES is specified, which is usually sufficient, but increasing it may improve performance when the number of rows per block is low.
INSERTBUFF=<rows-per-insert>
This option tells the LIBNAME engine how many rows to write at a time. Larger values will use more memory but will generally improve performance.
BLROWS=<rows-per-insert>
This option tells the bulk-loader how many rows should be in each bulk-load block streamed to the server. By default this is the value specified for INSERTBUFF=, but this option allows you to specify a larger bulk-load block size if desired.
BLBYTES=<bytes-per-insert>
This option tells the bulk-loader the maximum size, in bytes, of each bulk-load block streamed to the server. The value specified is divided by the maximum row size for the table to calculate the number of rows in each bulk-load block. If BLROWS= is also specified, the larger of the two row counts is used.
BLCOMMIT=<number-of-rows-per-commit>
This option specifies how often the LIBNAME engine issues a commit while bulk-loading. If this value is less than the value specified for INSERTBUFF=, a commit is issued after each write. This option only applies to data sources that support transactions.
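As a rough sketch of how these options might be combined, the LIBNAME statement below reuses the connection values from the earlier example; the BULKLOAD= value sets the bulk-load queue size described above, and the specific numbers are illustrative rather than recommended settings, so tune them for your own data and environment.
/* Illustrative settings only: cache up to 8 bulk-load blocks, write 1,000 */
/* rows per insert, and commit roughly every 10,000 rows.                  */
libname output FEDSVR SERVER=fsserver PORT=24141
   USER=fsuser PASSWORD=xxxxxxxx DSN=BASE
   BULKLOAD=8
   INSERTBUFF=1000
   BLCOMMIT=10000;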
In most cases bulk-loading behaves the same as normal writes, but there are cases where write operations will generate warnings or errors. Because the LIBNAME engine is streaming the data to the server, it won't get error notifications right away. If the bulk-load generates error or warning messages, they are written to the SAS® Federation Server log, and the LIBNAME engine outputs log messages indicating the number of rows that generated warnings or errors. For example:
NOTE: The data set OUTPUT.EXAMPLE has 99000 observations and 152 variables.
NOTE: BulkLoad rows added=98897
ERROR: BulkLoad row errors=103
WARNING: BulkLoad row warnings=731
In this example, the LIBNAME engine tried to write out 99,000 rows, but was only able to write out 98,897 rows. Of those 98,897 rows, 731 rows had warnings. Details about these warnings and errors will be contained in the SAS® Federation Server log.
By default, the bulk-load operation will continue even if errors or warnings are encountered. The following options can temporarily disable bulk-load in the event of problems, although the LIBNAME engine will continue to insert rows normally until all data has been sent to the server. Once bulk-load is disabled, any further errors or warnings are returned to the LIBNAME engine, which handles and reports them normally. Because the rows are streamed to the server before errors can be detected, the bulk-loader discards any pending rows when it is disabled. A short example follows the two options below.
BLERRLIMIT=<max-number-of-errors>
The bulk-loader will be disabled when it is unable to insert this number of rows. Any rows in pending bulk-load blocks will be discarded and marked as errors.
BLWARNLIMIT=<max-number-of-warnings>
The bulk-loader will be disabled when this number of rows generate warnings. Any rows in pending bulk-load blocks will be discarded and marked as errors.
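For example, a minimal sketch of a LIBNAME statement that gives up on bulk-loading after a small number of failures might look like the following; the connection values again reuse the earlier example, and the limits are arbitrary.
/* Once 50 rows fail to insert or 500 rows generate warnings, bulk-load is */
/* disabled, pending blocks are discarded and marked as errors, and the    */
/* remaining rows are inserted and reported through the normal path.       */
libname output FEDSVR BULKLOAD=YES SERVER=fsserver PORT=24141
   USER=fsuser PASSWORD=xxxxxxxx DSN=BASE
   BLERRLIMIT=50
   BLWARNLIMIT=500;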
The BULKLOAD= option will be ignored if the table being written to contains long data (BLOBs, CLOBs, or NCLOBs).
The BULKLOAD= option only applies when inserting data. Selects, updates, and deletes will be handled normally even when BULKLOAD= is specified.
Errors and warnings from the data source will not be reported immediately because the data is being streamed. This means that the LIBNAME engine may continue to try to insert rows even after a failure occurs and data source specific details about any problems will be in the server log rather than in the client log.
Using bulk-load for small amounts of data is often slower than normal inserts because of the additional overhead, so check the size of the data being inserted before using BULKLOAD=. The exact threshold varies with the database, network, and servers involved, but in general you will begin to see a benefit once the data exceeds roughly a thousand rows and a megabyte in size.
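As a rough illustration of that check, the hypothetical macro below only adds BULKLOAD=YES when the input data set exceeds the thousand-row guideline; the macro name, the threshold, and the connection values are assumptions for illustration, not part of the engine.
/* Hypothetical helper: count the rows first and only enable bulk-load */
/* when the input is large enough to benefit from it.                  */
%macro assign_output(ds);
   %local nobs bulkopt;
   proc sql noprint;
      select count(*) into :nobs trimmed from &ds;
   quit;
   %if &nobs > 1000 %then %let bulkopt = BULKLOAD=YES;
   %else %let bulkopt = ;
   libname output FEDSVR &bulkopt SERVER=fsserver PORT=24141
      USER=fsuser PASSWORD=xxxxxxxx DSN=BASE;
%mend assign_output;

%assign_output(input.accounts);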
The SAS® Federation Server LIBNAME engine BULKLOAD= option speeds up the transfer of data from the client to the SAS® Federation Server, but you may also want to consider using database-specific bulk-load options when creating your SAS® Federation Server data services to further improve performance.
Normal insert performance consists of the time the client uses to generate and supply the data to the server plus the time it takes the server to insert into the data source. Streaming the data from the client to the server allows bulk-load to overlap the operations and reduce the round-trip latency. Write speeds will always be limited by the data source, but bulk-load can help minimize the time by making sure that the client doesn’t have to wait while writing data.