
Loading SAS data to Amazon Redshift…Don’t run it too quickly!


With the global trend of moving applications and data to the cloud, SAS customers may also have to move their on-premises data to cloud storage or cloud databases. If the cloud is AWS, then Redshift is a natural candidate: Amazon Redshift is a massively parallel data warehousing database that is quick and easy to spin up.

 

SAS can work with Redshift data very efficiently, whether it is for loading data into Redshift, extracting data from Redshift or processing data inside Redshift. Check Stephen Foerster’s article for an overview of SAS integration with Redshift. In this article, I’ll focus on loading data from SAS (9.4 or SPRE) to Redshift, which could be a typical customer scenario when one wants to progressively move or migrate their data to the cloud.

 

So, don’t go too fast! Don’t rush to your keyboard just yet!

 

There are various options for loading SAS data into Redshift, so you might want to evaluate them before clicking on the running man. I can guarantee that some of them are well worth it.

 

Setup

Assume you have a 15-million-row LINEORDER table to load into Redshift. You first set up two libraries:

 

libname myrs redshift server="<Your-Redshift-server>"
                      database="<Your-Redshift-database>"
                      schema="<Your-Redshift-schema>"
                      user="<Your-Redshift-user>"
                      password="<Your-Redshift-password>" ;
libname local "~/data" ;

Then, you create the target table structure:

proc sql ;
   connect using myrs as myrs_pt ;
   execute(
      CREATE TABLE lineorder 
      (
        lo_orderkey          INTEGER NOT NULL,
        lo_linenumber        INTEGER NOT NULL,
        lo_custkey           INTEGER NOT NULL,
        lo_partkey           INTEGER NOT NULL,
        lo_suppkey           INTEGER NOT NULL,
        lo_orderdate         INTEGER NOT NULL,
        lo_orderpriority     VARCHAR(15) NOT NULL,
        lo_shippriority      VARCHAR(1) NOT NULL,
        lo_quantity          INTEGER NOT NULL,
        lo_extendedprice     INTEGER NOT NULL,
        lo_ordertotalprice   INTEGER NOT NULL,
        lo_discount          INTEGER NOT NULL,
        lo_revenue           INTEGER NOT NULL,
        lo_supplycost        INTEGER NOT NULL,
        lo_tax               INTEGER NOT NULL,
        lo_commitdate        INTEGER NOT NULL,
        lo_shipmode          VARCHAR(10) NOT NULL
      ) ;
   ) by myrs_pt ;
   disconnect from myrs_pt ;
quit ;

 

The first try

Now you are ready to load data into the empty Redshift table. Let’s run a basic PROC APPEND, but NOT on the entire table just yet.

86    proc append base=myrs.lineorder
87                data=local.lineorder(obs=50000) ;
88    run ;
NOTE: Appending LOCAL.LINEORDER to MYRS.LINEORDER.
NOTE: There were 50000 observations read from the data set LOCAL.LINEORDER.
NOTE: 50000 observations added.
NOTE: The data set MYRS.LINEORDER has . observations and 17 variables.
NOTE: PROCEDURE APPEND used (Total process time):
      real time           54.37 seconds
      cpu time            1.82 seconds

That works. But at 54 seconds for 50,000 records, loading the entire 15-million-row table would take about 4.5 hours (300 batches of 50,000 rows × 54 seconds ≈ 16,200 seconds). There must be a better solution.

 

Buffers, buffers, buffers!

SAS/ACCESS has been providing options to control various buffers for years: READBUFF, INSERTBUFF, UPDATEBUFF.
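
As a reminder, these buffer options can be set either on the LIBNAME statement (they then apply to every table in the library) or as data set options on a single table reference. Here is a minimal sketch reusing the myrs library defined above; the values shown are placeholders, not tuned recommendations:

/* As a data set option, for one step only */
data work.sample ;
   set myrs.lineorder(readbuff=1000 obs=100) ;   /* fetch 1000 rows per round trip */
run ;

/* Or once, on the LIBNAME statement */
libname myrs redshift server="<Your-Redshift-server>"
                      database="<Your-Redshift-database>"
                      schema="<Your-Redshift-schema>"
                      user="<Your-Redshift-user>"
                      password="<Your-Redshift-password>"
                      readbuff=1000       /* rows fetched per round trip when reading */
                      insertbuff=1000 ;   /* rows sent per round trip when inserting  */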

 

What is the default INSERTBUFF value for Amazon Redshift? And can I expect gains if I increase this buffer? Let’s find out.

 

How do you find the default INSERTBUFF value?

 

Run this piece of code and look for INSERTBUFF in the SAS log.

options sastrace=",,d," sastraceloc=saslog ;
proc append base=myrs.lineorder
            data=local.lineorder(obs=1) ;
run ;
options sastrace=off ;

SAS log (partial):

 

REDSHIFT: Autoguess INSERTBUFF = 250 20359 1571671882 no_name 0 APPEND  
REDSHIFT: Enter setinsertbuff, table is LINEORDER, numrows = 250, statement 0, connection 2 20362 1571671882 no_name 0 APPEND 

250 is the default value for Redshift. Let’s try different values like this:

 

135   proc append base=myrs.lineorder(insertbuff=4096)
136               data=local.lineorder(obs=50000) ;
137   run ;
NOTE: Appending LOCAL.LINEORDER to MYRS.LINEORDER.
NOTE: There were 50000 observations read from the data set LOCAL.LINEORDER.
NOTE: 50000 observations added.
NOTE: The data set MYRS.LINEORDER has . observations and 17 variables.
NOTE: PROCEDURE APPEND used (Total process time):
      real time           3.80 seconds
      cpu time            0.43 seconds

Results:

 

[Figure: run-time comparison of the INSERTBUFF tests]

 

Huh? Pretty awesome: the run time dropped from about 54 seconds to under 4 seconds, roughly 93% faster!

 

Setting INSERTBUFF above 4096 gives similar run times. I can try with more data to see how it behaves:

84    proc append base=myrs.lineorder(insertbuff=32767)
85                data=local.lineorder(obs=5000000) ;
86    run ;
NOTE: Appending LOCAL.LINEORDER to MYRS.LINEORDER.
NOTE: There were 5000000 observations read from the data set LOCAL.LINEORDER.
NOTE: 5000000 observations added.
NOTE: The data set MYRS.LINEORDER has . observations and 17 variables.
NOTE: PROCEDURE APPEND used (Total process time):
      real time           5:26.41
      cpu time            39.55 seconds

5:26.41 to load 5,000,000 records into Redshift. At that rate, I can expect to load my entire 15-million-row table in about 16 minutes, a great improvement over the estimated 4.5 hours with the default settings.

 

Bulk loading, the “nec plus ultra”?

Wait. There’s also this bulk loading mechanism. Is it available with Redshift? Yes.

 

Unlike the bulk loading capabilities of some other databases, which can require additional software components (such as SQL*Loader for Oracle), Redshift bulk loading relies only on AWS S3 as a staging area for moving the data.

 

When bulk loading is active, SAS exports the SAS data set as a set of text files (.dat extension) using a default delimiter (the bell character), uploads them to AWS S3 using the AWS S3 API, and finally runs a Redshift COPY command to load the text files into an existing Redshift table. This is particularly efficient.

 

[Figure: bulk loading flow: SAS exports text files, stages them in AWS S3, then Redshift loads them with COPY]

From a code perspective, it looks like this:

84    proc append base=myrs.lineorder(bulkload=yes
84  ! bl_bucket="sas-viyadeploymentworkshop/gel/aws_data_management/redshift/temp_bulk_loading")
85                data=local.lineorder(obs=5000000) ;
86    run ;
NOTE: Appending LOCAL.LINEORDER to MYRS.LINEORDER.
NOTE: There were 5000000 observations read from the data set LOCAL.LINEORDER.
NOTE: 5000000 observations added.
NOTE: The data set MYRS.LINEORDER has . observations and 17 variables.
  62 1571739581 no_name 0 APPEND  
REDSHIFT_15: Executed: on connection 2 63 1571739581 no_name 0 APPEND  
copy "public".LINEORDER ("lo_orderkey","lo_linenumber","lo_custkey","lo_partkey","lo_suppkey","lo_orderdate","lo_orderpriority","lo_shippriority","lo_quantity","lo_extendedprice","lo_ordertotalprice","lo_discount","lo_revenue","lo_supplycost","lo_tax","lo_commitdate","lo_shipmode") FROM 's3://sas-viyadeploymentworkshop/gel/aws_data_management/redshift/temp_bulk_loading/SASRSBL_1AC33569-9638-3B47-9714-E1F5D307B619.manifest' ACCESS_KEY_ID '' SECRET_ACCESS_KEY '' SESSION_TOKEN '' DELIMITER '\007' MANIFEST  REGION 'us-east-1'  64 1571739581 no_name 0 APPEND  
  65 1571739581 no_name 0 APPEND  
NOTE: PROCEDURE APPEND used (Total process time):
      real time           31.91 seconds
      cpu time            20.52 seconds

Wow! The run time was divided by 10 again, about 90% faster than using INSERTBUFF. At that rate, I can expect to load my entire table in about 1:30 (90 seconds). Amazing.

 

In order to use the Redshift bulk loading feature, you need to have your AWS keys properly set up. In my case, the AWS keys were defined in my user profile, under a .aws sub-directory. SAS picks them up automatically, so I only had to specify the target AWS S3 bucket for the temporary files, using the BL_BUCKET option. You might have to use additional bulk loading options to point to your AWS profile or config file, or to set the AWS keys directly in the program (though that is not recommended). For more information, check the documentation.
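
Putting it together, a full-table bulk load would look like the sketch below. It assumes, as in my environment, that the AWS keys are picked up from the ~/.aws files, so only the staging bucket (a placeholder here) needs to be specified:

/* AWS credentials are read from the ~/.aws files on the SAS machine,   */
/* so only the S3 staging location is specified in the code.            */
proc append base=myrs.lineorder(bulkload=yes
                                bl_bucket="<your-s3-bucket>/<your-staging-folder>")
            data=local.lineorder ;   /* the whole 15-million-row table this time */
run ;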

 

Recap

[Figure: recap of run times: default, INSERTBUFF, and bulk loading]

So, from the default behavior to the bulk load test, I was able to reduce the run time by 99.4%.

 

The environment used

The numbers observed were measured in a particularly favorable environment for that scenario since SAS Viya (and SPRE) was deployed on AWS EC2 machines residing in the same AWS region as the Redshift single-node cluster. Thus, the bandwidth between SAS, S3 and Redshift was very high.

 

This is probably not the use case I mentioned earlier when I was talking about potential customers moving/migrating their data from on-premises to the cloud. However, it shows how efficient SAS can be with Redshift data when SAS is deployed in AWS, which is a scenario that will become very common, if it’s not already the case today.

 

Notice that the Redshift bulk loading principle (using S3 as a staging area between SAS and Redshift) applies to many data loading/unloading situations with SAS, even though in this article I focused only on loading data from SAS to Redshift using a SAS engine:

  • SAS can also “bulk unload” Redshift data to accelerate reading Redshift data in SAS (see the sketch after this list)
  • CAS can also “bulk load” Redshift data (in CAS language, a SAVE CASDATA) and, starting with Viya 3.5, will be able to “bulk unload” Redshift data (in CAS language, a LOAD CASDATA)
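
For the first bullet, the read path is driven by a bulk-unload option on the Redshift engine. Here is a hedged sketch of what that looks like; the BULKUNLOAD and BL_BUCKET data set options are my reading of the SAS/ACCESS to Redshift documentation, so double-check the exact names for your release:

/* Pull the Redshift table back into SAS, staging the data through S3  */
/* (Redshift UNLOAD) instead of fetching it row by row over the wire.  */
data local.lineorder_copy ;
   set myrs.lineorder(bulkunload=yes
                      bl_bucket="<your-s3-bucket>/<your-staging-folder>") ;
run ;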

 

Additional considerations

To simulate moving data from on-premises to the cloud, I took an extreme case: loading SAS data from a SAS installation on my laptop in France (not a great idea when the Redshift instance is in Northern Virginia) over a poor upload bandwidth. The resulting timings are not as impressive as before (with SAS deployed in AWS), but they give you an idea of how the different options compare.

[Figure: run times for the same tests with SAS running on-premises over limited bandwidth]

In this context, an additional option can be very useful when bandwidth is limited: BL_COMPRESS compresses the data files in gzip format on the SAS machine before moving them to AWS S3. This is more CPU intensive on the SAS side, but here it made the bulk load around 2.5 times faster, and about 5 times faster than the default loading option.
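
In practice, that just means adding BL_COMPRESS to the bulk load options used earlier; a minimal sketch:

/* Gzip the staged .dat files on the SAS machine before the S3 upload, */
/* trading local CPU time for network transfer time.                   */
proc append base=myrs.lineorder(bulkload=yes
                                bl_compress=yes
                                bl_bucket="<your-s3-bucket>/<your-staging-folder>")
            data=local.lineorder ;
run ;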

 

Thanks for reading.

Comments

Nice benchmark, Nicolas! I'll point to it in the next edition of my book. Benchmarked comparisons are the best; without figures to back them, performance discussions are just fluff. 🙂

Are all of the examples done with SAS/ACCESS for Redshift installed, or can they be done with SAS/ACCESS for ODBC? I used the first two methods successfully with smaller amounts of data with SAS/ACCESS for ODBC, but not the bulk load.

Thanks for your message, @BarryParish2.

No, you won't get the same capabilities between SAS/ACCESS to Redshift and SAS/ACCESS to ODBC. Although there seems to be a BULKLOAD option in SAS/ACCESS to ODBC, it is not equivalent to the SAS/ACCESS to Redshift bulk load option, which relies on S3.
