
SAS & Amazon Redshift Series: Save CAS data in Redshift


In this series, I want to share more general information about the integration between SAS and Redshift. After loading Redshift data in CAS, let’s move forward with the various writing methods available to save CAS data into Redshift. In a SAS Viya data life cycle, data is loaded into CAS, analyzed, and possibly updated, producing new data sets along the way. Optionally, you might want to save those results back in Redshift. That’s where saving CAS tables comes into play.
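
Before saving anything back, you need a CAS table to work with. Here is a minimal sketch of that prerequisite step, assuming a hypothetical WORK.ORDERS SAS data set loaded into the CASUSER caslib:

/* Start a CAS session and load a SAS data set into CAS */
cas mysession ;
proc casutil outcaslib="casuser" ;
   load data=work.orders casout="orders" replace ;
quit ;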

Standard Saving

Without any options specified, a CAS table can be saved directly to Redshift. But, as mentioned in a previous article, Redshift inserts are slow, so it is best to avoid this method when writing data to Redshift.

A standard save gathers the CAS data on the CAS controller before inserting it into a Redshift table.

 

[Figure: standard save, with CAS data gathered on the CAS controller and inserted into Redshift]

 

Here is the syntax of a standard save:

 

caslib rs datasource=(srctype="redshift" server="myredshift.amazonaws.com"
                      database="mydb" schema="public" user="myuser"
                      password="XXXXXX") ;
proc casutil incaslib="casuser" outcaslib="rs" ;
   save casdata="orders" casout="orders" replace ;
quit ;
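
To confirm that the table landed in Redshift, you can list what the Redshift caslib sees. A quick sanity check, using the rs caslib defined above:

/* List the tables visible through the Redshift caslib */
proc casutil incaslib="rs" ;
   list files ;
quit ;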

Multi-Node Saving

Multi-node saving improves on the standard save by making the CAS workers communicate directly with the database. Each node is responsible for inserting its local portion of the data into Redshift.

 

[Figure: multi-node save, with each CAS worker inserting its data slice directly into Redshift]

 

Multi-node saving is driven by the CASLIB NUMWRITENODES option, which specifies how many CAS nodes you want to use in parallel for saving a CAS table into a Redshift database.

 

caslib rs datasource=(srctype="redshift" server="myredshift.amazonaws.com"
                      database="mydb" schema="public" user="myuser"
                      password="XXXXXX" numwritenodes=10) ;
proc casutil incaslib="casuser" outcaslib="rs" ;
   save casdata="orders" casout="orders" replace ;
quit ;

“Bulk Loading” Saving

Because of the poor Redshift insert performance, using Redshift bulk loading (available since SAS Viya 3.4) when saving a CAS table is the fastest way to write data. Unlike the bulk unloading feature, which is supported only in MPP multi-node mode or in SMP, bulk loading works whether your CASLIB is defined with NUMWRITENODES=1 (the default, which saves from the CAS controller) or with NUMWRITENODES=0 or greater than 1 (multi-node saving).

 

The following picture illustrates the “single-node” scenario, where CAS data is gathered on the CAS controller, transferred to S3, and loaded into Redshift using the Redshift COPY statement.

 

[Figure: bulk-load save, with CAS data gathered on the CAS controller, staged in S3, then loaded into Redshift with COPY]

 

In terms of code, the options are pretty similar to those used for loading data in CAS. NUMWRITENODES=1 means that only the CAS controller will communicate with S3.

 

caslib rs datasource=(srctype="redshift" server="myredshift.amazonaws.com"
                      database="mydb" schema="public" user="myuser"
                      password="XXXXXX" numwritenodes=1) ;

proc casutil incaslib="rs" outcaslib="rs" ;
   save casdata="orders" casout="orders" replace
      options=(
         bulkload=true
         awsconfig="/opt/sas/viya/config/data/AWSData/config"
         credentialsfile="/opt/sas/viya/config/data/AWSData/credentials"
         bucket="mybucket/redshift_bulk_loading"
      ) ;
quit ;
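
The AWSCONFIG= and CREDENTIALSFILE= options point to files that follow the standard AWS shared config/credentials layout. A minimal sketch of their contents; the profile name, region, and key values below are placeholder assumptions:

/opt/sas/viya/config/data/AWSData/config:

[default]
region = us-east-1

/opt/sas/viya/config/data/AWSData/credentials:

[default]
aws_access_key_id = XXXXXXXXXXXXXXXXXXXX
aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

The region should match the S3 bucket's region, and the keys need permission to write to the bucket.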

Multi-Node Saving (CAS) and Bulk Loading (Redshift)

This is a variant of the previous scenario, where you want to take advantage of the multiple CAS workers to stage their portions of CAS data in parallel, directly into S3, before loading the resulting files from S3 into Redshift using the COPY statement. This method avoids gathering the data on the CAS controller.

 

[Figure: multi-node save with bulk loading, where each CAS worker stages its data slice in S3 before Redshift loads it with COPY]

 

Compared to the previous example, only the NUMWRITENODES value is different. A value of 0 means all CAS workers will upload their data to S3.

 

caslib rs datasource=(srctype="redshift" server="myredshift.amazonaws.com"
                      database="mydb" schema="public" user="myuser"
                      password="XXXXXX" numwritenodes=0) ;

proc casutil incaslib="rs" outcaslib="rs" ;
   save casdata="orders" casout="orders" replace
      options=(
         bulkload=true
         awsconfig="/opt/sas/viya/config/data/AWSData/config"
         credentialsfile="/opt/sas/viya/config/data/AWSData/credentials"
         bucket="mybucket/redshift_bulk_loading"
      ) ;
quit ;
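
One way to sanity-check any of these saves is to read the Redshift table back into CAS and inspect it. A minimal sketch, where the orders_check output name is just for illustration:

/* Load the saved Redshift table back into CAS and inspect it */
proc casutil incaslib="rs" outcaslib="casuser" ;
   load casdata="orders" casout="orders_check" replace ;
   contents casdata="orders_check" incaslib="casuser" ;
quit ;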

 

This article is an extract from the SAS Global Forum paper SAS® and Amazon Redshift: Overview of Current Capabilities.

 

Thanks for reading.

