
Redshift CASLIB for write access: make it right the first time


If you work with Amazon Redshift data from SAS Viya and CAS and plan to allow SAS users to save CAS data in Redshift, then you might want to read this blog. Indeed, saving a CAS table to Redshift requires special attention with respect to the CASLIB definition. If you just do a basic SAS/ACCESS Interface to Amazon Redshift configuration, for a POC or a real implementation, you might quickly get a call from your users or your prospect/customer: “How did you configure the access to Redshift? Saving data to Redshift is terribly slow! Seriously, this is not usable!”

 

Well, if that happened, they wouldn’t be wrong.

 

But that’s not SAS’ fault. As already mentioned in a previous article, Redshift inserts are slow. Very slow. What you want to avoid is to make SAS users feel the slowness comes from SAS.

 

Bulk Loading (CAS -> Redshift) and Bulk Unloading (Redshift -> CAS) are very important capabilities of SAS/ACCESS Interface to Amazon Redshift. They were introduced in SAS Viya 3.4 and SAS Viya 3.5, respectively.

 

While Bulk Unloading could be seen as optional when reading Redshift data into CAS (because read performance is already great without it), Bulk Loading is mandatory if you want to save data from CAS to Redshift.

 

Let’s take an example. If you define the following CASLIB:

 

caslib rs datasource=(srctype="redshift" server="myredshift.amazonaws.com"
 database="mydb" schema="public" user="myuser" password="XXXXXX") ;

 

Then, when you try to save some data from CAS to Redshift, you might observe this:

 

82   proc casutil incaslib="rs" outcaslib="rs" ;
NOTE: The UUID '496b3fdb-bca4-164e-8d60-d679b9867868' is connected using session MYSESSION.
83    save casdata="prdsale_cas" casout="prdsale_cas" replace ;
NOTE: Performing serial SaveTable action using SAS Data Connector to Redshift.
NOTE: Cloud Analytic Services saved the file prdsale_cas in caslib RS.
NOTE: The Cloud Analytic Services server processed the request in 283.45948 seconds.
84   quit ;
NOTE: PROCEDURE CASUTIL used (Total process time):
      real time           4:43.46
      cpu time            0.28 seconds

 

Yes. 4:43.46 (283 seconds) to upload 1,440 observations and a few columns to Redshift from CAS. 5 rows per second. This is pretty bad. Can you imagine the SAS user experience when saving data to Redshift using SAS Viya web user interfaces, even with small tables?

 

Multi-node writing doesn’t help that much with this process.
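
If you are wondering what multi-node writing looks like, here is a hypothetical sketch (numwritenodes is the usual data connector option for multi-node transfers; confirm in the documentation that it applies to your Redshift connector version). Each CAS worker opens its own connection to Redshift and inserts its share of the rows, but they are still row-by-row inserts, just issued from several nodes at once, which is why the gain is limited.

/* Hypothetical multi-node variant: each CAS worker inserts its own slice of the rows. */
/* Inserts remain row-by-row on every worker, so the overall gain stays modest.        */
caslib rsmn datasource=(srctype="redshift" server="myredshift.amazonaws.com"
   database="mydb" schema="public" user="myuser" password="XXXXXX"
   numwritenodes=0) ;  /* 0 = use all CAS workers, if supported by your release */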

 

Here, you want to enable Bulk Loading.

 

Redshift Bulk Loading relies on AWS S3 as a staging area for moving the data. When bulk loading is active, CAS exports the CAS table as a set of text files (with a .dat extension) using a default delimiter (the bell character), uploads them to AWS S3 using the AWS S3 API, and finally runs a Redshift COPY command to load the text files into an existing Redshift table. This is way faster than standard inserts.
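
If you want to see what the data connector actually sends to Redshift (including the COPY command once bulk loading is enabled), the data connector trace options can help. The sketch below is from memory, so treat the option names (driver_trace, driver_tracefile) as assumptions to verify against the data connector documentation:

/* Assumption: driver_trace / driver_tracefile are accepted by the Redshift data connector */
caslib rstrace datasource=(srctype="redshift" server="myredshift.amazonaws.com"
   database="mydb" schema="public" user="myuser" password="XXXXXX"
   driver_trace="SQL"                               /* trace the SQL sent to Redshift    */
   driver_tracefile="/tmp/redshift_dc_trace.log") ; /* where the trace output is written */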

 

I know that this method has additional requirements, such as having to deal with AWS keys in a multi-user context, which might be tricky in some configurations. But using standard inserts is not an option.
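
For reference, the credentialsfile option used in the CASLIB below points to a file holding the AWS access key pair, in the standard AWS credentials format. It looks something like this (the values are placeholders):

[default]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX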

 

If you convert the previous CASLIB statement to the following one:

 

caslib rsbulk datasource=(srctype="redshift" server="myredshift.amazonaws.com"
   database="mydb" schema="public" user="myuser" password="XXXXXX"
   bulkload=true
   bucket="mybucket/redshift_bulk_loading"
   credentialsfile="/opt/sas/viya/config/data/AWSData/credentials"
   region='us-east-1') ;
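
A few notes on these options, based on my understanding: bucket can include a sub-path ("mybucket/redshift_bulk_loading") that is used purely as a staging area, credentialsfile points to the AWS key file shown earlier, and region identifies the AWS region of the staging bucket. And if you also want Bulk Unloading for reads, the same datasource options apply with bulkunload=true instead of bulkload=true (again, confirm the exact option name in the data connector documentation for your SAS Viya release).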

 

Saving the same CAS table to Redshift will look like this:

 

123  proc casutil incaslib="rsbulk" outcaslib="rsbulk" ;
NOTE: The UUID 'f7f38181-0406-ec40-bf52-f28309751e83' is connected using session MYSESSION.
124   save casdata="prdsale_cas" casout="prdsale_cas" replace ;
NOTE: Performing serial SaveTable action using SAS Data Connector to Redshift.
NOTE: Cloud Analytic Services saved the file prdsale_cas in caslib RSBULK.
NOTE: The Cloud Analytic Services server processed the request in 1.78449 seconds.
125  quit ;
NOTE: PROCEDURE CASUTIL used (Total process time):
      real time           1.79 seconds
      cpu time            0.02 seconds

 

Much better, isn’t it?
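
From roughly 283 seconds down to about 1.8 seconds on the same 1,440-row table: that is in the neighborhood of a 160x speedup.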

 

You may ask: how can I set up a Redshift CASLIB like this in the user interface (“Manage Data” or “Manage Environment -> Data”)?

 

Here it is:

 

[Image: Redshift CASLIB settings (nir_post48_redshift_caslib_settings.png)]

 

[Image: Redshift CASLIB advanced settings (nir_post48_redshift_caslib_advanced.png)]

 

If you want to know more about how SAS works with Amazon Redshift, don’t hesitate to check out my SGF 2020 paper: SAS and Amazon Redshift: Overview of Current Capabilities.

 

Thanks for reading.

Comments
Tom

We did something similar for regular SAS uploads to Redshift.  How does this method handle cases where the data contains the delimiter character?  What about data that contains end of line characters?  See:  https://communities.sas.com/t5/SASware-Ballot-Ideas/Enhancements-to-INFILE-FILE-to-handle-delimited-...

