
Redshift CASLIB for write access: make it right the first time

Started ‎04-24-2020
Modified ‎04-24-2020

If you work with Amazon Redshift data from SAS Viya and CAS, and you plan to let SAS users save CAS data in Redshift, then you might want to read this blog. Saving a CAS table to Redshift requires special attention in the CASLIB definition. If you do only the basics of the SAS/ACCESS to Redshift configuration, whether for a POC or a real implementation, you might quickly get a call from your users or your prospect/customer: “How did you configure the access to Redshift? Saving data to Redshift is terribly slow! Seriously, this is not usable!”

 

Well, if that happened, they wouldn’t be wrong.

 

But that’s not SAS’ fault. As already mentioned in a previous article, Redshift inserts are slow. Very slow. What you want to avoid is making SAS users feel that the slowness comes from SAS.

 

Bulk Loading (CAS -> Redshift) and Bulk Unloading (Redshift -> CAS) are very important capabilities of SAS/ACCESS Interface to Amazon Redshift. They were introduced in SAS Viya 3.4 and 3.5, respectively.

 

While Bulk Unloading can be seen as optional when it comes to reading Redshift data from CAS (because CAS reading performance is already good without it), Bulk Loading is mandatory if you want to save data from CAS to Redshift.
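For reference, Bulk Unloading is enabled with a similar set of CASLIB options to the bulk-loading example shown later in this post. The sketch below is illustrative; the server, bucket, and file paths are placeholders, and you should confirm the exact option names against the SAS/ACCESS to Redshift documentation for your Viya release:

```sas
/* Illustrative only: a Redshift CASLIB with Bulk Unloading enabled.
   Reads go through an S3 staging area instead of row-by-row fetches. */
caslib rsunload datasource=(srctype="redshift" server="myredshift.amazonaws.com"
   database="mydb" schema="public" user="myuser" password="XXXXXX"
   bulkunload=true
   bucket="mybucket/redshift_bulk_unloading"
   credentialsfile="/opt/sas/viya/config/data/AWSData/credentials"
   region="us-east-1") ;
```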

 

Let’s take an example. If you define the following CASLIB:

 

caslib rs datasource=(srctype="redshift" server="myredshift.amazonaws.com"
 database="mydb" schema="public" user="myuser" password="XXXXXX") ;

 

And try to save some data from CAS to Redshift, you might observe this:

 

82   proc casutil incaslib="rs" outcaslib="rs" ;
NOTE: The UUID '496b3fdb-bca4-164e-8d60-d679b9867868' is connected using session MYSESSION.
83    save casdata="prdsale_cas" casout="prdsale_cas" replace ;
NOTE: Performing serial SaveTable action using SAS Data Connector to Redshift.
NOTE: Cloud Analytic Services saved the file prdsale_cas in caslib RS.
NOTE: The Cloud Analytic Services server processed the request in 283.45948 seconds.
84   quit ;
NOTE: PROCEDURE CASUTIL used (Total process time):
      real time           4:43.46
      cpu time            0.28 seconds

 

Yes. 4:43.46 (283 seconds) to upload 1,440 observations and a few columns from CAS to Redshift. That is about 5 rows per second. This is pretty bad. Can you imagine the SAS user experience when saving data to Redshift through the SAS Viya web user interfaces, even with small tables?

 

Multi-node writing doesn’t help much with this process: every worker still performs row-by-row inserts.

 

Here, you want to enable Bulk Loading.

 

Redshift Bulk Loading relies on AWS S3 as a staging area for moving data. When bulk loading is active, CAS exports the CAS table as a set of text files (.dat extension) using a default delimiter (the bell character), uploads them to AWS S3 using the AWS S3 API, and finally runs a Redshift COPY command to load the text files into an existing Redshift table. This is far faster than standard inserts.
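Conceptually, the final step resembles a Redshift COPY statement like the one below. This is illustrative only: the bucket path, credentials, and file naming are placeholders, and the exact statement CAS generates may differ:

```sql
-- Illustrative sketch of the COPY step that loads the staged
-- text files from S3 into the target Redshift table.
COPY public.prdsale_cas
FROM 's3://mybucket/redshift_bulk_loading/'
ACCESS_KEY_ID '<your-access-key-id>'
SECRET_ACCESS_KEY '<your-secret-access-key>'
DELIMITER '\007'   -- the bell character used as the default field delimiter
REGION 'us-east-1';
```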

 

I know that this method has additional requirements, such as dealing with AWS keys in a multi-user context, which can be tricky in some configurations. But using standard inserts is not an option.
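The CREDENTIALSFILE option used in the CASLIB below points to a file in the standard AWS shared-credentials format. The key values here are placeholders; the file must be readable by the CAS server:

```ini
[default]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>
```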

 

If you convert the previous CASLIB statement to the following one:

 

caslib rsbulk datasource=(srctype="redshift" server="myredshift.amazonaws.com"
   database="mydb" schema="public" user="myuser" password="XXXXXX"
   bulkload=true
   bucket="mybucket/redshift_bulk_loading"
   credentialsfile="/opt/sas/viya/config/data/AWSData/credentials"
   region="us-east-1") ;

 

Saving the same CAS table to Redshift will look like this:

 

123  proc casutil incaslib="rsbulk" outcaslib="rsbulk" ;
NOTE: The UUID 'f7f38181-0406-ec40-bf52-f28309751e83' is connected using session MYSESSION.
124   save casdata="prdsale_cas" casout="prdsale_cas" replace ;
NOTE: Performing serial SaveTable action using SAS Data Connector to Redshift.
NOTE: Cloud Analytic Services saved the file prdsale_cas in caslib RSBULK.
NOTE: The Cloud Analytic Services server processed the request in 1.78449 seconds.
125  quit ;
NOTE: PROCEDURE CASUTIL used (Total process time):
      real time           1.79 seconds
      cpu time            0.02 seconds

 

Much better, isn’t it?

 

You may ask: how can I set up a Redshift CASLIB like this in the user interface (“Manage Data” or “Manage Environment -> Data”)?

 

Here it is:

 

nir_post48_redshift_caslib_settings-559x1024.png


 

nir_post48_redshift_caslib_advanced.png

 

If you want to know more about how SAS works with Amazon Redshift, don’t hesitate to check out my SGF 2020 paper: SAS and Amazon Redshift: Overview of Current Capabilities.

 

Thanks for reading.

Comments
Tom

We did something similar for regular SAS uploads to Redshift.  How does this method handle cases where the data contains the delimiter character?  What about data that contains end of line characters?  See:  https://communities.sas.com/t5/SASware-Ballot-Ideas/Enhancements-to-INFILE-FILE-to-handle-delimited-...

