
SAS & Amazon Redshift Series: Load Redshift data in CAS


I’ve already addressed the importance of bulk loading with Amazon Redshift in two previous articles, covering SAS 9.4 as well as CAS. In this series, I want to share more general information about the integration between SAS and Redshift. Let’s start with the various methods available for loading Redshift data into CAS.

 

To manipulate, process, and analyze your Redshift data with SAS Viya capabilities, you first need to load it into CAS.

Standard Loading

There are many programmatic ways in SAS Viya to load Redshift data into CAS, but the simplest way to start is the CASUTIL procedure, which performs a server-side load.

 

/* Standard Redshift CASLIB */
caslib rs datasource=(srctype="redshift" server="myredshift.amazonaws.com"
                      database="mydb" schema="public"
                      user="myuser" password="XXXXXX") ;

proc casutil incaslib="rs" outcaslib="rs" ;
   load casdata="lineorder" casout="lineorder" ;
quit ;

 

Server-side loading means that the CAS server accesses the Redshift data source directly: the data flows from Redshift to the CAS controller, and the CAS controller distributes it to the CAS workers (if CAS is deployed in MPP mode). Only the CAS controller needs the Redshift client (which is shipped with SAS Viya) configured in order to connect to Redshift.

 

[Figure: standard (server-side) loading from Redshift into CAS]


 

The READBUFF option, which controls how many rows are fetched from Redshift per round trip, can also help when network latency is an issue.
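
As a hedged illustration, the CASLIB above could be registered with READBUFF set explicitly. The value below is an arbitrary example, not a recommendation; tune it for your row width and network.

/* Sketch only: READBUFF raises the number of rows fetched per round trip. */
/* 10000 is an arbitrary example value; adjust for your environment.       */
caslib rs datasource=(srctype="redshift" server="myredshift.amazonaws.com"
                      database="mydb" schema="public"
                      user="myuser" password="XXXXXX"
                      readbuff=10000) ;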

Multi-Node Loading

Multi-node loading is the process where CAS workers connect concurrently to the data source to load the data in parallel. The CAS controller coordinates the extracts. All of the CAS nodes must have the Redshift client configured in order to connect directly to Redshift. CAS applies the SQL modulo function to the first suitable numeric column (unless the user names one with the SLICECOLUMN option) so that each CAS worker queries only a portion of the data. In a three-worker configuration, each CAS worker queries and gets back roughly one-third of the Redshift table (the actual split depends on how the values of that numeric column are distributed).
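
If the column CAS picks by default is poorly distributed, you can name a better one yourself. The sketch below assumes SLICECOLUMN is accepted as a data source option on the load, and lo_custkey is purely an illustrative column name.

/* Sketch: explicitly choose the slicing column for multi-node loading.  */
/* lo_custkey is illustrative; pick a well-distributed numeric column    */
/* from your own table.                                                   */
proc casutil incaslib="rs" outcaslib="rs" ;
   load casdata="lineorder" casout="lineorder"
      datasourceoptions=(slicecolumn="lo_custkey") ;
quit ;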

 

[Figure: multi-node loading from Redshift into CAS]

 

Multi-node loading is driven by the CASLIB NUMREADNODES option, which specifies how many CAS nodes you want to use for querying the Redshift database in parallel.

 

caslib rs datasource=(srctype="redshift" server="myredshift.amazonaws.com"
                      database="mydb" schema="public" user="myuser"
                      password="XXXXXX" numreadnodes=10) ;

libname casrs cas caslib="rs" ;

proc casutil incaslib="rs" outcaslib="rs" ;
   load casdata="lineorder" casout="lineorder" ;
quit ;

 

NUMREADNODES can be set to these values:

  • 0: use as many nodes as possible
  • 1: use the CAS controller only (no multi-node mode)
  • A value greater than 1: use up to the number of nodes indicated
    • NUMREADNODES=10 will use three nodes if your CAS cluster is made of three nodes
    • NUMREADNODES=10 will use 10 nodes if your CAS cluster is made of 20 nodes

These are typical SQL queries sent to Redshift when multi-node is activated (lo_orderkey is the first suitable numeric column):

 

Node 1:

select "SLICE_SQL"."lo_orderkey", ...
 from  (select "public"."lineorder"."lo_orderkey", ... from
        "public"."lineorder")  "SLICE_SQL"
 where  ( ( (MOD (ABS ("SLICE_SQL"."lo_orderkey") , 3) =0)  or "SLICE_SQL"."lo_orderkey" IS NULL) )

Node 2:

select "SLICE_SQL"."lo_orderkey", ...
 from  (select "public"."lineorder"."lo_orderkey", ... from
        "public"."lineorder")  "SLICE_SQL"
where  ( (MOD (ABS ("SLICE_SQL"."lo_orderkey") , 3) =1) )

Node 3:

select "SLICE_SQL"."lo_orderkey", ...
 from  (select "public"."lineorder"."lo_orderkey", ... from
        "public"."lineorder")  "SLICE_SQL"
where  ( (MOD (ABS ("SLICE_SQL"."lo_orderkey") , 3) =2) )
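
To capture the queries sent by each node yourself, data connector tracing can help. The sketch below assumes the DRIVER_TRACE and DRIVER_TRACEFILE data source options are available in your release; the trace file path is illustrative.

/* Sketch: trace the SQL that the CAS nodes send to Redshift.            */
/* DRIVER_TRACE / DRIVER_TRACEFILE are assumed here; check the data      */
/* connector documentation for your release. The path is illustrative.   */
caslib rs datasource=(srctype="redshift" server="myredshift.amazonaws.com"
                      database="mydb" schema="public" user="myuser"
                      password="XXXXXX" numreadnodes=10
                      driver_trace="SQL"
                      driver_tracefile="/tmp/redshift_trace.log") ;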

Bulk Unloading (Redshift) and Multi-Node Loading (CAS)

This method bulk unloads the Redshift table to AWS S3 (new in SAS Viya 3.5, using the Redshift UNLOAD command under the covers) and then loads the resulting files into CAS in parallel.

 

[Figure: bulk unloading from Redshift to S3 combined with multi-node loading into CAS]

 

Bulk unloading the Redshift table is supported when CAS runs in SMP mode, or in MPP mode with multi-node loading enabled; it is not supported in MPP mode without multi-node loading. The BULKUNLOAD option therefore works together with NUMREADNODES (set to 0 or to a value greater than 1).

 

caslib rs datasource=(srctype="redshift" server="myredshift.amazonaws.com"
                      database="mydb" schema="public" user="myuser"
                      password="XXXXXX" numreadnodes=0) ;

proc casutil incaslib="rs" outcaslib="rs" ;
   load casdata="lineorder" casout="lineorder"
      datasourceoptions=(
         bulkunload=true
         awsconfig="/opt/sas/viya/config/data/AWSData/config"
         credentialsfile="/opt/sas/viya/config/data/AWSData/credentials"
         bucket="mybucket/redshift_bulk_loading"
      ) ;
quit ;

 

This article is an extract of the SAS Global Forum paper, SAS and Amazon Redshift: Overview of Current Capabilities.

 

Thanks for reading.
