
SAS & Amazon Redshift Series: Process Redshift data in place from SAS (9.4 or Compute Server)

Started 06-18-2020 · Modified 06-18-2020 · Views 2,539

In this series, I want to share more general information about the integration between SAS and Redshift. After loading Redshift data into CAS, saving CAS data in Redshift, and processing Redshift data in place from CAS, let's conclude the series by exploring the capabilities SAS offers when you use SAS 9.4 or the SAS Compute Server in SAS Viya and want to process Redshift data in place, without moving it.

When you have data residing in Redshift, you probably want to analyze or manipulate it in different ways to build the customized data sets that your analytics processes require. You can certainly move the data from Redshift to SAS and use the full stack of SAS capabilities. But you might want to know about the SAS® In-Database features that enable SAS users to work with Redshift data transparently, without moving it. This article focuses on efficiently processing Redshift data in place, minimizing data movement between SAS and Redshift.

 

 

Implicit Pass-Through

Implicit pass-through is the process of translating SAS code into equivalent data source-specific SQL code so that it can be passed directly to the data source for processing. Implicit pass-through improves query response time and enhances security. Many SAS users work with SAS and third-party data using the SQL language, through the SAS SQL procedure. If you don't specify any options, SAS tries to push as much of the processing as it can to the database. For example, consider the following SQL code against Redshift data:

 

/* Standard Redshift library */
libname myrs redshift server="myredshift.amazonaws.com"
                      database="mydb" schema="public"
                      user="myuser" password="XXXXXX" ;

/* SAS Library */
libname local "~/data" ;

proc sql ;
   create table local.orders as select *
   from myrs.part p, myrs.supplier s, myrs.customer c, myrs.dwdate d, myrs.lineorder lo
   where p.p_partkey=lo.lo_partkey and
         s.s_suppkey=lo.lo_suppkey and
         c.c_custkey=lo.lo_custkey and
         d.d_datekey=lo.lo_orderdate ;
quit ;

 

In this code, the user wants to join multiple tables from Redshift and store the result in a SAS table. But where does this join occur? In SAS, after all of the tables have been downloaded from Redshift, or in Redshift? If you set the right options before running your code, the SAS log gives you useful information about how your SAS code is transformed into database code:

 

options sastrace=',,,d' sastraceloc=saslog nostsuffix msglevel=i ;

 

This code returns the following:

 

REDSHIFT_31: Prepared: on connection 0
 select p."p_partkey", p."p_name", p."p_mfgr", p."p_category", p."p_brand1", p."p_color", p."p_type", p."p_size", p."p_container", 
s."s_suppkey", s."s_name", s."s_address", s."s_city", s."s_nation", s."s_region", s."s_phone", c."c_custkey", c."c_name", 
c."c_address", c."c_city", c."c_nation", c."c_region", c."c_phone", c."c_mktsegment", d."d_datekey", d."d_date", d."d_dayofweek", 
d."d_month", d."d_year", d."d_yearmonthnum", d."d_yearmonth", d."d_daynuminweek", d."d_daynuminmonth", d."d_daynuminyear", 
d."d_monthnuminyear", d."d_weeknuminyear", d."d_sellingseason", d."d_lastdayinweekfl", d."d_lastdayinmonthfl", d."d_holidayfl", 
d."d_weekdayfl", lo."lo_orderkey", lo."lo_linenumber", lo."lo_custkey", lo."lo_partkey", lo."lo_suppkey", lo."lo_orderdate", 
lo."lo_orderpriority", lo."lo_shippriority", lo."lo_quantity", lo."lo_extendedprice", lo."lo_ordertotalprice", lo."lo_discount", 
lo."lo_revenue", lo."lo_supplycost", lo."lo_tax", lo."lo_commitdate", lo."lo_shipmode" from "public".PART p, "public".SUPPLIER s, 
"public".CUSTOMER c, "public".DWDATE d, "public".LINEORDER lo where (p."p_partkey" = lo."lo_partkey") and (s."s_suppkey" = 
lo."lo_suppkey") and (c."c_custkey" = lo."lo_custkey") and (d."d_datekey" = lo."lo_orderdate")
 
 
REDSHIFT_32: Executed: on connection 0
Prepared statement REDSHIFT_31
 
ACCESS ENGINE:  SQL statement was passed to the DBMS for fetching data.

 

In this case, the join happens in Redshift and the resulting data is fetched into SAS because the user asked for it. Now consider the same code, but instead of creating a SAS table, the user wants to create a Redshift table:

 

proc sql ;
   create table myrs.orders as select *
   from myrs.part p, myrs.supplier s, myrs.customer c, myrs.dwdate d, myrs.lineorder lo
   where p.p_partkey=lo.lo_partkey and
         s.s_suppkey=lo.lo_suppkey and
         c.c_custkey=lo.lo_custkey and
         d.d_datekey=lo.lo_orderdate ;
quit ;

 

This code returns the following:

 

REDSHIFT_44: Executed: on connection 1
CREATE  TABLE 
"public".orders(p_partkey,p_name,p_mfgr,p_category,p_brand1,p_color,p_type,p_size,p_container,s_suppkey,s_name,s_address,s_city,s_na
tion,s_region,s_phone,c_custkey,c_name,c_address,c_city,c_nation,c_region,c_phone,c_mktsegment,d_datekey,d_date,d_dayofweek,d_month,
d_year,d_yearmonthnum,d_yearmonth,d_daynuminweek,d_daynuminmonth,d_daynuminyear,d_monthnuminyear,d_weeknuminyear,d_sellingseason,d_l
astdayinweekfl,d_lastdayinmonthfl,d_holidayfl,d_weekdayfl,lo_orderkey,lo_linenumber,lo_custkey,lo_partkey,lo_suppkey,lo_orderdate,lo
_orderpriority,lo_shippriority,lo_quantity,lo_extendedprice,lo_ordertotalprice,lo_discount,lo_revenue,lo_supplycost,lo_tax,lo_commit
date,lo_shipmode) as ( select p."p_partkey", p."p_name", p."p_mfgr", p."p_category", p."p_brand1", p."p_color", p."p_type", 
p."p_size", p."p_container", s."s_suppkey", s."s_name", s."s_address", s."s_city", s."s_nation", s."s_region", s."s_phone", 
c."c_custkey", c."c_name", c."c_address", c."c_city", c."c_nation", c."c_region", c."c_phone", c."c_mktsegment", d."d_datekey", 
d."d_date", d."d_dayofweek", d."d_month", d."d_year", d."d_yearmonthnum", d."d_yearmonth", d."d_daynuminweek", d."d_daynuminmonth", 
d."d_daynuminyear", d."d_monthnuminyear", d."d_weeknuminyear", d."d_sellingseason", d."d_lastdayinweekfl", d."d_lastdayinmonthfl", 
d."d_holidayfl", d."d_weekdayfl", lo."lo_orderkey", lo."lo_linenumber", lo."lo_custkey", lo."lo_partkey", lo."lo_suppkey", 
lo."lo_orderdate", lo."lo_orderpriority", lo."lo_shippriority", lo."lo_quantity", lo."lo_extendedprice", lo."lo_ordertotalprice", 
lo."lo_discount", lo."lo_revenue", lo."lo_supplycost", lo."lo_tax", lo."lo_commitdate", lo."lo_shipmode" from "public".PART p, 
"public".SUPPLIER s, "public".CUSTOMER c, "public".DWDATE d, "public".LINEORDER lo where (p."p_partkey" = lo."lo_partkey") and 
(s."s_suppkey" = lo."lo_suppkey") and (c."c_custkey" = lo."lo_custkey") and (d."d_datekey" = lo."lo_orderdate") )

 

This time, the whole process occurs in Redshift, with no data movement between Redshift and SAS.

Implicit pass-through is also possible for multi-table operations involving multiple librefs. However, for this to work, these LIBNAME properties must match exactly across all of the Redshift libraries: user ID, password, server, database, port, and the SQL_FUNCTIONS option value. See “Passing Joins to Amazon Redshift” in the SAS/ACCESS for Relational Databases documentation.

In addition to SQL implicit pass-through, note that the DATA step allows some parts of its processing, especially WHERE filtering conditions, to be sent implicitly to the database for execution. What is interesting is that SAS automatically translates SAS functions into the corresponding Redshift functions.
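As a sketch of the multi-libref case (the schema names "sales" and "reference" are illustrative, not from the original example), two Redshift librefs whose server, database, user, password, and port all match can still produce a single pushed-down join:

```sas
/* Two librefs on the same Redshift database, different schemas.
   Because the connection properties match exactly, the cross-libref
   join below can still be passed to Redshift as one SQL statement. */
libname sales redshift server="myredshift.amazonaws.com"
                       database="mydb" schema="sales"
                       user="myuser" password="XXXXXX" ;

libname ref   redshift server="myredshift.amazonaws.com"
                       database="mydb" schema="reference"
                       user="myuser" password="XXXXXX" ;

proc sql ;
   create table work.joined as
   select *
   from sales.lineorder lo, ref.dwdate d
   where d.d_datekey = lo.lo_orderdate ;
quit ;
```

With SASTRACE enabled, the log would show whether the join was prepared as a single statement on one connection or split because the connection properties differed.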

 

82   data promos ;
83      set myrs.orders(where=(index(p_type,"PROMO")>0)) ;
84   run ;
 
REDSHIFT_48: Prepared: on connection 0
SELECT  "p_type", "p_partkey", "p_name", "p_mfgr", "p_category", "p_brand1", "p_color", "p_size", "p_container", "s_suppkey", 
"s_name", "s_address", "s_city", "s_nation", "s_region", "s_phone", "c_custkey", "c_name", "c_address", "c_city", "c_nation", 
"c_region", "c_phone", "c_mktsegment", "d_datekey", "d_date", "d_dayofweek", "d_month", "d_year", "d_yearmonthnum", "d_yearmonth", 
"d_daynuminweek", "d_daynuminmonth", "d_daynuminyear", "d_monthnuminyear", "d_weeknuminyear", "d_sellingseason", 
"d_lastdayinweekfl", "d_lastdayinmonthfl", "d_holidayfl", "d_weekdayfl", "lo_orderkey", "lo_linenumber", "lo_custkey", 
"lo_partkey", "lo_suppkey", "lo_orderdate", "lo_orderpriority", "lo_shippriority", "lo_quantity", "lo_extendedprice", 
"lo_ordertotalprice", "lo_discount", "lo_revenue", "lo_supplycost", "lo_tax", "lo_commitdate", "lo_shipmode"  FROM "public".ORDERS  
WHERE  ( STRPOS( "p_type", 'PROMO') > 0 )
 
 
REDSHIFT_49: Executed: on connection 0
Prepared statement REDSHIFT_48
 
NOTE: There were 4174 observations read from the data set MYRS.ORDERS.
      WHERE INDEX(p_type, 'PROMO')>0;
NOTE: The data set WORK.PROMOS has 4174 observations and 58 variables.
NOTE: DATA statement used (Total process time):
      real time           4.58 seconds
      cpu time            0.06 seconds

 

See “Passing SAS Functions to Amazon Redshift” in the SAS/ACCESS for Relational Databases documentation for more information about the SAS functions that can be translated into Redshift functions. Implicit pass-through is a powerful optimization technique when you don't know the SQL specifics of the accessed database in detail. It is also useful in tools such as SAS® Enterprise Guide® or SAS® Studio, because they mostly generate standard SAS SQL code. Implicit pass-through is driven by these options, which are enabled by default:

  • DIRECT_EXE | DBIDIRECTEXEC
  • DIRECT_SQL
  • SQL_FUNCTIONS
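As a hedged sketch of how you might experiment with these options (the libref name myrs_np is made up; DIRECT_SQL= is a LIBNAME option, while DBIDIRECTEXEC is a system option):

```sas
/* Disable implicit pass-through to compare behavior in the SASTRACE log. */
options nodbidirectexec ;   /* CREATE TABLE AS SELECT is no longer pushed down */

libname myrs_np redshift server="myredshift.amazonaws.com"
                         database="mydb" schema="public"
                         user="myuser" password="XXXXXX"
                         direct_sql=no ;   /* disable implicit pass-through */

/* Re-running the earlier join against myrs_np would show SAS fetching
   the rows and performing the join itself rather than in Redshift. */

options dbidirectexec ;     /* restore the default */
```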

 

Explicit Pass-Through

Of course, you can still use explicit pass-through if you want to be in total control of the SQL code that is submitted to Redshift. This is done through the use of the SQL procedure with some special statements:

 

/* Explicit pass-through - explicit credentials - select data */
proc sql ;
   connect to redshift as myrs_pt(server="myredshift.amazonaws.com"
                                  database="mydb" user="myuser"
                                  password="XXXXXX") ;
   select * from connection to myrs_pt(
      select c_nation as customer_country, s_nation as supplier_country,
             sum(lo_quantity) as qty, sum(lo_revenue) as revenue
      from public.orders
      group by c_nation, s_nation
   ) ;
   disconnect from myrs_pt ;
quit ;

/* Explicit pass-through - implicit credentials - run specific Redshift commands */
proc sql ;
   connect using myrs as myrs_pt ;
   execute (vacuum ;) by myrs_pt ;
   execute (analyze ;) by myrs_pt ;
   disconnect from myrs_pt ;
quit ;

 

To run an SQL string that returns a result set, use the SELECT ... FROM CONNECTION TO ... syntax. If your SQL string does not produce any output, use the EXECUTE statement instead. Notice that there are two ways to specify the database connection: you can either specify it explicitly (CONNECT TO REDSHIFT AS) or reuse the connection information of an existing libref (CONNECT USING).
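As a further sketch (the summary table name nation_summary is illustrative, not from the original example), the EXECUTE statement can also build a result table entirely inside Redshift, with no result set returned to SAS:

```sas
/* Create a summary table directly in Redshift via explicit pass-through.
   The SQL inside EXECUTE is sent verbatim to Redshift. */
proc sql ;
   connect using myrs as myrs_pt ;
   execute (
      create table public.nation_summary as
      select c_nation, sum(lo_revenue) as revenue
      from public.orders
      group by c_nation
   ) by myrs_pt ;
   disconnect from myrs_pt ;
quit ;
```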

 

Use FedSQL

Instead of the SQL procedure, you can use the FedSQL procedure to interact with Redshift. There can be benefits to using FedSQL instead of SQL (for example, support for the ANSI SQL:1999 standard and for additional data types). See “Benefits of FedSQL” in SAS® 9.4 FedSQL Language Reference.

 

This is an example:

 

proc fedsql iptrace ;
   create table myrs.test_fedsql
      (
        key    TINYINT NOT NULL,
        field1 BIGINT NOT NULL
      ) ;
   insert into myrs.test_fedsql values (0,123456789012345) ;
   insert into myrs.test_fedsql values (1,123456789012345) ;
   insert into myrs.test_fedsql values (2,123456789012345) ;
quit ;

 

Notice that you cannot run this code with PROC SQL, because it uses data types that PROC SQL does not support. You can, however, run it using explicit pass-through.
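A sketch of the explicit pass-through equivalent might look like the following. Note that Redshift itself does not provide a TINYINT type, so this sketch uses SMALLINT, and the table name test_passthru is illustrative:

```sas
/* The SQL inside EXECUTE is sent verbatim to Redshift, so Redshift
   data types (SMALLINT, BIGINT) can be used even though PROC SQL
   itself does not support them. */
proc sql ;
   connect using myrs as myrs_pt ;
   execute (
      create table public.test_passthru
         (key smallint not null, field1 bigint not null)
   ) by myrs_pt ;
   execute (insert into public.test_passthru values (0, 123456789012345)) by myrs_pt ;
   disconnect from myrs_pt ;
quit ;
```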

 

Run SAS Procedures In-Database

These common and useful Base SAS® procedures have been enhanced for in-database processing inside Amazon Redshift:

  • FREQ
  • RANK
  • REPORT
  • SORT
  • SUMMARY/MEANS
  • TABULATE

When you run these SAS procedures against Redshift data, most, if not all, of the complex processing is performed on the database side. This means that the SAS procedures are automatically converted into SQL code that runs close to the data, with no data movement. Only the result set, which is expected to be small, is downloaded into SAS for further processing and presentation on the client. For example, running a TABULATE procedure on a huge Redshift table triggers an SQL summarization in Redshift. SAS is not involved in the initial computations. The result set, which should fit in a report table, is gathered in SAS for presentation in the SAS user interface.

Some procedures can be offloaded entirely to Redshift. If you run the following RANK procedure against a Redshift table and you want the output to also be a Redshift table, SAS/ACCESS Interface to Amazon Redshift pushes down the necessary SQL statements so that nothing is computed on the SAS side.

 

Here is a partial extract from the SAS log:

 

82   proc rank data=myrs.orders out=myrs.ranks ;
83      var lo_revenue lo_supplycost ;
84      ranks rev_rank cost_rank ;
85   run ;
 
REDSHIFT_13: Executed: on connection 2
CREATE TABLE "public".RANKS (p_partkey NUMERIC(11),p_name VARCHAR(22),p_mfgr VARCHAR(6),p_category VARCHAR(7),p_brand1 
VARCHAR(9),p_color VARCHAR(11),p_type VARCHAR(25),p_size NUMERIC(11),p_container VARCHAR(10),s_suppkey NUMERIC(11),s_name 
...
 REDSHIFT: AUTOCOMMIT is YES for connection 2
  
 REDSHIFT_14: Executed: on connection 2
 INSERT INTO public.RANKS ( p_partkey, p_name, p_mfgr, p_category, p_brand1, p_color, p_type, p_size, p_container, s_suppkey, 
...
lo_revenue, lo_supplycost, lo_tax, lo_commitdate, lo_shipmode, 
 rev_rank, cost_rank) SELECT "table0"."p_partkey" AS p_partkey, "table0"."p_name" AS p_name, "table0"."p_mfgr" AS p_mfgr, 
...
"table0"."lo_shipmode" AS lo_shipmode, 
 "table1"."rankalias0" AS rev_rank, "table2"."rankalias1" AS cost_rank FROM ( SELECT "c_address" AS "c_address", "c_city" AS 
 "c_city", "c_custkey" AS "c_custkey", "c_mktsegment" AS "c_mktsegment", 
...
 "s_phone" AS "s_phone", "s_region" AS "s_region", "s_suppkey" AS "s_suppkey" FROM public.ORDERS ) AS "table0" LEFT JOIN ( SELECT 
 DISTINCT "lo_revenue", "tempcol0" AS "rankalias0" FROM ( SELECT "lo_revenue", AVG( "tempcol1" ) OVER ( PARTITION BY "lo_revenue" ) 
...
"s_suppkey" 
 AS "s_suppkey" FROM public.ORDERS ) AS "subquery7" WHERE ( ( "lo_supplycost" IS NOT NULL ) ) ) AS "subquery6" ) AS "subquery4" ) 
AS 
 "table2" ON ( ( "table0"."lo_supplycost" = "table2"."lo_supplycost" ) )
  
 REDSHIFT: 25182 row(s) affected by INSERT/UPDATE/DELETE or other statement.
 NOTE: PROCEDURE SQL used (Total process time):
       real time           0.63 seconds
       cpu time            0.06 seconds
       
 
NOTE: SQL generation was used to perform the ranking.
NOTE: PROCEDURE RANK used (Total process time):
      real time           0.77 seconds
      cpu time            0.14 seconds
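To illustrate the summarization case mentioned earlier, here is a minimal sketch (the class and analysis variable choices are illustrative; SQLGENERATION= is the system option that controls in-database SQL generation for these procedures):

```sas
/* Summarize a Redshift table in-database. With SQLGENERATION=DBMS,
   SAS generates SQL for supported procedures where possible; with
   SASTRACE enabled, the log shows the SELECT ... GROUP BY that
   Redshift executes, and only the summary rows come back to SAS. */
options sqlgeneration=dbms ;

proc means data=myrs.orders sum mean ;
   class c_nation ;
   var lo_revenue lo_quantity ;
run ;
```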

 

This blog article is an extract from the SAS Global Forum paper SAS® and Amazon Redshift: Overview of Current Capabilities.

 

Thanks for reading.

