SAS Viya 3.5 : CAS Action EPCS for SPARK

With the SAS Viya 3.5 release, there is a new CAS action set that improves the performance of data transfer between Spark/Hadoop and CAS. The new actions are available with SAS Data Connect Accelerator for Hadoop/Spark. A CAS session starts a continuous SAS Embedded Process (EP) for Spark on the Hadoop cluster, and that process then handles a series of table transfers between CAS and Spark/Hadoop.

The EP continuous session (EPCS) provides tight integration between CAS and Spark by processing multiple execution requests without starting and stopping the EP for each request. Each CAS session gets its own continuous EP process on the Hadoop cluster. Data transfer is faster because resources are allocated once on the Hadoop cluster when the session starts, and subsequent Spark requests reuse them.

List of CAS actions for SAS Embedded Process for Spark:

startSparkEP - to start SAS EP for Spark continuous session.
stopSparkEP - to stop SAS EP for Spark continuous session.

Prerequisites:

SAS Data Connect Accelerator for Hadoop/Spark installed on the CAS nodes.
SAS EP installed on the Hadoop cluster.
Spark 2 and the Hive metastore available on the Hadoop cluster.

The following code loads data from Hadoop/Spark to CAS using a continuous EP session. In a new CAS session, it creates a session-level Hadoop caslib with parallel data transfer mode and the Spark platform. The Hadoop JAR path includes the Spark JARs. Step 1 starts the continuous SAS EP for Spark process for the 'cashive' caslib in the 'mySession' session. Step 2 loads a list of Hive tables into CAS in the same session using the continuous EP process. Step 3 stops the continuous EP process.

Code:

CAS mySession  SESSOPTS=(messagelevel=all CASLIB="public" TIMEOUT=999 LOCALE="en_US");

/* Define a caslib for parallel data transfer */
caslib cashive datasource=(srctype="hadoop",
   server="server.example.com",
   username="hadoop",
   dataTransferMode="parallel", 
   platform="spark", 
   hadoopconfigdir="/opt/sas/viya/config/data/hadoop/conf",
   hadoopjarpath="/opt/sas/viya/config/data/hadoop/lib:/opt/sas/viya/config/data/hadoop/lib/spark", 
   schema="default" ,
   dfdebug="epall");

proc cas;                                                        /* Step 1 */
   session mySession;
   action sparkEmbeddedProcess.startsparkep caslib="cashive",
      executorInstances=16, executorCores=2;
run;
quit;

proc casutil incaslib="cashive" outcaslib="cashive";             /* Step 2 */
   load casdata="dm_fact_mega_corp_1g"  casout="dm_fact_mega_corp_1g" replace ;
   load casdata="dm_fact_mega_corp_2g"  casout="dm_fact_mega_corp_2g" replace ;
   load casdata="dm_fact_mega_corp_5g"  casout="dm_fact_mega_corp_5g" replace ;
   load casdata="dm_fact_mega_corp_10g"  casout="dm_fact_mega_corp_10g" replace ;
   load casdata="dm_fact_mega_corp_20g"  casout="dm_fact_mega_corp_20g" replace ;
run ;
quit; 

proc cas;            /* Step 3*/
  session mySession;
   action sparkEmbeddedProcess.stopsparkep   caslib="cashive";
run; 
quit;

cas mySession terminate;

Log extract:

…………
……..
82   
83   proc cas;
84     session mySession;
85     sparkEmbeddedProcess.startsparkep
86             caslib="cashive";
87   run;
NOTE: Active Session now mySession.
NOTE: Added action set 'sparkEmbeddedProcess'.
NOTE: SAS Embedded Process implementation version:[1.7.23]. Full version:[17000].
NOTE: The SAS Embedded Process for Spark continuous session started. Tracking URL: 
      http://example.server.com:8088/proxy/application_1579556421270_0036/ 
88   quit;
……….
……………

82   
83   proc casutil incaslib="cashive" outcaslib="cashive";
NOTE: The UUID 'ac7fe4f0-3140-9142-ae9d-d3b72e86f2cb' is connected using session MYSESSION.
84      load casdata="dm_fact_mega_corp_1g"  casout="dm_fact_mega_corp_1g" replace ;
NOTE: Performing parallel LoadTable action using SAS Data Connect Accelerator for Hadoop.
NOTE: SAS Embedded Process tracking URL: http://example.server.com:8088/proxy/application_1579556421270_0036/ 
NOTE: Job Status ......: SUCCEEDED
….
…………….

NOTE: The Cloud Analytic Services server processed the request in 84.345992 seconds.
85      load casdata="dm_fact_mega_corp_2g"  casout="dm_fact_mega_corp_2g" replace ;
NOTE: Performing parallel LoadTable action using SAS Data Connect Accelerator for Hadoop.
NOTE: SAS Embedded Process tracking URL: http://example.server.com:8088/proxy/application_1579556421270_0036/ 
NOTE: Job Status ......: SUCCEEDED
….
……………

Notice in the log that every load action reports the same Hadoop application ID (application_1579556421270_0036) as the startSparkEP action. Each load reuses the continuous EP session instead of starting a new Hadoop job.
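
To confirm that the tables from Step 2 are in memory, one option is to list the tables in the caslib. This is a minimal sketch, assuming the 'mySession' session and the 'cashive' caslib from the code above are still active, so it would have to run before the session is terminated.

```sas
/* Assumption: session mySession and caslib cashive from the example are still active. */
proc casutil incaslib="cashive";
   /* List the in-memory tables in the cashive caslib */
   list tables;
run;
quit;
```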

After you invoke the startSparkEP action, a continuously running application appears in the Hadoop cluster scheduler, just like any other Spark application. The following screenshot shows it. Notice the number of running containers and the allocated memory. The EPCS process allocates these resources based on the executorInstances= parameter, and they affect the performance of the EPCS process.


The startSparkEP action has additional parameters that let you control the resources used on the Hadoop cluster by subsequent CAS actions. For better performance, set taskCores=1 or leave it at its default value. The following example shows these parameters.

sparkEmbeddedProcess.startsparkep
          caslib="cashive",
          executorInstances=4, executorCores=2, executorMemory=2, taskCores=1;

CAS load performance with an EPCS session: Data load performance between CAS and Hadoop/Spark depends on hardware resources and on network speed between the CAS servers and the Hadoop cluster. The following test results are from an environment where both CAS and the Hadoop cluster were hosted on RACE servers.

Test environment: RACE CAS Servers = 1 + 4 Nodes – 32 GB Mem with 4 CPU on each node. RACE Hadoop Cluster = 1 + 3 Nodes – 16 GB Mem with 2 CPU on each node.

Run time:

Important link: SAS Embedded Process for Spark Action Set

Rahul_B · ‎05-05-2020

Hi, do you know if it is possible to save or write back an in-memory CAS table to Hive? Also, can I use the EP (Embedded Process) to load from Hive to CAS again?

UttamKumar · ‎05-05-2020

Yes! You can save CAS tables to Hive. You can use the SAS Hadoop EP for parallel save and load between CAS and Hive. You can also use the SAS Hadoop/Hive Data Connector for serial save and load between CAS and Hive.

-Uttam
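
As a rough sketch of the parallel save described in the reply above, PROC CASUTIL's SAVE statement can write an in-memory table back through a Hadoop caslib, and LOAD can bring it back. The sketch assumes the 'mySession' session and the 'cashive' caslib (dataTransferMode="parallel") from the article are active and that the table dm_fact_mega_corp_1g is already loaded. The target name dm_fact_mega_corp_1g_copy is hypothetical.

```sas
/* Assumption: caslib cashive is defined as in the article and         */
/* dm_fact_mega_corp_1g is already loaded in that caslib.              */
proc casutil incaslib="cashive" outcaslib="cashive";
   /* Save the in-memory CAS table to Hive (target name is hypothetical) */
   save casdata="dm_fact_mega_corp_1g" casout="dm_fact_mega_corp_1g_copy" replace;
   /* Load the saved Hive table back into CAS */
   load casdata="dm_fact_mega_corp_1g_copy" casout="dm_fact_mega_corp_1g_copy" replace;
run;
quit;
```

If startSparkEP has been run for the caslib, these transfers would be expected to reuse the continuous session in the same way as the loads in the article, though that is not verified here.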

Rahul_B · ‎05-06-2020

Hi,

Thanks for your reply.

I checked with the admin, and it looks like all the Hadoop and Hive components are installed.

Could you please write some example code with the LIBNAME specification? That would be helpful.

-Rahul

UttamKumar · ‎05-08-2020

Hi Rahul,

Here is sample code with a LIBNAME reference to write/read SAS data sets to/from Hive using the Spark engine.

=========
option set=SAS_HADOOP_JAR_PATH="/opt/sas/viya/config/data/hadoop/lib:/opt/sas/viya/config/data/hadoop/lib/spark";
option set=SAS_HADOOP_CONFIG_PATH="/opt/sas/viya/config/data/hadoop/conf";
options sastrace=',,,d' sastraceloc=saslog nostsuffix sql_ip_trace=(note,source) msglevel=i;
options DBIDIRECTEXEC;

libname hivelib clear;

libname hivelib hadoop
        server="server.example.com"
        user="hadoop"
        database=default
        subprotocol=hive2
  /*use fetch conversion and strict mode control to have an error message when costly ORDER is generated*/
properties="hive.fetch.task.conversion=minimal;hive.fetch.task.conversion.threshold=-1;hive.mapred.mode=strict;hive.execution.engine=spark";

/* save SAS table to HIVE */
data hivelib.prdsal2;
set sashelp.prdsal2;
run;

==========

-Uttam
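
The reply above shows only the write direction. A minimal sketch of the read direction follows, assuming the hivelib libref is assigned as in that reply and the Hive table prdsal2 has been created by the DATA step there. The output name work.prdsal2_copy is hypothetical.

```sas
/* Assumption: libref hivelib is assigned and hivelib.prdsal2 exists. */
/* Read the Hive table back into a SAS data set in WORK */
data work.prdsal2_copy;
   set hivelib.prdsal2;
run;

/* Inspect the columns and row count of the copy */
proc contents data=work.prdsal2_copy;
run;
```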
