With the SAS Viya 3.5 release, there is a new CAS action set that improves the performance of data transfer between Spark/Hadoop and CAS. The new CAS actions are available with the SAS Data Connect Accelerator for Hadoop/Spark. A CAS session initiates a continuous SAS Embedded Process (EP) for Spark at the Hadoop cluster to transfer a list of tables between CAS and Spark/Hadoop.
The EPCS (Embedded Process Continuous Session) provides tight integration between CAS and Spark by processing multiple execution requests without starting and stopping the EP process for each request. One continuous SAS EP process for Spark runs at the Hadoop cluster per CAS session. This improves data-transfer performance because a new EP process does not have to start for each request: the resources are allocated once at the Hadoop cluster, and subsequent Spark requests reuse them.
The CAS actions for the SAS Embedded Process for Spark are:
- startSparkEP – starts the continuous Spark SAS EP session for a CASLIB.
- stopSparkEP – stops the continuous Spark SAS EP session.
The following code describes a data load from Hadoop/Spark to CAS using a continuous EP session. In a new CAS session, it creates a session-level Hadoop CASLIB with parallel transfer mode and the Spark platform; the Hadoop JAR path includes the Spark JARs. Step 1 starts the continuous Spark SAS EP process for the ‘cashive’ CASLIB under the ‘mySession’ session. Step 2 loads a list of Hive tables into CAS under the same CAS session using the continuous Spark EP process. Step 3 stops the continuous Spark SAS EP process.
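A minimal sketch of these three steps might look like the following. The host name, user name, JAR/config paths, and Hive table names are hypothetical placeholders for values from your environment, and the action set name (sparkEmbeddedProcess) is assumed from the startSparkEP action discussed in this post.

```sas
/* New CAS session */
cas mySession sessopts=(caslib=casuser);

/* Session-level Hadoop CASLIB with parallel transfer mode and the Spark
   platform. hadoopjarpath must include the Spark JARs. */
caslib cashive datasource=(
   srctype="hadoop",
   server="hiveserver.example.com",            /* hypothetical host */
   username="hadoopuser",                      /* hypothetical user */
   hadoopjarpath="/opt/sas/hadoopjars:/opt/sas/sparkjars",
   hadoopconfigdir="/opt/sas/hadoopconfig",
   dataTransferMode="parallel",
   platform="spark");

proc cas;
   /* Step 1: start the continuous Spark SAS EP process for the CASLIB */
   sparkEmbeddedProcess.startSparkEP / caslib="cashive";

   /* Step 2: load a list of Hive tables through the same EP session */
   table.loadTable / caslib="cashive" path="sales_fact"
                     casout={caslib="casuser", name="sales_fact"};
   table.loadTable / caslib="cashive" path="customer_dim"
                     casout={caslib="casuser", name="customer_dim"};

   /* Step 3: stop the continuous Spark SAS EP process */
   sparkEmbeddedProcess.stopSparkEP / caslib="cashive";
quit;

cas mySession terminate;
```

Because both loadTable calls run between startSparkEP and stopSparkEP, they share the single continuous EP process rather than each starting its own.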
Notice in the log that each data-load CAS action uses the same Hadoop application/job ID to load data from Hadoop/Spark into CAS.
As a result of invoking the startSparkEP CAS action, you can see a continuous process in the Hadoop cluster's resource scheduler, just like any Spark application, as the following screenshot shows. Notice the number of running containers and the allocated memory: the EPCS process allocates these resources according to the executorInstances= parameter, and they affect the performance of the EPCS process.
There are additional parameters for the startSparkEP CAS action that let you control the resources used at the Hadoop cluster by subsequent CAS actions. It's suggested to set taskCores=1, or to leave it at the default value, for better performance. The following example describes the same.
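For instance, a start-up call with explicit resource settings might look like this sketch. Only executorInstances= and taskCores= are taken from the discussion above; the values shown are illustrative, not recommendations.

```sas
proc cas;
   /* Start the continuous Spark SAS EP process with explicit resources.
      executorInstances= controls the number of Spark executors (and thus
      the containers and memory visible in the cluster scheduler);
      taskCores=1 is the suggested setting for better performance. */
   sparkEmbeddedProcess.startSparkEP /
      caslib="cashive",
      executorInstances=4,
      taskCores=1;
quit;
```

Larger executorInstances= values claim more containers and memory on the cluster, which can speed up parallel transfer at the cost of resources available to other jobs.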
CAS load performance with an EPCS session: the data-load performance between CAS and Hadoop/Spark depends on the hardware resources and the network speed between the CAS servers and the Hadoop cluster. The following test results are from an environment where both the CAS and Hadoop clusters were hosted on RACE servers.
Test environment: RACE CAS servers = 1 + 4 nodes, each with 32 GB memory and 4 CPUs. RACE Hadoop cluster = 1 + 3 nodes, each with 16 GB memory and 2 CPUs.
Run time: [chart of measured data-load run times]

Important link: SAS Embedded Process for Spark Action Set