Parallel data load to SAS Viya

In SAS® Viya™, the analytic engine is the Cloud Analytic Services (CAS) server, which uses high-performance, multithreaded analytic code to rapidly process requests against data of any size. Before you can use CAS to work with a data set, you must load that data set into the CAS server. In the 16w20 release of SAS Viya, SAS Studio® is the integrated programming environment for CAS. In SAS Studio, you can write SAS code to load data from the source environment into CAS.

This post explores the parallel method for loading data from a Hadoop Hive table into the CAS server. Data can be loaded from a Hive table into CAS by one of two methods: serial or parallel. The Data Connector facilitates the serial method, and Data Connector Accelerators facilitate the parallel method. To enable parallel data loading from a Hive table into the CAS server, you also need SAS® Embedded Process installed on the Hadoop cluster.

The concept of a parallel data load to CAS is similar to a parallel data load to a SAS® LASR™ Analytic Server. The following diagram depicts the parallel data load from a Hive table to the CAS environment. The data flows directly from the Hadoop data nodes to the CAS worker nodes.

In SAS Viya with CAS, the caslib statement defines a library (a caslib), and its datasource= option carries the connection information for the data source. The caslib connects to the source environment using either the serial or the parallel method. The following example shows a caslib statement with a parallel connection to the Hive environment, set with the parameter dataTransferMode="parallel". Parallel data transfer works only if SAS Embedded Process is installed on the Hadoop cluster.

/* Assign EP HIVE CASLIB */
caslib hiveEP datasource=(srctype="hive",
                          server="gatekrbhdp01.gatehadoop.com",
                          dataTransferMode="parallel",
                          hadoopconfigdir="/opt/sas/hadoop/client_conf",
                          hadoopjarpath="/opt/sas/hadoop/client_jar");
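For comparison, the serial route mentioned earlier uses the same statement with a different transfer mode. The following is a minimal sketch, assuming the same server and client paths as the example above; the caslib name hiveSrl is hypothetical.

```sas
/* Assign a serial HIVE CASLIB (hiveSrl is a hypothetical name) */
caslib hiveSrl datasource=(srctype="hive",
                           server="gatekrbhdp01.gatehadoop.com",
                           dataTransferMode="serial",
                           hadoopconfigdir="/opt/sas/hadoop/client_conf",
                           hadoopjarpath="/opt/sas/hadoop/client_jar");
```

With this setting, the load goes through the Data Connector rather than through SAS Embedded Process.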

When PROC CASUTIL is executed against the above caslib with a LIST TABLES or LIST FILES statement, the Results tab displays detailed information about how the caslib is connected to the Hadoop Hive environment.
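A minimal sketch of that check, assuming the hiveEP caslib has already been assigned in an active CAS session:

```sas
/* Show the source tables reachable through the caslib,
   then the tables already loaded into memory */
proc casutil;
   list files incaslib="hiveEP";
   list tables incaslib="hiveEP";
quit;
```

LIST FILES reports what is available in the Hive data source, while LIST TABLES reports what has already been loaded into CAS.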

When the LOAD statement is executed, the data flows from the Hadoop Hive environment to CAS over the parallel route: it is transferred directly from the Hadoop data nodes to the CAS worker nodes.

/* Load HIVE tables (In memory) */
proc casutil;
   load casdata="stocks" casout="stocks" outcaslib="hiveEP" incaslib="hiveEP";
quit;

To verify that the data load from Hive to CAS used the parallel route, check the MapReduce job log from the data feeder process on the Hadoop cluster. The log contains references to SAS Embedded Process and DS2, as shown below.

Log Type: stdout
Log Upload Time: Wed Jun 15 23:22:34 -0400 2016
Log Length: 195
20160615:23.22.25.21: 00000012:WARNING: [01S02]Current catalog set to SASEP (0x80fff8bd)
20160615:23.22.25.60: 00000018:NOTE: All Embedded Process DS2 execution instances completed with SUCCESS.

Log Type: syslog
Log Upload Time: Wed Jun 15 23:22:34 -0400 2016
Log Length: 9273
Showing 4096 bytes of 9273 total. Click here for the full log.

……….

Related reading

If you would like more information, see the Documentation Page.

ronan · 09-21-2020

Thanks for sharing. There is also the Multi Node Data Transfer mode which, although it runs in parallel, doesn't require SAS Embedded Process. It only assumes that the source data has a column that can be used for partitioning:

https://documentation.sas.com/?docsetId=casref&docsetTarget=p12gi5eub04169n1x62tz74kvwt2.htm&docsetV...

Highly effective on Hadoop/Hive.
