My last post covered SAS Viya “Multi Node Data Transfer” mode, focusing mainly on loading data from a source DBMS into CAS. In this post I look at the other direction: saving data from CAS to a DBMS using “Multi Node Data Transfer” mode. Along the way I answer questions such as: what does the NUMWRITENODES= parameter do; what happens when you specify NUMWRITENODES=2 even though you have 7 CAS worker nodes; and, in the case of Hadoop/Hive, does the save create MapReduce jobs?
“Multi Node Data Transfer” mode works in both directions: you can save CAS data to a DBMS as well as load data into CAS from a source DBMS. In this mode, each CAS worker makes its own concurrent connection to read data from, and write data to, the source DBMS or distributed data system.
The CAS controller coordinates the multi-node transfer: it directs each CAS worker node to submit an independent, concurrent INSERT statement that writes that worker's slice of the data. The data itself flows from the CAS worker nodes to the DBMS directly over single-threaded connections, bypassing the CAS controller. The CASLIB NUMWRITENODES= parameter determines how many CAS worker nodes make direct connections to the DBMS.
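If you are not sure how many worker nodes a session actually has (and therefore how many direct connections NUMWRITENODES=0 can fan out to), one quick check is the listNodes action in the builtins action set. This is a minimal sketch; the exact output columns can vary by Viya release:

proc cas;
   /* List each node in the CAS server and its role (controller or worker). */
   builtins.listNodes;
quit;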
Let's consider two scenarios to investigate what happens on the source CAS environment and the target Hadoop environment when CAS data is saved into a Hive table.
The following figures describe the data save from CAS to the DBMS using multi-node data transfer mode with all available worker nodes. CAS is hosted on a multi-node environment with the SAS Data Connector installed on each node (CAS controller and workers). A CASLIB is defined with NUMWRITENODES=0, which uses all available CAS worker nodes. With each data save request, the CAS controller directs each CAS worker node to submit an individual INSERT statement that writes its slice of data. Each CAS worker node stages its data slice in a temporary file/table on the DBMS before inserting it into the actual table. When the target is a Hive table, the Hadoop cluster runs a corresponding MapReduce job for each INSERT statement submitted by a CAS worker node.
The following code example shows a data save using multi-node data transfer mode with all available CAS worker nodes. It assigns a CASLIB in serial mode with NUMWRITENODES=0 and saves an existing CAS table that is evenly distributed across the CAS worker nodes under another, predefined CASLIB, so this is a cross-CASLIB save. Because the NUMWRITENODES= value is something other than 1, the save follows the multi-node mechanism. The CAS controller first drops any existing target table and creates an empty table on the target DBMS, then directs each CAS worker node to insert its data slice into that table. Before inserting into the actual table, each worker stages its data in a temporary file/table; in the case of Hive, the temporary data files land in the HDFS /tmp folder.
An existing CAS table, evenly distributed across the CAS worker nodes, is used for the data save.
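One way to check that distribution from code is the table.tableDetails action with level="node", which reports one result row per CAS node. A minimal sketch, using the source CASLIB and table names from the example that follows:

proc cas;
   /* Report per-node detail (rows, blocks) for the source CAS table. */
   table.tableDetails / caslib="HiveSrl1" name="prdsal2_1G" level="node";
quit;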
Code to save a CAS table using all available worker nodes (a cross-CASLIB save):
CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);
.....
caslib HiveSrlN0 datasource=(srctype="hadoop",
   server="sascdh01.race.sas.com",
   username="hadoop",
   dataTransferMode="SERIAL",
   NUMREADNODES=0,
   NUMWRITENODES=0,
   hadoopconfigdir="/opt/MyHadoop/CDH/Config",
   hadoopjarpath="/opt/MyHadoop/CDH/Jars",
   schema="default");

proc casutil;
   save incaslib="HiveSrl1" outcaslib="HiveSrlN0"
        casdata="prdsal2_1G" casout="prdsal2_1G_cas" replace;
run;
SAS Log extract:
….
….
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72
73 proc casutil ;
NOTE: The UUID 'e9547888-e0b0-a74d-aae6-9563121f0568' is connected using session MYSESSION.
74 save incaslib="HiveSrl1" outcaslib="HiveSrlN0"
75 casdata="prdsal2_1G" casout="prdsal2_1G_cas" replace;
NOTE: Executing action 'table.save'.
NOTE: Performing serial SaveTable action using SAS Data Connector to Hadoop.
NOTE: Cloud Analytic Services saved the file prdsal2_1G_cas in caslib HIVESRLN0.
NOTE: Action 'table.save' used (Total process time):
NOTE: real time 347.188016 seconds
NOTE: cpu time 36.546948 seconds (10.53%)
NOTE: total nodes 8 (32 cores)
NOTE: total memory 250.11G
NOTE: memory 34.38M (0.01%)
NOTE: The Cloud Analytic Services server processed the request in 347.188016 seconds.
76 run;
…
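For reference, the PROC CASUTIL SAVE statement used above maps to the table.save action (note the "Executing action 'table.save'" line in the log). The same cross-CASLIB save could be written directly in PROC CAS; here is a minimal sketch assuming the same CASLIB and table names:

proc cas;
   /* Save the CAS table through the multi-node caslib to a Hive table. */
   table.save /
      table={caslib="HiveSrl1", name="prdsal2_1G"}  /* source CAS table                */
      caslib="HiveSrlN0"                            /* target caslib (NUMWRITENODES=0) */
      name="prdsal2_1G_cas"                         /* target Hive table name          */
      replace=true;
quit;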
On the database side (Hive in this example), note the active sessions and queries submitted by the CAS worker nodes. The CAS environment has 7 CAS worker nodes, and each one has an active session to the Hive server. (The hosts file is shown below to match up the IP addresses with the CAS worker nodes.)
IP address and server name from CAS environment:
10.96.11.38 sascas01.race.sas.com sascas01
10.96.8.175 sascas02.race.sas.com sascas02
10.96.3.120 sascas03.race.sas.com sascas03
10.96.12.136 sascas04.race.sas.com sascas04
10.96.10.167 sascas05.race.sas.com sascas05
10.96.16.181 sascas06.race.sas.com sascas06
10.96.17.112 sascas07.race.sas.com sascas07
10.96.11.184 sascas08.race.sas.com sascas08
The Hive UI "Open Query" view shows the "LOAD DATA" statements that pull from the staging files.
Below is a subset of the staged temporary data files in HDFS, listed with the Hadoop FS utility while the CAS table save process was running:
[hadoop@sascdh01 ~]$ hadoop fs -ls /tmp/sas_*
Found 1 items
-rw-r--r-- 3 hadoop supergroup 135026535 2018-04-11 16:42 /tmp/sas_tempbulk-2018-04-11-16-42-50-905/sas_tempbulk-2018-04-11-16-42-51-141_0
Found 1 items
-rw-r--r-- 3 hadoop supergroup 135268559 2018-04-11 16:42 /tmp/sas_tempbulk-2018-04-11-16-42-50-907/sas_tempbulk-2018-04-11-16-42-51-144_0
Using the Hadoop Resource Manager Job History user interface, you can see the corresponding MapReduce job for each INSERT statement submitted by the CAS worker nodes.
The following figures, code, and other artifacts describe the data save from CAS to the DBMS using multi-node data transfer mode with a subset of the worker nodes. CAS is hosted on a multi-node environment with the SAS Data Connector installed on each node (CAS controller and workers). A CASLIB is defined with NUMWRITENODES=2 so that only 2 workers make direct connections to the DBMS. When the NUMWRITENODES= value is less than the number of available CAS worker nodes, the CAS controller randomly selects the specified number of worker nodes to make the direct connections. The worker nodes that are not chosen copy their data blocks to the chosen worker nodes. The CAS controller then directs each selected worker node to submit an individual INSERT statement that writes its data slice (including the data received from the unchosen nodes). Each selected worker stages its data slice in a temporary file/table on the DBMS before inserting it into the actual table; in the case of Hive, each worker insert again runs as a MapReduce job on the Hadoop cluster.
Code to save a CAS table using a subset of the available worker nodes (a cross-CASLIB save):
CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);
.....
caslib HiveSrlN2 datasource=(srctype="hadoop",
   server="sascdh01.race.sas.com",
   username="hadoop",
   dataTransferMode="SERIAL",
   NUMREADNODES=2,
   NUMWRITENODES=2,
   hadoopconfigdir="/opt/MyHadoop/CDH/Config",
   hadoopjarpath="/opt/MyHadoop/CDH/Jars",
   schema="default");

proc casutil;
   save incaslib="HiveSrl1" outcaslib="HiveSrlN2"
        casdata="prdsal2_1G" casout="prdsal2_1G_cas" replace;
run;
SAS Log extract:
….
….
72
73 proc casutil ;
NOTE: The UUID 'e9547888-e0b0-a74d-aae6-9563121f0568' is connected using session MYSESSION.
74 save incaslib="HiveSrl1" outcaslib="HiveSrlN2"
75 casdata="prdsal2_1G" casout="prdsal2_1G_cas" replace;
NOTE: Executing action 'table.save'.
NOTE: Performing serial SaveTable action using SAS Data Connector to Hadoop.
NOTE: Cloud Analytic Services saved the file prdsal2_1G_cas in caslib HIVESRLN2.
NOTE: Action 'table.save' used (Total process time):
NOTE: real time 97.914763 seconds
NOTE: cpu time 40.201150 seconds (41.06%)
NOTE: total nodes 8 (32 cores)
NOTE: total memory 250.11G
NOTE: memory 22.64M (0.01%)
NOTE: The Cloud Analytic Services server processed the request in 97.914763 seconds.
76 run;
…
In this case, we see only 2 workers (as well as the controller) interacting with the Hive environment; the hosts file listing above again maps the IP addresses to the CAS worker nodes.
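As a quick check from the SAS side, you can list the tables visible through the target CASLIB to confirm that the saved Hive table is there. A sketch using PROC CASUTIL's LIST statement:

proc casutil;
   /* List the source tables/files that the Hive caslib can see in its schema. */
   list files incaslib="HiveSrlN2";
run;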