My last post covered SAS Viya “Multi Node Data Transfer” mode, focusing mainly on loading data from a source DBMS into CAS. In this post I look at the other direction: saving data from CAS to a DBMS using “Multi Node Data Transfer” mode. Along the way I answer questions such as: what does the NUMWRITENODES= parameter do; what happens when you specify NUMWRITENODES=2 even though you have 7 CAS worker nodes; and, in the case of Hadoop/Hive, does the save create MapReduce jobs?
“Multi Node Data Transfer” mode works in both directions: you can save CAS data to a DBMS as well as load data into CAS from a source DBMS. In this mode, each CAS worker makes its own concurrent connection to read data from, and write data to, the source DBMS or distributed data system.
The CAS controller coordinates the multi-node transfer: it directs each CAS worker node to submit an independent, concurrent INSERT statement that writes that worker's slice of the data. The data itself flows from the CAS worker nodes to the DBMS directly over single-threaded connections, bypassing the CAS controller. The CASLIB NUMWRITENODES= parameter determines how many CAS worker nodes make direct connections to the DBMS.
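If you are not sure how many worker nodes a session actually has (and therefore how many direct connections NUMWRITENODES=0 can fan out to), one quick check is the listNodes action in the builtins action set. This is a minimal sketch; the exact output columns can vary by Viya release:

proc cas;
   /* List each node in the CAS server and its role (controller or worker). */
   builtins.listNodes;
quit;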
Let's consider two scenarios to investigate what happens on the source CAS environment and the target Hadoop environment when CAS data is saved into a Hive table.
The following figures describe the data save from CAS to the DBMS using multi-node data transfer mode with all available worker nodes. CAS is hosted on a multi-node environment with the SAS Data Connector installed on each node (CAS controller and workers). A CASLIB is defined with NUMWRITENODES=0, which uses all available CAS worker nodes. With each data save request, the CAS controller directs each CAS worker node to submit an individual INSERT statement that writes its slice of data. Each CAS worker node stages its data slice in a temporary file/table on the DBMS before inserting it into the actual table. When the target is a Hive table, the Hadoop cluster runs a corresponding MapReduce job for each INSERT statement submitted by a CAS worker node.
The following code example shows a data save using multi-node data transfer mode with all available CAS worker nodes. It assigns a CASLIB in serial mode with NUMWRITENODES=0 and saves an existing CAS table that is evenly distributed across the CAS worker nodes under another, predefined CASLIB, so this is a cross-CASLIB save. Because the NUMWRITENODES= value is something other than 1, the save follows the multi-node mechanism. The CAS controller first drops any existing target table and creates an empty table on the target DBMS, then directs each CAS worker node to insert its data slice into that table. Before inserting into the actual table, each worker stages its data in a temporary file/table; in the case of Hive, the temporary data files land in the HDFS /tmp folder.
An existing CAS table, evenly distributed across the CAS worker nodes, is used for the data save.
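One way to check that distribution from code is the table.tableDetails action with level="node", which reports one result row per CAS node. A minimal sketch, using the source CASLIB and table names from the example that follows:

proc cas;
   /* Report per-node detail (rows, blocks) for the source CAS table. */
   table.tableDetails / caslib="HiveSrl1" name="prdsal2_1G" level="node";
quit;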
Code to save a CAS table using all available worker nodes (a cross-CASLIB save):
CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);
.....
caslib HiveSrlN0 datasource=(srctype="hadoop",
   server="sascdh01.race.sas.com",
   username="hadoop",
   dataTransferMode="SERIAL",
   NUMREADNODES=0,
   NUMWRITENODES=0,
   hadoopconfigdir="/opt/MyHadoop/CDH/Config",
   hadoopjarpath="/opt/MyHadoop/CDH/Jars",
   schema="default");

proc casutil;
   save incaslib="HiveSrl1" outcaslib="HiveSrlN0"
        casdata="prdsal2_1G" casout="prdsal2_1G_cas" replace;
run;
SAS Log extract:
….
….
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72
73 proc casutil ;
NOTE: The UUID 'e9547888-e0b0-a74d-aae6-9563121f0568' is connected using session MYSESSION.
74 save incaslib="HiveSrl1" outcaslib="HiveSrlN0"
75 casdata="prdsal2_1G" casout="prdsal2_1G_cas" replace;
NOTE: Executing action 'table.save'.
NOTE: Performing serial SaveTable action using SAS Data Connector to Hadoop.
NOTE: Cloud Analytic Services saved the file prdsal2_1G_cas in caslib HIVESRLN0.
NOTE: Action 'table.save' used (Total process time):
NOTE: real time 347.188016 seconds
NOTE: cpu time 36.546948 seconds (10.53%)
NOTE: total nodes 8 (32 cores)
NOTE: total memory 250.11G
NOTE: memory 34.38M (0.01%)
NOTE: The Cloud Analytic Services server processed the request in 347.188016 seconds.
76 run;
…
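For reference, the PROC CASUTIL SAVE statement used above maps to the table.save action (note the "Executing action 'table.save'" line in the log). The same cross-CASLIB save could be written directly in PROC CAS; here is a minimal sketch assuming the same CASLIB and table names:

proc cas;
   /* Save the CAS table through the multi-node caslib to a Hive table. */
   table.save /
      table={caslib="HiveSrl1", name="prdsal2_1G"}  /* source CAS table                */
      caslib="HiveSrlN0"                            /* target caslib (NUMWRITENODES=0) */
      name="prdsal2_1G_cas"                         /* target Hive table name          */
      replace=true;
quit;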
On the database side (Hive in this example), note the active sessions and queries submitted by the CAS worker nodes. The CAS environment has 7 CAS worker nodes, and each one has an active session to the Hive server. (The hosts file is shown below to match up the IP addresses with the CAS worker nodes.)
IP address and server name from CAS environment:
10.96.11.38 sascas01.race.sas.com sascas01
10.96.8.175 sascas02.race.sas.com sascas02
10.96.3.120 sascas03.race.sas.com sascas03
10.96.12.136 sascas04.race.sas.com sascas04
10.96.10.167 sascas05.race.sas.com sascas05
10.96.16.181 sascas06.race.sas.com sascas06
10.96.17.112 sascas07.race.sas.com sascas07
10.96.11.184 sascas08.race.sas.com sascas08
The Hive UI "Open Query" view shows the "LOAD DATA" statements that pull from the staging files.
Below is a subset of the staged temporary data files in HDFS, listed with the Hadoop FS utility while the CAS table save process was running:
[hadoop@sascdh01 ~]$ hadoop fs -ls /tmp/sas_*
Found 1 items
-rw-r--r-- 3 hadoop supergroup 135026535 2018-04-11 16:42 /tmp/sas_tempbulk-2018-04-11-16-42-50-905/sas_tempbulk-2018-04-11-16-42-51-141_0
Found 1 items
-rw-r--r-- 3 hadoop supergroup 135268559 2018-04-11 16:42 /tmp/sas_tempbulk-2018-04-11-16-42-50-907/sas_tempbulk-2018-04-11-16-42-51-144_0
Using the Hadoop Resource Manager Job History user interface, you can see the corresponding MapReduce job for each INSERT statement submitted by the CAS worker nodes.
The following figures, code, and other artifacts describe the data save from CAS to the DBMS using multi-node data transfer mode with a subset of the worker nodes. CAS is hosted on a multi-node environment with the SAS Data Connector installed on each node (CAS controller and workers). A CASLIB is defined with NUMWRITENODES=2 so that only 2 workers make direct connections to the DBMS. When the NUMWRITENODES= value is less than the number of available CAS worker nodes, the CAS controller randomly selects the specified number of worker nodes to make the direct connections. The worker nodes that are not chosen copy their data blocks to the chosen worker nodes. The CAS controller then directs each selected worker node to submit an individual INSERT statement that writes its data slice (including the data received from the unchosen nodes). Each selected worker stages its data slice in a temporary file/table on the DBMS before inserting it into the actual table; in the case of Hive, each worker insert again runs as a MapReduce job on the Hadoop cluster.
Code to save a CAS table using a subset of the available worker nodes (a cross-CASLIB save):
CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);
.....
caslib HiveSrlN2 datasource=(srctype="hadoop",
   server="sascdh01.race.sas.com",
   username="hadoop",
   dataTransferMode="SERIAL",
   NUMREADNODES=2,
   NUMWRITENODES=2,
   hadoopconfigdir="/opt/MyHadoop/CDH/Config",
   hadoopjarpath="/opt/MyHadoop/CDH/Jars",
   schema="default");

proc casutil;
   save incaslib="HiveSrl1" outcaslib="HiveSrlN2"
        casdata="prdsal2_1G" casout="prdsal2_1G_cas" replace;
run;
SAS Log extract:
….
….
72
73 proc casutil ;
NOTE: The UUID 'e9547888-e0b0-a74d-aae6-9563121f0568' is connected using session MYSESSION.
74 save incaslib="HiveSrl1" outcaslib="HiveSrlN2"
75 casdata="prdsal2_1G" casout="prdsal2_1G_cas" replace;
NOTE: Executing action 'table.save'.
NOTE: Performing serial SaveTable action using SAS Data Connector to Hadoop.
NOTE: Cloud Analytic Services saved the file prdsal2_1G_cas in caslib HIVESRLN2.
NOTE: Action 'table.save' used (Total process time):
NOTE: real time 97.914763 seconds
NOTE: cpu time 40.201150 seconds (41.06%)
NOTE: total nodes 8 (32 cores)
NOTE: total memory 250.11G
NOTE: memory 22.64M (0.01%)
NOTE: The Cloud Analytic Services server processed the request in 97.914763 seconds.
76 run;
…
In this case, we see only 2 workers (as well as the controller) interacting with the Hive environment; the hosts file listing above again maps the IP addresses to the CAS worker nodes.
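As a quick check from the SAS side, you can list the tables visible through the target CASLIB to confirm that the saved Hive table is there. A sketch using PROC CASUTIL's LIST statement:

proc casutil;
   /* List the source tables/files that the Hive caslib can see in its schema. */
   list files incaslib="HiveSrlN2";
run;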