BookmarkSubscribeRSS Feed

Hello YARN! I am CAS, can we play together?

Started ‎06-26-2017 by
Modified ‎12-05-2017 by
Views 2,069

YARN (Yet Another Resource Negotiator) is the Hadoop resource manager, its primary goal is to ensure that every actor has a fair share (or whatever share which was planned for him) of the resources. He is the one you have to talk with before asking the Hadoop cluster to process something for you.

Several of our 9.4 products are working from/in/with Hadoop (SAS/ACCESS to Hive, In-Database accelerators for Hadoop, HPA/LASR based solutions) and have integration capabilities with YARN.

In this article, we will see how CAS (our new High-Performance Analytics engine for Viya) can be a good hadoop “citizen” for its data loading and analytics actions. The discussion here is mostly relevant for situations where CAS Nodes and Hadoop nodes are sharing the same infrastructure.

 

How does CAS use YARN?

 

There is no “true” and dynamic integration between CAS and YARN because individual CAS jobs will not be run in YARN containers and be under the control of YARN. However, in situations where your CAS server is deployed inside the Hadoop cluster there are ways to ease the coexistence of CAS jobs and other Hadoop jobs (Mapreduce, Spark, Impala, etc…) both being generally "memory-intensive" and fighting for resource...

1.png
 
CAS, at start-up time, can be restricted to a certain amount of memory and “reserve” this amount of memory against YARN as long as it is running.

Two properties will be used to reserve memory and notify the YARN Application Master : cas.useyarn and cas.memorysize.

These properties are documented in the Viya Administration Guide

The first property tells CAS to work with YARN and the second one gives the maximum amount of memory that each CAS node can use.


A Resource sharing example

Before choosing the amount of memory you want to reserve for SAS, you might want to know the amount of memory available in the Hadoop cluster.


The property yarn.nodemanager.resource.memory-mb will tell you how much memory can be used by YARN on each node manager.

The Hadoop administration console will show you this value, for example in Cloudera Manager:

 

2.png
 
You can also get this information from the Resource Manager Web UI:

 

 

3.png
 
So, for example, to leave some resources to other hadoop jobs, we can decide to allocate 2 GB of memory per node to CAS in our case. Then Hadoop still has a total of 18GB of ram for other requests. We add the 2 rows below in the CAS configuration file (casconfig.lua) :


cas.useyarn= true cas.memorysize= '2g'


Like in the "old" TKGrid world, make sure your changes in this file are propagated on all nodes, for example you could run the ansible command below, to copy the file accross all the nodes :

# ansible all -m copy -a "src=/opt/sas/viya/config/etc/cas/default/casconfig.lua dest=/opt/sas/viya/config/etc/cas/default/casconfig.lua owner=cas group=sas mode=0644" -b


Then we can restart the CAS controller:

[root@gatekrbhdp01 ~]# systemctl restart sas-viya-cascontroller-default.service


We can use the CAS Server monitor to make sure that the new properties have been taken into account:


4.png

 

As soon as we restart the CAS Controller, a YARN application will be started in the Hadoop cluster and "pre-empt" the amount of memory that we specified.

 

5.png
 
The application will not progress but always stays at 50% of progress for the whole CAS Server lifecycle.

We can also notice that a linux CGroup has been created to restrict the memory usage of the CAS process at the system level.

[root@gatekrbhdp02 sas_viya_playbook]# ansible all -m shell -a "ls -l /sys/fs/cgroup/memory/ | grep CAS"
viyaserver2 | SUCCESS | rc=0 >> drwxr-xr-x 2 root sas 0 Oct 5 04:14 CASMemory_28001
viyaserver3 | SUCCESS | rc=0 >> drwxr-xr-x 2 root sas 0 Oct 5 04:14 CASMemory_28001
viyaserver4 | SUCCESS | rc=0 >> drwxr-xr-x 2 root sas 0 Oct 5 04:14 CASMemory_28001
viyaserver1 | SUCCESS | rc=0 >> drwxr-xr-x 2 root sas 0 Oct 5 04:13 CASMemory_28001


Now let's test in SAS Studio, for example with the “Supervised Learning” snippet and see what happens.


6.png
 
When we open a CAS session and run the program, nodes activities are reported in the Resource monitor:

7.png
 
We notice that the memory usage for jobs stay below the memory size defined (4x2GB).

On the Hadoop Resource Manager UI, nothing happens as we already have a “super” application in state “RUNNING” for the whole CAS server lifetime.

Simply, when we stop the CAS Server, the YARN application is killed and the status change from “RUNNING” to “SUCCEEDED”


YARN queue for data loading in CAS


CAS does not process data directly in Hadoop.
Instead CAS will load data from Hadoop (Hive or HDFS via HDMD) into the CAS In-Memory Server.

This initial data load leverages MapReduce processing in the hadoop cluster, so it can be under the control of YARN via the YARN queues.

For the purpose of this article we create a queue called “CAS” (they are called “pools” in the Cloudera Manager but actually the 2 terms are interchangeable) and use it.


8.png

 
Serial load

If the Data connector (SAS/ACCESS type integration) is used, then the Hadoop client on the CAS controller will establish a single connection to the Hadoop cluster, collect the data in Hadoop then distribute it across the CAS nodes.

Note: Depending on the type of operation, the size of the data and the Hive tuning, Hive can use “fetch” actions or HDFS and metadata operations instead of executing a MapReduce job. This will drastically improve the efficiency of the processing but also make those actions invisible for YARN (see this paper for additional details on ways to optimize SAS with Hive).

For example “proc casutil content” and a small table load in CAS are cases, where you will not see any MapReduce job in the YARN Resource Manager console.

Back to our example, to make sure that a MapReduce job will be triggered we apply a filter on the hive table (with the dbmsWhere option) before loading it into CAS. To perform the filter, Hive runs a mapreduce job.

caslib hivelib desc="HIVE Caslib" datasource=(SRCTYPE="hadoop",SERVER="gatekrbhdp01.gatehadoop.com", HADOOPCONFIGDIR="/opt/sas/hadoop/client_conf", HADOOPJARPATH="/opt/sas/hadoop/client_jar", schema="default",dfDebug=sqlinfo, properties="mapred.job.queue.name=CAS");
proc casutil ;
load casdata="stocks" casout="stocks" options=(dbmswhere="open>100");
quit;


In the caslib definition, we use the “properties” option to specify the YARN queue that we want to use. Now if we look at the Resource manager, we can see that our load operation has been assigned to the “CAS” queue defined earlier.


9.png


Parallel load


If the Data Connect Accelerators (In-Database type integration) are used, then parallel connections are established between Hadoop nodes (where SAS Embedded Process have been deployed) and CAS worker nodes, allowing a much more effective parallel load. This time, the properties option in the CASLib is not enough, we have to specify our CAS queue directly by adding the 4 lines below in the hadoop client configuration file: mapred-site.xml

[root@gatekrbhdp01 hadoop]# vi client_confCAS/mapred-site.xml
<property>
<name>mapreduce.job.queuename</name>
<value>root.CAS</value>
</property>


Now we can test with the following program:

/* Assign EP HIVE CASLIB */
caslib hiveEP clear;
caslib hiveEP datasource=(srctype="hadoop" server="gatekrbhdp01.gatehadoop.com" dataTransferMode="parallel" hadoopconfigdir="/opt/sas/hadoop/client_confCAS" hadoopjarpath="/opt/sas/hadoop/client_jar");
/* Load Table in-Parallel */
proc casutil; load casdata="stocks" casout="stocks"; quit;


The parallel load is assigned in the CAS YARN queue :


10.png
 


Conclusion

 

11.png

 

As discussed in this article, don't forget that CAS has several ways to integrate nicely with Hadoop via YARN, either at the CAS Service level (on start-up time) for shared cluster, or dynamically for the data loading operations.

Thanks for reading !

Version history
Last update:
‎12-05-2017 04:52 AM
Updated by:
Contributors

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags