Multihoming is very common in large Hadoop clusters, and deploying SAS in a Hadoop cluster that uses multihoming can be tricky and raise issues. The purpose of this post is to explore what "multihoming" means in a Hadoop cluster and how it impacts the SAS integration (especially with the SAS Data Loader for Hadoop).
Sometimes you get a "SAS EP not found" error (proc epadmin), but other times (data step, proc sql, DS2, ...) it is hidden in Java errors.
There is a good chance that you are seeing this kind of message because the Hadoop cluster is multihomed and is not fully configured for it.
In such cases, raise the issue with the Hadoop administrator, who should be able to solve it. If there is no Hadoop admin around, or if your counterpart has no clue what "multihoming" means, then this article will hopefully give you the main concepts and help you explain the issue to the customer.
This blog will essentially discuss network architecture considerations. It will help if you are already familiar with the concepts of IP addresses, ports, connections and firewalls, or if you just want to refresh your knowledge of network architecture.
Multihoming refers to the fact that a machine can have multiple Ethernet interfaces and therefore multiple IP addresses.
Figure 1. Multihoming
As illustrated in the diagram above, each host has two network interface controllers (NICs), and each interface has a distinct IP address, corresponding to two different networks with their own IP ranges: say "green" for internal communications and "orange" for communications from the outside.
Multihoming is not Hadoop specific, but there are several good reasons for using it in a Hadoop cluster, for example isolating the internal cluster traffic on a dedicated network while exposing a separate interface to external clients.
Now, the diagram below presents a multihomed Hadoop cluster and some of the associated communications:
Figure 2. Multihomed Hadoop Cluster
The cluster is configured to use the "green" internal network with IPs in the 10.1.9.x range. Aliases defined in the /etc/hosts file were used during the Hadoop cluster setup. Note that by default the Hadoop NameNode is ONLY listening for communications sent to the IP 10.1.9.126.
Recent releases of Hadoop support multihoming by allowing the Hadoop daemons to listen on all network interfaces. Specific properties on the server side and also on the client side (in the *-site.xml files) force binding to the wildcard IP address INADDR_ANY, i.e. 0.0.0.0. The details can be found in the Apache Hadoop documentation on HDFS support for multihomed networks.
I strongly advise you to take the time to READ THAT DOCUMENTATION IF YOU HAVE HADOOP COMMUNICATION ISSUES IN A MULTIHOMED HADOOP CLUSTER.
Now let’s continue with our diagram and a practical case.
Imagine we want to deploy SAS Data Loader for Hadoop (aka DLH) to interact with our multihomed cluster.
Let's assume the Data Loader machine, hosting the vApp, is not part of the Hadoop internal network. That is quite likely, given that the vApp can only run on Windows hosts and Hadoop runs on Linux hosts (in 99% of the cases). So the vApp cannot use the "green" network: it cannot resolve an IP address in the 10.1.9.x range. In such a situation, DLH might struggle to contact the Hadoop services, as described below:
Figure 3. Data Loader for Hadoop Communications failures
Note: To keep this scenario as simple as possible, we will only represent communications with the HDFS service.
Let's look in a little more detail at the root causes of the communication issues and how to fix them.
The first problem occurs when the vApp tries to access the HDFS endpoint on the NameNode.
By default, HDFS endpoints are specified as either hostnames or IP addresses. But in either case, the HDFS daemons will bind to a single IP address, making them unreachable from other networks.
You can diagnose this by running the "netstat" command on the NameNode host to see which address the NameNode port is bound to.
In our example, the NameNode is only listening for communications sent to the IP 10.1.9.126; it cannot be reached through the other IP/interface, even though it is the same machine. It is like someone with two mobile phones, say one for work and one private, who only picks up calls on the private one: although the work phone is ringing, he just ignores it. Adding the following property in hdfs-site.xml (server and client side) makes the NameNode listen on all interfaces for RPC communications (including our "orange" external IP).
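Based on the Apache Hadoop documentation for multihomed networks, the setting in question is dfs.namenode.rpc-bind-host; a minimal hdfs-site.xml fragment would look like this:

<!-- hdfs-site.xml: make the NameNode RPC server listen on all interfaces -->
<property>
  <name>dfs.namenode.rpc-bind-host</name>
  <value>0.0.0.0</value>
</property>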
Note: Other properties might be needed for HTTP access, or in case of NameNode HA/Federation (see the Apache Hadoop documentation). Be aware that an HDFS restart is needed for this configuration change to take effect. After the NameNode restart, re-run the netstat check above to verify that the multihoming settings are correct for the HDFS service.
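For reference, the Apache documentation also describes companion bind-host settings for the other NameNode endpoints; if you need them, a typical hdfs-site.xml fragment would be:

<!-- hdfs-site.xml: optional bind-host settings for the service RPC and HTTP(S) endpoints -->
<property>
  <name>dfs.namenode.servicerpc-bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>dfs.namenode.http-bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>dfs.namenode.https-bind-host</name>
  <value>0.0.0.0</value>
</property>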
The second problem occurs once the NameNode has been successfully contacted and sends back the DataNode IP addresses to the client so that it can communicate with the DataNodes. By default, HDFS clients connect to DataNodes using the IP address provided by the NameNode.
In the case of multihoming, this would be an IP address that only resolves on the "green" network and is therefore unreachable by the SAS client, which uses the "orange" network. A common approach is to tell the Hadoop cluster to return hostnames (instead of IP addresses) when a client interacts with the cluster, and to have DNS on the client network translate that hostname to a resolvable IP address.
Adding the following property in hdfs-site.xml (server and client side) enables clients to perform their own DNS resolution of the DataNode hostname.
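Per the Apache documentation and the SAS note referenced below, the relevant settings are dfs.client.use.datanode.hostname (client side) and, typically, dfs.datanode.use.datanode.hostname (server side); for example:

<!-- hdfs-site.xml: return/resolve DataNode hostnames instead of internal IP addresses -->
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
<property>
  <name>dfs.datanode.use.datanode.hostname</name>
  <value>true</value>
</property>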
Note: Setting this property is also described in SAS Usage Note 55291 and is generally required when you want to interact with the Hadoop cluster from a remote client.
Once the two problems have been fixed, the DLH vApp can access the HDFS services via DNS resolution and external interfaces.
Figure 4. Hadoop Communications failures solved
We're good with HDFS, but DLH also needs to access the YARN Resource Manager and the YARN Node Managers. The same "ignore the phone" logic applies to YARN and MapReduce communications.
Setting the properties below to "0.0.0.0" causes the core YARN and MapReduce daemons to listen on all addresses and interfaces of the hosts in the cluster.
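The exact list depends on the services you use; per the Apache documentation, the usual candidates are the bind-host settings in yarn-site.xml and mapred-site.xml, for example:

<!-- yarn-site.xml: make the YARN daemons listen on all interfaces -->
<property>
  <name>yarn.resourcemanager.bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>yarn.nodemanager.bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>yarn.timeline-service.bind-host</name>
  <value>0.0.0.0</value>
</property>

<!-- mapred-site.xml: same for the MapReduce Job History Server -->
<property>
  <name>mapreduce.jobhistory.bind-host</name>
  <value>0.0.0.0</value>
</property>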
Even though we used the SAS Data Loader for Hadoop example in this blog, similar challenges can be encountered with any solution using SAS/ACCESS to Hadoop and/or the SAS Code Accelerator for Hadoop (EP/DS2 based).
Very likely, when you encounter a multihomed Hadoop cluster, you will only have to change the setting for DataNode hostname resolution (the dfs.datanode.use.datanode.hostname property), as the Hadoop administrator will have configured the rest. Nevertheless, when it comes to a complex integration with Hadoop, understanding the implications can help.
That’s it for today! Thanks for reading