
Working with multi-homed CAS clusters and the Embedded Process (DCHOSTNAMERESOLUTION new option)

by SAS Employee RPoumarede on ‎03-29-2018 08:23 AM - edited on ‎03-29-2018 09:32 AM by Community Manager (1,779 Views)

Have you ever asked yourself what happens behind the scenes when you run a parallel load from a Hive table into CAS?

 

The default process is:

  1. When the CAS Controller launches the EP (Embedded Process) job, it collects the IP addresses that "the CAS grid knows itself as" (by making a host TCP API call on the CAS grid).
  2. When the EP job then starts on the Hadoop cluster, it uses the collected IP addresses to contact the CAS workers and start streaming the data blocks.

This works as long as the collected and provided IP addresses can be resolved and are routable from the EP nodes.
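One way to verify this condition is to probe the collected addresses from an EP node. The sketch below is only an illustration, not a SAS utility: the function names are made up for this example, and the addresses and port you pass in are assumptions you must replace with the values that apply to your deployment.

```python
import socket


def is_reachable(address, port, timeout=3.0):
    """Return True if a TCP connection to (address, port) succeeds in time."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unroutable networks.
        return False


def unreachable_workers(addresses, port, probe=is_reachable):
    """Return the subset of addresses that the probe could not reach."""
    return [addr for addr in addresses if not probe(addr, port)]


# Hypothetical usage, run from a Hadoop node that hosts the EP:
#   unreachable_workers(["192.168.1.11", "192.168.1.12"], port=<your CAS port>)
# Any address returned here would make the parallel load fail.
```

If the returned list is not empty, the EP nodes cannot stream data to those CAS workers with the addresses as provided.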

 

However, if the CAS Server cluster uses an internal network for its own communications, the collected addresses belong to that internal network, and the remote Hadoop cluster cannot reach them.

 

This means the EP on the Hadoop/RDBMS cluster will not be able to contact the CAS Server nodes, and the parallel load of data will fail.

 

Here is a simple but effective analogy:

Internally at SAS we can successfully call our office colleagues using the extension number for their desk phone.

But if you only provide the SAS employee extension number to a customer (who resides outside of the SAS office), the call will fail, as the customer's phone network has no knowledge of the SAS phone network. 

 

Here are examples of scenarios where the network configuration needs to be factored in:

  1. A CAS grid deployed in an infrastructure like OpenStack, which uses internal IP addresses that are not reachable from the remote Hadoop cluster where you have deployed the SAS Embedded Process.
  2. Multi-homed servers, which allow different kinds of network traffic to be isolated to ensure consistent and predictable latency and throughput for the servers.

 notworking.png

 

 

Take some time to review this diagram, as there are many details. If the diagram is clear to you, you understand the purpose of this post.

 

These challenges also exist for the SAS 9.4 based LASR/HPA technologies. There too, IP addresses were the "communication information" sent to the EP nodes so that they could connect back to the LASR/TKGrid cluster.

 

This challenge was recently addressed with SAS 9.4M5 thanks to the "grid.publichosts" configuration file, which was introduced as a new feature within SAS High-Performance Analytics Infrastructure.

 

This configuration file provides a way to distinguish which IP address/network interface is used for internal LASR communications and which is used for external communications with the Embedded Process.

 

The cas.DCHOSTNAMERESOLUTION option

 

In the Viya/CAS world, the equivalent of the SAS 9.4/TKGrid grid.publichosts file is a new parameter called cas.DCHOSTNAMERESOLUTION.

 

However, it works quite differently from the LASR grid.publichosts internal/external mapping configuration file. Let's take a closer look.

 

This option is available from Viya 3.3 and works with any data store where the Embedded Process is deployed (Hadoop/Hive and Teradata in the current release). Here is the extract from the official administration guide:

 

documentedoption_2.png

 

 

So what can you do with this CAS option?

 

By default, CAS provides the IP addresses (cas.DCHOSTNAMERESOLUTION='cas'). If you change this value to 'ep' or 'ep_fqdn', CAS instead tells the Embedded Process nodes to use either the names defined in the cas.hosts file or the CAS machines' FQDNs (Fully Qualified Domain Names), such as "casworker1.mycompanyhq-d.openstack.mycompany.com".
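To make the three values concrete, here is a toy model of which identifier ends up being communicated to the EP for each setting. It is a reading of the paragraph above, not SAS code: the `CasWorker` class, the `identifier_sent_to_ep` function, the internal IP address, and the short host name are all hypothetical; only the option values and the example FQDN come from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CasWorker:
    internal_ip: str  # address "the CAS grid knows itself as" (hypothetical)
    hosts_name: str   # name as listed in the cas.hosts file (hypothetical)
    fqdn: str         # fully qualified domain name


def identifier_sent_to_ep(worker, mode="cas"):
    """Return what the EP nodes receive for one worker under a given mode."""
    if mode == "cas":
        return worker.internal_ip  # default: CAS provides the IP address
    if mode == "ep":
        return worker.hosts_name   # EP resolves the cas.hosts name itself
    if mode == "ep_fqdn":
        return worker.fqdn         # EP resolves the FQDN itself
    raise ValueError(f"unsupported value: {mode!r}")


worker = CasWorker(
    internal_ip="192.168.1.11",
    hosts_name="casworker1",
    fqdn="casworker1.mycompanyhq-d.openstack.mycompany.com",
)
for mode in ("cas", "ep", "ep_fqdn"):
    print(mode, "->", identifier_sent_to_ep(worker, mode))
```

The point of the model is that only the default value hands the EP an address it cannot reinterpret; the other two values hand it a name, which each EP node resolves on its own side.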

 

So, while the grid.publichosts option in LASR/HPA TKGrid explicitly associates internal machine names with external machine names, CAS uses one and the same set of CAS host names.

 

But because each host can be communicated as a hostname (short form or FQDN) rather than as an IP address, the Hadoop cluster side can associate a different IP address with those CAS machine hostnames!

 

The network configuration on the Hadoop nodes can then be set up so that the communicated machine names resolve to the appropriate IP addresses (usually through the local /etc/hosts file on each EP node).
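The following sketch illustrates the idea with two hosts-file fragments: the same CAS worker name resolves to an internal address on the CAS side and to an externally reachable address on an EP node. All addresses and names are made up for the example, and `parse_hosts` is a simplified, hypothetical parser of the /etc/hosts format rather than the system resolver.

```python
from typing import Dict


def parse_hosts(text: str) -> Dict[str, str]:
    """Map every host name in hosts-file text to its IP address."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name, ip)  # keep the first entry for a name
    return mapping


# Hypothetical view from inside the CAS cluster (internal network).
CAS_SIDE = """
192.168.1.11  casworker1.mycompany.com casworker1
192.168.1.12  casworker2.mycompany.com casworker2
"""

# Hypothetical /etc/hosts on an EP node (externally reachable network).
EP_SIDE = """
# CAS workers, reachable from the Hadoop cluster
10.20.30.11  casworker1.mycompany.com casworker1
10.20.30.12  casworker2.mycompany.com casworker2
"""

print(parse_hosts(CAS_SIDE)["casworker1"])  # 192.168.1.11
print(parse_hosts(EP_SIDE)["casworker1"])   # 10.20.30.11
```

Because CAS only communicates the name, each side is free to map it to the address that is routable from where it sits.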

 

working_3.png

 

That's all. Thanks for reading!
