Did you ever ask yourself what happens behind the scenes when you run parallel load from a Hive table into CAS?
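By default, CAS collects the IP addresses of its own nodes and provides them to the Embedded Process (EP) nodes, which use them to connect back to CAS.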
This works as long as the collected and provided IP addresses can be resolved and are routable from the EP nodes.
However, if the CAS Server cluster uses an internal network for its own communications, the addresses it provides belong to that internal network, and the remote Hadoop cluster cannot reach them.
This means the EP on the Hadoop/RDBMS cluster cannot contact the CAS Server nodes, and the parallel load of data fails.
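To check whether this is the problem in a given environment, one option is to probe, from an EP node, the addresses that CAS would hand out. The following is only a minimal sketch: the function names are made up for this post, and the port to probe is an assumption you must supply yourself, since it depends on your deployment.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unroutable networks
        # and names that cannot be resolved.
        return False


def unreachable_nodes(hosts, port, timeout=3.0):
    """Return the hosts (IP addresses or names) that could not be contacted."""
    return [h for h in hosts if not can_connect(h, port, timeout)]
```

Run from an EP node with the list of CAS addresses or host names and the relevant port, any value returned by unreachable_nodes points to a node the Embedded Process would not be able to contact.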
Here is a simple but effective analogy:
Internally at SAS we can successfully call our office colleagues using the extension number for their desk phone.
But if you only provide the SAS employee extension number to a customer (who resides outside of the SAS office), the call will fail, as the customer's phone network has no knowledge of the SAS phone network.
Here are examples of scenarios where the network configuration needs to be factored in:
Take some time to review this diagram, as there are many details. If the diagram is clear to you, you understand the purpose of this post.
These challenges also exist for the SAS 9.4 based LASR/HPA technologies, where IP addresses were already used as the "communication information" sent to the EP nodes so that they could connect back to the LASR/TKGrid cluster.
This challenge was recently addressed with SAS 9.4M5 thanks to the "grid.publichosts" configuration file, which was introduced as a new feature within SAS High-Performance Analytics Infrastructure.
This configuration file distinguishes the IP address/network interface used for internal LASR communications from the one used for external communications with the Embedded Process.
The equivalent of the SAS 9.4/TKGrid grid.publichosts file in the Viya/CAS world is a new parameter: cas.DCHOSTNAMERESOLUTION
However, it works quite differently from the LASR grid.publichosts internal/external mapping configuration file. Let's take a closer look.
This option is available from Viya 3.3 and works with any data store where the Embedded Process is deployed (Hadoop/Hive and Teradata in the current release). Here is the extract from the official administration guide:
So what can you do with this CAS option?
By default, the IP addresses are provided by CAS (cas.DCHOSTNAMERESOLUTION='cas'). But if you change this value to 'ep' or 'ep_fqdn', CAS tells the Embedded Process nodes to use either the names defined in the cas.hosts file or the FQDNs (Fully Qualified Domain Names) of the CAS machines, such as "casworker1.mycompanyhq-d.openstack.mycompany.com".
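The three values can be pictured with a small toy model. This is not SAS code: the CasNode class, the advertised_address function and the sample IP address are hypothetical, and the sketch only mirrors the behavior described above (an IP address for 'cas', the cas.hosts name for 'ep', the FQDN for 'ep_fqdn').

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CasNode:
    ip: str          # address on the CAS internal network (hypothetical)
    hosts_name: str  # name as listed in the cas.hosts file
    fqdn: str        # fully qualified domain name of the machine


def advertised_address(node, mode="cas"):
    """Return what the EP nodes would be told to use for this CAS node."""
    mode = mode.lower()
    if mode == "cas":
        return node.ip          # default: CAS provides the IP address
    if mode == "ep":
        return node.hosts_name  # the EP resolves the cas.hosts name itself
    if mode == "ep_fqdn":
        return node.fqdn        # the EP resolves the FQDN itself
    raise ValueError(f"unsupported value: {mode!r}")


worker = CasNode(
    ip="192.168.10.21",
    hosts_name="casworker1",
    fqdn="casworker1.mycompanyhq-d.openstack.mycompany.com",
)

for mode in ("cas", "ep", "ep_fqdn"):
    print(mode, "->", advertised_address(worker, mode))
```

The point of the model is that with 'ep' and 'ep_fqdn' the value handed to the EP is a name, so the IP address it finally maps to is decided on the EP side.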
So while in LASR/HPA TKGrid the grid.publichosts option explicitly associates internal machine names with external machine names, CAS uses a single set of CAS host names.
But because each name is communicated as a hostname (short form or FQDN) rather than as an IP address, the Hadoop cluster side can associate a different IP address with each of those CAS machine hostnames!
The network configuration of the Hadoop nodes can then ensure that the communicated machine names resolve to the appropriate IP addresses (usually through the local /etc/hosts file on each EP node).
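As an illustration of that last step, here is a small sketch that reads hosts-file style text and shows which IP address an EP node would obtain for a CAS host name. The addresses and the casworker2 entry are invented for the example, and a real system resolver may also consult DNS, which this sketch ignores.

```python
def parse_hosts(text):
    """Map each host name found in /etc/hosts style text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            # Keep the first entry found for a given name.
            mapping.setdefault(name.lower(), ip)
    return mapping


# Hypothetical entries on a Hadoop EP node: the CAS names point to
# addresses that are routable from the Hadoop cluster.
EP_NODE_HOSTS = """\
# CAS nodes as seen from the Hadoop cluster
10.20.30.41  casworker1.mycompanyhq-d.openstack.mycompany.com  casworker1
10.20.30.42  casworker2.mycompanyhq-d.openstack.mycompany.com  casworker2
"""

hosts = parse_hosts(EP_NODE_HOSTS)
print(hosts["casworker1"])
print(hosts["casworker1.mycompanyhq-d.openstack.mycompany.com"])
```

Both the short name and the FQDN map to the same externally routable address, so either 'ep' or 'ep_fqdn' would lead the EP node to the right interface in this hypothetical setup.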
That's all. Thanks for reading!