Interested observers might have noticed that in the Deployment Guide for the SAS Viya 3.2 platform (aka the 17w12 ship), there is a new “Hadoop scenario” called “Remote Access to HDFS”.
A 100% accurate name for this deployment case should actually be "Remote Access to SASHDAT (on HDFS)". The purpose of this blog is to introduce this new deployment scenario and discuss the associated requirements and impacts.
This new capability is only about SASHDAT files on HDFS. As a reminder, SASHDAT is a SAS proprietary format that is optimized for the SAS In-Memory solutions (it already existed for LASR, but has been updated and improved for CAS, the SAS Viya In-Memory Analytics engine).
This format is an optimized “on-disk” representation of the “in-memory” table. The key benefit is that it allows you to save your CAS table on disk and to quickly reload a CAS table from disk. The main limitation is that the SASHDAT format can only be used by SAS High-Performance engines.
With CAS you can store SASHDAT files in several locations: on a path local to the CAS nodes, on an HDFS cluster co-located with CAS, or, new with SAS Viya 3.2, on a remote HDFS cluster.
This article will focus on the last case, which corresponds to this new deployment scenario. In this scenario we only interact with the HDFS component of Hadoop; there is no interaction with Hive, MapReduce, or any other component of the Hadoop ecosystem.
Initially, with LASR (the previous generation of the SAS In-Memory engine), in order to use the SASHDAT format you had to follow strict co-location rules between the SAS In-Memory cluster and the Hadoop cluster components: the SAS In-Memory worker nodes had to run directly on the Hadoop data nodes.
This model was also often called “symmetric deployment”.
Of course this “strictly co-located” or “symmetric” topology remains available with CAS, and it is probably the most efficient from a performance perspective, as HDFS blocks are transferred directly between memory and disk locally on each node, without having to travel across the network.
However, to address various use cases and specific customer requests, these requirements were relaxed for CAS in the SAS Viya platform (3.1): it was already possible to work with SASHDAT files in HDFS even if you were not using a strictly co-located, symmetric topology.
With the latest SAS Viya version (3.2), you can completely “disjoin” the CAS and Hadoop clusters and continue to use the SASHDAT-on-HDFS file format to efficiently store and load your CAS tables to and from a remote HDFS cluster. There is no need to license any SAS Data Connector or SAS Data Connect Accelerator, as CAS relies on the SAS Plug-ins for Hadoop, which are included with the software.
This scenario is supported with the following Hadoop versions:
The key requirements are: first, the SAS Plug-ins for Hadoop must be deployed and configured on the remote Hadoop cluster; second, password-less SSH must be possible from the CAS nodes to the Hadoop nodes.
Whereas the first requirement is nicely documented in the deployment guide (Appendix E: Hadoop Deployment: Configuring CAS SASHDAT Access to HDFS), I thought it could be useful to explain a little bit more about this "Password-less SSH" requirement.
As we are not in a co-located deployment model, our CAS workers have to talk, across the network, with our Hadoop data nodes. The communication channel used for that is SSH. This kind of connection requires authentication, and username/password authentication is generally the default.
As we need seamless communications, we cannot afford to be prompted for a password for each connection. To avoid that, we have to ensure that, for our CAS users, every SSH connection from any CAS node to any Hadoop node can be made "password-less".
The first thing to ensure is that the CAS user account(s) exist in both the CAS and Hadoop clusters. Typically the two clusters are connected to the same LDAP; otherwise, if you are using local accounts, you might have to create your CAS users on the Hadoop cluster machines.
Then the password-less SSH mechanism itself can be achieved in several ways, the most common being the generation and exchange of SSH keys for the account, as in the sketch below.
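A minimal sketch, assuming the example host names used later in this article and an account whose home directory is not shared across machines (if it is shared, generating and authorizing the key once is enough):

```bash
#!/bin/bash
# Run as the relevant account (for example "cas" or "sasdemo") on each CAS node.

# Generate a key pair without a passphrase (skip if one already exists)...
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa

# ...then copy the public key to every Hadoop node (you are prompted for the
# account password one last time for each node).
for hdp in sashdp01 sashdp02 sashdp03 sashdp04; do
  ssh-copy-id "$hdp"
done
```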
Ok, so, whatever the password-less mechanism is, we need to ensure that we can "ssh" from any CAS node to any Hadoop node. For example, imagine our CAS node machines are sascas01, sascas02, sascas03 and sascas04, and the Hadoop machines are sashdp01, sashdp02, sashdp03 and sashdp04; then the little script below can be used to ensure that all communications work without asking for any password:
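A minimal sketch of such a test script, assuming the host names above (adjust the lists, and run it from one of the CAS nodes as the account that will own the CAS session; note that the first hop, to each CAS node, must also be password-less from the machine where you run it):

```bash
#!/bin/bash
# Hosts from the example above -- adjust to your environment.
CAS_NODES="sascas01 sascas02 sascas03 sascas04"
HADOOP_NODES="sashdp01 sashdp02 sashdp03 sashdp04"

for cas in $CAS_NODES; do
  for hdp in $HADOOP_NODES; do
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # so any missing key or permission shows up as "FAILED" in the report.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$cas" \
         "ssh -o BatchMode=yes -o ConnectTimeout=5 $hdp hostname" >/dev/null 2>&1
    then
      echo "From $cas to $hdp : OK"
    else
      echo "From $cas to $hdp : FAILED"
    fi
  done
done
```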
If the script prints a report with all the “From/To” combinations without prompting for any password, then the test is successful... for the user account that you used to run it.
So now the question is: for which user account must this password-less authentication be set up and tested? The short answer is: “for the OS user ID that the CAS session runs under”.
So actually it depends. With the current version of SAS Viya, logins from the visual interfaces come through OAuth, and the CAS session then runs under the “cas” account. For SAS Studio, the login comes through username and password, and sessions are started as the logged-on user.
So if you are working with SASHDAT tables from Visual interfaces (VA, VDB, EV), then the account that needs password-less SSH access will be “cas”.
But if you are using SAS Studio to perform your operations with SASHDAT, then you will need to enable password-less SSH access for the account used to log on to SAS Studio (for example, “sasdemo”).
Once the password-less SSH mechanism is in place for the required accounts, the configuration is pretty straightforward. It is also documented in the installation and configuration guide. When deploying CAS on machines completely separate from the HDFS machines, we need to revise the variable values in the vars.yml file as follows:
Once that is done, run the SAS Viya Ansible playbook to put all the configuration in place to support remote SASHDAT access.
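For example (assuming the default playbook name used by the SAS Viya 3.2 Ansible deployment, run from the sas_viya_playbook directory on the Ansible controller machine):

```bash
# Re-run the deployment playbook so that the revised vars.yml is applied.
ansible-playbook site.yml
```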
Finally, you can validate that the remote SASHDAT configuration is correct by saving a CAS table to the remote HDFS, with a code sample of this kind:
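A minimal sketch, assuming a session-scoped caslib named “hdfslib” that points to an HDFS directory your user can write to (the caslib name, HDFS path, and table names are only examples):

```sas
cas mysess;

/* Caslib of type HDFS pointing to a directory on the remote HDFS cluster */
caslib hdfslib datasource=(srctype="hdfs") path="/user/sasdemo" sessref=mysess;

proc casutil sessref=mysess;
   /* Load a sample table into CAS memory... */
   load data=sashelp.cars outcaslib="hdfslib" casout="cars";
   /* ...save it as a SASHDAT file on the remote HDFS... */
   save casdata="cars" incaslib="hdfslib" outcaslib="hdfslib" casout="cars.sashdat" replace;
   /* ...and list the files in the HDFS directory to check the result. */
   list files incaslib="hdfslib";
quit;

cas mysess terminate;
```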
I know I am repetitive, but I really think the "troubleshooting" section is probably the most useful part of this kind of blog 🙂 so I'd like to share two issues I faced with this kind of setup. Unfortunately, in both cases we see the same generic error message:
During the remote SASHDAT load, two scripts (which are part of the SAS Plug-ins for Hadoop component deployed on the remote Hadoop cluster) are executed: "start-namenode-cas-hadoop.sh" and "start-datanode-cas-hadoop.sh".
Under specific conditions (a reboot, for example) the execute permissions on these scripts can be lost, and they then have to be fixed before the remote SASHDAT load can work again.
So it is always a good idea to check the permissions with a command like the one below.
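A sketch, assuming you first need to locate the scripts (their installation path varies with how the SAS Plug-ins for Hadoop were deployed, for example as packages or as Cloudera parcels):

```bash
# Run on the Hadoop nodes: locate the two scripts and show their permissions.
find / -name "start-*-cas-hadoop.sh" -exec ls -l {} \; 2>/dev/null

# If the execute bit is missing, restore it (adjust the paths accordingly):
# chmod 755 /path/to/start-namenode-cas-hadoop.sh /path/to/start-datanode-cas-hadoop.sh
```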
The other issue was observed with the SAS Plug-ins for Hadoop deployed as parcel packages in Cloudera. We had the same generic error message, but nothing else. So a great tip when debugging the SAS Plug-ins for Hadoop (even in co-located mode) is to open one of the two scripts discussed above and to activate a trace.
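The relevant part of the script looks something like the sketch below (the exact lines may differ slightly in your version of the plug-ins):

```bash
# Inside start-namenode-cas-hadoop.sh / start-datanode-cas-hadoop.sh,
# the stderr of the launched process is discarded by default:
HADOOP_CAS_STDERR_LOG=/dev/null
# To activate the trace, point it to a writable file instead, for example:
# HADOOP_CAS_STDERR_LOG=/tmp/cas-hadoop-debug.log
```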
As you can see, there is this HADOOP_CAS_STDERR_LOG variable that you can redirect to a temporary location (replace /dev/null with a file path) to see exactly what the issue is.
So once we enabled this trace, we saw an error message complaining about a missing JAVA_HOME!
And indeed, checking with "which java" and "echo $JAVA_HOME", we soon realized that our user had no way to call Java (which is required to launch the command in the start-namenode-cas-hadoop.sh script...).
Cloudera uses a specific "cloudera"-named path for its JDK, which was not available in the script execution context.
As a consequence, we amended all the "start-namenode-cas-hadoop.sh" and "start-datanode-cas-hadoop.sh" scripts on the Hadoop nodes by explicitly adding an additional line:
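Something like the line below (the JDK path is only an example; use the path of the JDK actually installed on your Cloudera nodes):

```bash
# Added near the top of start-namenode-cas-hadoop.sh and
# start-datanode-cas-hadoop.sh on every Hadoop node.
export JAVA_HOME=/usr/java/jdk1.8.0_121-cloudera
```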
And then the remote SASHDAT load worked like a charm!
Finally, during your troubleshooting, when you perform the remote SASHDAT load you might notice messages like the ones below in your CAS controller logs:
Don't go down this track: these messages are expected in Viya 3.2 with the remote HDFS topology (as the Hadoop binaries are not present on the CAS machines) and are not at all harmful. You can simply ignore them.
The fact that this deployment mode works and is supported does NOT necessarily mean it is something you should systematically implement in such a “CAS MPP with remote Hadoop cluster” environment.
Hive is still the “de facto” standard for Hadoop data warehouses, so your customer will very likely ask you to work with Hive tables, which can then also be used by other industry-standard third-party tools.
Another important point is that using Hive generates MapReduce or Tez tasks that run under the control of YARN (which is not the case when you perform SASHDAT save/load operations).
However, depending on future performance feedback, and assuming the customer accepts the extra configuration work (SSH keys), this scenario might address some specific customer use cases or be a complement to Hive data access.
Thank you for reading!