Attentive readers may have noticed that the Deployment Guide for the SAS Viya 3.2 platform (aka the 17w12 ship) included a new “Hadoop scenario” called “Remote Access to HDFS”.
A 100% accurate name for this deployment case should actually be "Remote Access to SASHDAT (on HDFS)". The purpose of this blog is to introduce this new deployment scenario and discuss the associated requirements and impacts.
This new capability is only about SASHDAT files on HDFS. As a reminder, SASHDAT is a SAS proprietary format that is optimized for SAS In-Memory Solutions (it already existed for LASR, but has been updated and improved for CAS, the SAS Viya In-Memory Analytics engine).
This format is an optimized “on-disk” representation of the “in-memory” table. The key benefit is that it allows you to save your CAS table on disk and to quickly reload a CAS table from disk. The main limitation is that the SASHDAT format can only be used by SAS High-Performance engines.
With CAS you can store SASHDAT files:
This article will focus on the last case, which corresponds to this new deployment scenario. In this scenario we only interact with the HDFS component of Hadoop; there is no interaction with Hive, MapReduce or any other component of the Hadoop ecosystem.
Initially, with LASR (previous generation of the SAS In-Memory engine), in order to use SASHDAT format you had to follow strict co-location rules between the SAS In-memory cluster and the Hadoop cluster components:
This model was also often called “symmetric deployment”.
Of course, this “strictly co-located or symmetric” topology remains available with CAS and is probably the most efficient from a performance perspective, as HDFS blocks are transferred directly from memory to disk (and vice versa), locally on each node, without having to travel across the network.
However, to address various use cases and specific customer requests, these requirements were relaxed for CAS in the SAS Viya platform (3.1). It was already possible to work with SASHDAT files in HDFS even if you were not using a strictly co-located topology:
With the latest SAS Viya version (3.2), you can completely “disjoin” the CAS and Hadoop clusters and continue to use the SASHDAT on HDFS file format to efficiently store and load your CAS tables to and from a remote HDFS cluster. There is no need to license any Data connector or Connect Accelerator, as CAS will rely on the SAS Plug-ins for Hadoop® which are included with the software.
This scenario is supported with the following Hadoop versions:
The key requirements are:
Whereas the first requirement is nicely documented in the deployment guide (Appendix E: Hadoop Deployment: Configuring CAS SASHDAT Access to HDFS), I thought it could be useful to explain a little bit more about this "Password-less SSH" requirement.
As we are not in a co-located deployment model, our CAS Workers will have to talk, across the network, with our Hadoop Data nodes. The communication channel used for that is SSH. This kind of connection requires authentication, and username/password authentication is generally the default.
As we need seamless communications, we cannot afford to be prompted for each counterpart connection. To avoid that, we have to ensure that each SSH connection from any CAS node to any Hadoop node can be made "password-less" for our CAS users.
The first thing to ensure is that the CAS user account(s) exist in both the CAS and Hadoop clusters. Typically the two clusters are connected to the same LDAP. Otherwise, if you are using local accounts, you might have to create your CAS users in the Hadoop cluster.
Then for the password-less SSH mechanism, it can be achieved:
OK, so whatever the password-less mechanism is, we need to ensure that we can "ssh" from any CAS node to any Hadoop node. For example, imagine our CAS node machines are sascas01, sascas02, sascas03 and sascas04, and the Hadoop machines are sashdp01, sashdp02, sashdp03 and sashdp04. The little script below can then be used to ensure that all communications work without asking for a password:
If the script prints a report with all the “From/To” combinations without prompting for a password, then the test is successful...for the user account that you used to run it.
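A check of this kind can be sketched in Python as follows. This is only an illustrative sketch, not the script from the deployment guide. The helper names (`build_command`, `check_all`, `format_report`) are hypothetical, the host names come from the example above (assuming the fourth Hadoop node is sashdp04), and the sketch assumes that the machine running it can itself reach each CAS node over password-less SSH. It relies on the standard OpenSSH `BatchMode=yes` option, which makes `ssh` fail rather than prompt for a password.

```python
import subprocess
from itertools import product

# Host names from the example above (assumption: the fourth Hadoop node is sashdp04).
CAS_NODES = ["sascas01", "sascas02", "sascas03", "sascas04"]
HADOOP_NODES = ["sashdp01", "sashdp02", "sashdp03", "sashdp04"]

# BatchMode=yes makes ssh fail instead of prompting for a password.
SSH_OPTS = ["-o", "BatchMode=yes", "-o", "ConnectTimeout=5"]


def build_command(source, target):
    """Return the argv that hops to `source`, then runs ssh to `target` from there."""
    inner = " ".join(["ssh", *SSH_OPTS, target, "hostname"])
    return ["ssh", *SSH_OPTS, source, inner]


def check_all(sources, targets, run=subprocess.run):
    """Try every From/To pair and return a list of (source, target, ok) tuples."""
    report = []
    for source, target in product(sources, targets):
        try:
            result = run(build_command(source, target), capture_output=True, timeout=30)
            ok = result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            # ssh binary missing or the connection hung: count it as a failure.
            ok = False
        report.append((source, target, ok))
    return report


def format_report(report):
    """Render one 'From ... To ...: OK/FAILED' line per pair."""
    return "\n".join(
        f"From {src} To {dst}: {'OK' if ok else 'FAILED'}" for src, dst, ok in report
    )
```

To use it, log on as the account you want to validate and run `print(format_report(check_all(CAS_NODES, HADOOP_NODES)))`. Any FAILED line points to a pair that would still prompt for a password or that could not connect at all. A pair may also fail because the target's host key has not yet been accepted, since batch mode cannot answer that prompt either.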
So now the question is: for which user account must this password-less authentication be set up and tested? The short answer is: “for the OS user ID that the CAS session runs under”.
So actually it depends. With the current version of SAS Viya, Visual login comes through OAuth, and the CAS session then runs under the “cas” account. For SAS Studio, the login comes through username and password, and sessions are started as the logged-in user.
So if you are working with SASHDAT tables from Visual interfaces (VA, VDB, EV), then the account that needs password-less SSH access will be “cas”.
But if you are using SAS Studio to perform your operations with SASHDAT, then you will need to enable password-less SSH access for the account used to log on to SAS Studio (for example, “sasdemo”).
Once the password-less SSH mechanism is in place for the required accounts, the configuration is pretty straightforward. It is also documented in the installation and configuration guide. When deploying CAS on machines completely separate from the HDFS machines, we will need to revise the variable values (vars.yml file) as follows:
Once that is done, run the SAS Viya Ansible playbook to put in place all the configuration needed to support remote SASHDAT access.
The fact that this mode of deployment works and is supported does NOT necessarily mean it is something you should implement in such a “CAS MPP with remote Hadoop cluster” environment.
Hive is still the “de facto” standard for Hadoop data warehouses, so your customer will very likely ask you to work with Hive tables that can then be used by other industry-standard third-party tools.
Another important point is that using Hive generates MapReduce or Tez tasks that can be under the control of YARN (which is not the case when you perform SASHDAT save/load operations).
However, depending on future performance feedback, and assuming the customer will accept the extra configuration work (SSH keys), this scenario might address some specific customer use cases or complement Hive data access.
Thank you for reading!