
Remote access to SASHDAT (on HDFS), a new Hadoop scenario in SAS Viya


Interested observers might have noticed that the Deployment Guide for the SAS Viya 3.2 platform (aka the 17w12 ship) introduces a new “Hadoop scenario” called “Remote Access to HDFS”.

 
A 100% accurate name for this deployment case would actually be "Remote Access to SASHDAT (on HDFS)". The purpose of this blog is to introduce this new deployment scenario and discuss the associated requirements and impacts.


Scope and Background


 
This new capability is only about SASHDAT files on HDFS. As a reminder, SASHDAT is a SAS proprietary format that is optimized for SAS In-Memory Solutions (it already existed for LASR, but has been updated and improved for CAS, the SAS Viya In-Memory Analytics engine).

This format is an optimized “on-disk” representation of the “in-memory” table. The key benefit is that it allows you to save a CAS table to disk and to quickly reload it from disk. The main limitation is that the SASHDAT format can only be used by SAS High-Performance engines.

With CAS you can store SASHDAT files (see the example CASLIB statements right after this list):

  • either on your CAS Controller local file system - datasource=(srctype="path")
  • or on a NFS share visible by all your CAS workers - datasource=(srctype="dnfs")
  • or on Hadoop Distributed File System (HDFS), across your Hadoop data nodes - datasource=(srctype="hdfs")
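
For illustration, here is roughly what the corresponding CASLIB statements could look like in a SAS session (the caslib names and paths below are purely hypothetical examples):

/* Hypothetical caslib definitions for the three SASHDAT storage options */
caslib mypath datasource=(srctype="path") path="/opt/sas/data";    /* CAS controller local file system */
caslib mydnfs datasource=(srctype="dnfs") path="/mnt/nfs/casdata"; /* NFS share visible to all CAS workers */
caslib myhdfs datasource=(srctype="hdfs") path="/casdata";         /* HDFS, across the Hadoop data nodes */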


This article will focus on the last case, which corresponds to this new deployment scenario. In this scenario we interact only with the HDFS component of Hadoop; there is no interaction with Hive, MapReduce, or any other component of the Hadoop ecosystem.


Architecture and requirements

Initially, with LASR (the previous generation of the SAS In-Memory engine), in order to use the SASHDAT format you had to follow strict co-location rules between the SAS In-Memory cluster and the Hadoop cluster components:

  • The SAS In-Memory root node had to be located on the same machine as the Hadoop NameNode,
  • And each SAS worker node had to co-exist on the same machine as a Hadoop data node.


This model was also often called “symmetric deployment”.

 

Of course this “strictly co-located or symmetric” topology remains available with CAS and is probably the most efficient from a performance perspective, as HDFS blocks are transferred directly between memory and disk locally on each node, without having to travel across the network.

 

 
However, to address various use cases and specific customer requests, these requirements were relaxed for CAS in the SAS Viya platform (3.1). It was already possible to work with SASHDAT files in HDFS even if you were not using a strictly co-located topology:


  • The CAS Controller node could be located outside of the Hadoop cluster
  • The list of CAS worker machines could be a subset of the Hadoop workers.

 

With the latest SAS Viya version (3.2), you can completely “disjoin” the CAS and Hadoop clusters and continue to use the SASHDAT on HDFS file format to efficiently store and load your CAS tables to and from a remote HDFS cluster. There is no need to license any data connector or Connect Accelerator, as CAS relies on the SAS Plug-ins for Hadoop, which are included with the software.

 

 
This scenario is supported with the following Hadoop versions:

 

  • Cloudera CDH 5.8 and later releases
  • Hortonworks HDP 2.5 and later releases
  • MapR 5.1 and later releases
  • Apache Hadoop 0.23, 2.4.0, and 2.7.1 and later versions

The key requirements are:

  • Installation and configuration of the SAS Plug-ins for Hadoop® on each node of the remote Hadoop cluster.
  • The OS user ID of the CAS session process that executes the loadTable or save action (to or from an HDFS caslib) must have password-less SSH access from each CAS node to each Hadoop node.


Whereas the first requirement is nicely documented in the deployment guide (Appendix E: Hadoop Deployment: Configuring CAS SASHDAT Access to HDFS), I thought it could be useful to explain a little bit more about this "Password-less SSH" requirement.


Password-less SSH requirements

As we are not in a co-located deployment model, our CAS workers will have to talk, across the network, with our Hadoop data nodes. The communication channel used for that is SSH. This kind of connection requires authentication, and username/password authentication is generally the default.

 

As we need seamless communications, we cannot afford to be prompted for a password on each connection. To avoid that, we have to ensure that, for our CAS users, every SSH connection from any CAS node to any Hadoop node can be made password-less.

 

The first thing to ensure is that the CAS user account(s) exist in both the CAS and Hadoop clusters. Typically the two clusters are connected to the same LDAP. Otherwise, if you are using local accounts, you might have to create your CAS users in the Hadoop cluster.
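
A quick way to verify this (a simple sketch, reusing the example host names introduced later in this article and a hypothetical "sasdemo" account):

# Check that the account is known on every CAS and Hadoop node
for h in sascas0{1..4} sashdp0{1..4}; do
  ssh -q $h "echo -n '$h: '; id sasdemo 2>/dev/null || echo 'account missing'"
done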

 

The password-less SSH mechanism itself can then be achieved:

  • either with RSA public key authentication (generating the source machine account’s public keys, distributing them to the target machines, and adding them to the authorized keys) - see the sketch just after this list,
  • or, for secure mode Hadoop, through GSSAPI with Kerberos.
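
For the RSA public key approach, the setup is the classic key generation and distribution dance. A minimal sketch for one account on one CAS node (to be repeated for each relevant account and CAS node; host names are the examples used below):

# As the account that will run the CAS session, generate a key pair (no passphrase)
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# Distribute the public key to every Hadoop node (you will be prompted for the password once per node)
for h in sashdp0{1..4}; do
  ssh-copy-id -i ~/.ssh/id_rsa.pub $h
done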


Test it

OK, so, whatever the password-less mechanism is, we need to ensure that we can "ssh" from any CAS node to any Hadoop node. For example, imagine our CAS node machines are sascas01, sascas02, sascas03 and sascas04, and the Hadoop machines are sashdp01, sashdp02, sashdp03 and sashdp04; then the little script below can be used to ensure that all communications are working without asking for any password:

for c in sascas0{1..4}; do ssh -q -o StrictHostKeyChecking=no $c "echo '--From:'; hostname; echo '--Password-less SSH as' \`whoami\` 'to:'; for h in sashdp0{1..4}; do ssh -q -o StrictHostKeyChecking=no \$h 'hostname'; done"; done

 

For which account?

If the script prints a report with all the “From/To” combinations without prompting for any password, then the test is successful... for the user account that you used to run it.

 

So now the question is: for which user account must this password-less authentication be set up and tested? The short answer is: “for the OS user ID that the CAS session runs under”.

 

So actually, it depends. With the current version of SAS Viya, visual interface logins come through OAuth, and the CAS session then runs under the “cas” account. For SAS Studio, the login comes through username and password, and sessions are started as the logged-in user.

 

So if you are working with SASHDAT tables from Visual interfaces (VA, VDB, EV), then the account that needs password-less SSH access will be “cas”.

But if you are using SAS Studio to perform your operations with SASHDAT, then you will need to enable password-less SSH access for the account used to log on to SAS Studio (for example, “sasdemo”).
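
If you want a quick spot check for the “cas” service account, something like the following can help (a sketch, assuming the “cas” account has a login shell and that you have sudo rights on the CAS nodes; BatchMode makes ssh fail instead of prompting):

# Run on a CAS node: verify that "cas" can reach a Hadoop node without any password prompt
sudo -u cas ssh -o BatchMode=yes sashdp01 hostname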

 

Configuration and Validation

Once the password-less SSH mechanism is in place for the required accounts, the configuration is pretty straightforward. It is also documented in the installation and configuration guide. When deploying CAS on machines completely separate from the HDFS machines, we need to revise the variable values (in the vars.yml file) as follows:

CAS_CONFIGURATION:
  env:
    HADOOP_NAMENODE: namenode-host-name
    HADOOP_HOME: location-of-your-Hadoop-home-directory-on-the-HDFS-server
    CAS_ENABLE_REMOTE_SAVE: 1
  cfg:
    colocation: 'hdfs'
    mode: 'mpp'


Once that is done, run the SAS Viya Ansible playbook to put all the configuration in place to support remote SASHDAT access.
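
In a typical SAS Viya 3.2 deployment, that means re-running the playbook from the playbook directory, for example (the directory location below is just an illustration):

# Re-run the SAS Viya playbook so that the updated vars.yml is applied to the CAS configuration
cd /sas/install/sas_viya_playbook   # wherever your playbook directory lives
ansible-playbook site.yml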

Finally, you can validate that the remote SASHDAT configuration is correct by saving a CAS table in the remote HDFS with a code sample like this:

 

cas mysession;
/* Define an HDFS caslib that points to the target path on the remote HDFS cluster */
caslib testhdat datasource=(srctype="hdfs") path="/sastest";
proc casutil;
   /* Load a sample table into CAS */
   load data=sashelp.zipcode;
   /* Save it as a SASHDAT file in the HDFS caslib */
   save casdata="zipcode" replace;
   /* Reload the SASHDAT file from HDFS into memory (output table placed in the casuserhdfs caslib) */
   load casdata="zipcode.sashdat" outcaslib="casuserhdfs" casout="working_zipcode";
run;
quit;
cas mysession terminate;

   

 

Troubleshooting

I know I am repetitive, but I really think the "troubleshooting" section is probably the most useful part of this kind of blog 🙂 so I'd like to share two issues I faced with this kind of setup. Unfortunately, in both cases we see the same generic error message:

 

ERROR: Could not connect to the Hadoop cluster.
ERROR: The action stopped due to errors.

 

SAS Plug-ins script permissions (with Hortonworks HDP 2.5)

 

During the remote SASHDAT load, two scripts (which are part of the SAS Plug-ins for Hadoop component deployed on the remote Hadoop cluster) are executed: "start-namenode-cas-hadoop.sh" and "start-datanode-cas-hadoop.sh".

 

Under specific conditions (a reboot, for example), the execute permissions on these scripts can be lost, and they must be restored to be able to perform the remote SASHDAT load.

 

So it is always a good idea to check the permissions with a command like:

 

# for h in sashdp0{1..4};do ssh $h "hostname;ls -l /usr/hdp/current/hadoop-client/bin/start*cas*sh";done
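
And if the execute bit is indeed missing, a similar loop can restore it (assuming root or sudo access on the Hadoop nodes; the HDP path is shown, adapt it to your distribution):

# Restore execute permissions on the SAS plug-in scripts on every Hadoop node
for h in sashdp0{1..4}; do
  ssh $h "chmod 755 /usr/hdp/current/hadoop-client/bin/start*cas*sh"
done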

 

Missing JAVA_HOME (with Cloudera 5.10)

 

The other issue was observed with the SAS Plug-ins for Hadoop deployed as a parcel in Cloudera. We had the same generic error message, but nothing else. So a great tip when debugging the SAS Plug-ins for Hadoop (even in co-located mode) is to open one of the two scripts discussed above and activate a trace.

 

$ cat start-namenode-cas-hadoop.sh
#!/bin/sh
if [ "$1" ];
then export HADOOP_HOME=$1
fi
cd $HADOOP_HOME
if [ "$HADOOP_CAS_STDERR_LOG" = "" ]
  then export HADOOP_CAS_STDERR_LOG=/dev/null
fi
bin/hadoop 2>$HADOOP_CAS_STDERR_LOG com.sas.cas.hadoop.NameNodeService

 

As you can see, there is a HADOOP_CAS_STDERR_LOG variable that you can redirect to a temporary location (replace /dev/null with a file location) to see exactly what the issue is.
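
In practice, that means temporarily editing the script on the Hadoop nodes (before the "if" block that defaults the variable to /dev/null), along these lines (the log location is just an example):

# Temporary debugging line added near the top of start-namenode-cas-hadoop.sh / start-datanode-cas-hadoop.sh
export HADOOP_CAS_STDERR_LOG=/tmp/cas-hadoop-plugin.log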

 

So once we enabled this trace, we saw an error message complaining about a missing JAVA_HOME!

 

And indeed, checking with "which java" and "echo $JAVA_HOME", we soon realized that our user did not have any way to call Java (which is required to launch the command in the start-namenode-cas-hadoop.sh script...).

 

Cloudera uses a specific "cloudera"-named path for its JDK, which was not available in the script execution context.

 

As a consequence, we amended all the "start-namenode-cas-hadoop.sh" and "start-datanode-cas-hadoop.sh" scripts on the Hadoop nodes by explicitly adding a line:

 

export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
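
If you have many data nodes, a small loop can push that line into both scripts everywhere; a sketch, assuming password-less root SSH and that the scripts live under the path below (adapt it to wherever the plug-ins were deployed on your cluster):

# Insert the JAVA_HOME export right after the shebang of both plug-in scripts on every Hadoop node
for h in sashdp0{1..4}; do
  for s in start-namenode-cas-hadoop.sh start-datanode-cas-hadoop.sh; do
    ssh $h "sed -i '2i export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera' /opt/cloudera/parcels/CDH/lib/hadoop/bin/$s"
  done
done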

 

And then the remote SASHDAT load worked like a charm!

 

Finally, during your troubleshooting, when you perform the remote SASHDAT load you might notice messages like the one below in your CAS controller logs:

 

/opt/sas/viya/home/SASFoundation/utilities/bin/start-namenode-cas-hadoop.sh: line 7: cd: /opt/cloudera/parcels/CDH/lib/hadoop: No such file or directory

 

Don't follow this track: this message is expected in Viya 3.2 with the remote HDFS topology (as the Hadoop binaries are not present on the CAS machines) and is not at all harmful. You can simply ignore it.

 

A few other things to know

 

  • You can store your SASHDAT-on-HDFS files on only one Hadoop cluster per CAS cluster instance.
  • All HDFS caslibs in the server rely on the same server environment variables, which identify the HDFS NameNode we are connected to.
  • We don't need any HDFS-related ports open to CAS in this case (including our service ports 15343 and 15452), just the sshd port (22 by default).
  • Performance: could be good. It should be roughly equivalent to the SAS EP (Embedded Process), as CAS nodes receive remote blocks in parallel; any difference is more related to the speed of Hive/MapReduce versus direct HDFS access. (This is beyond the scope of this blog and will be a topic I'll look at in later materials.)

Words of conclusion

The fact that this mode of deployment works and is supported does NOT necessarily mean it is something you should implement in every “CAS MPP with remote Hadoop cluster” environment.

 

Hive is still the de facto standard for Hadoop data warehouses, so your customer will very likely ask you to work with Hive tables, which can then be used by other industry-standard third-party tools.

Another important point is that using Hive generates MapReduce or Tez tasks that run under the control of YARN (which is not the case when you perform SASHDAT save/load operations).

 

However, depending on future performance feedback, and assuming the customer accepts the extra configuration work (SSH keys), this deployment case might address some specific customer use cases or be a complement to Hive data access.

 

Thank you for reading!
