
Remote access to SASHDAT (on HDFS), a new Hadoop scenario in SAS Viya

by SAS Employee RPoumarede on ‎06-23-2017 05:27 AM (556 Views)

Attentive readers might have noticed that the Deployment Guide for the SAS Viya 3.2 platform (aka the 17w12 ship) describes a new “Hadoop scenario” called “Remote Access to HDFS”.

 
A fully accurate name for this deployment case would actually be "Remote Access to SASHDAT (on HDFS)". The purpose of this blog is to introduce this new deployment scenario and discuss the associated requirements and impacts.


Scope and Background


 
This new capability is only about SASHDAT files on HDFS. As a reminder, SASHDAT is a SAS proprietary format that is optimized for SAS In-Memory Solutions (it already existed for LASR, but has been updated and improved for CAS, the SAS Viya In-Memory Analytics engine).

This format is an optimized “on-disk” representation of the “in-memory” table. The key benefit is that it allows you to save your CAS table on disk and to quickly reload a CAS table from disk. The main limitation is that the SASHDAT format can only be used by SAS High-Performance engines.

With CAS you can store SASHDAT files:

  • either on your CAS Controller local file system - datasource=(srctype="path")
  • or on an NFS share visible to all your CAS workers - datasource=(srctype="dnfs")
  • or on the Hadoop Distributed File System (HDFS), across your Hadoop data nodes - datasource=(srctype="hdfs")


This article will focus on the last case, which corresponds to this new deployment scenario. In this scenario we only interact with the HDFS component of Hadoop; there is no interaction with Hive, MapReduce or any other component of the Hadoop ecosystem.


Architecture and requirements

Initially, with LASR (previous generation of the SAS In-Memory engine), in order to use SASHDAT format you had to follow strict co-location rules between the SAS In-memory cluster and the Hadoop cluster components:

  • The SAS In-Memory root node had to be located on the same machine as the Hadoop Name node,
  • And each SAS worker node had to co-exist with a Hadoop Data node on the same machine.


This model was also often called “symmetric deployment”.

 

Of course this “strictly co-located or symmetric” topology remains available with CAS and is probably the most efficient from a performance perspective, as HDFS blocks are transferred directly from memory to disk (and vice versa), locally on each node, without having to travel across the network.

 

 
However, to address various use cases and specific customer requests, these requirements were relaxed for CAS in the SAS Viya platform (3.1). It was already possible to work with SASHDAT files in HDFS even if you were not using a strictly co-located deployment:


  • The CAS Controller node could be located outside of the Hadoop cluster
  • The list of CAS worker machines could be a subset of the Hadoop workers.

 

With the latest SAS Viya version (3.2), you can completely “disjoin” the CAS and Hadoop clusters and still use the SASHDAT on HDFS file format to efficiently save and load your CAS tables to and from a remote HDFS cluster. There is no need to license any Data Connector or Connect Accelerator, as CAS relies on the SAS Plug-ins for Hadoop®, which are included with the software.

 

 
This scenario is supported with the following Hadoop versions:

 

  • Cloudera CDH 5.8 and later releases
  • Hortonworks HDP 2.5 and later releases
  • MapR 5.1 and later releases
  • Apache Hadoop 0.23, 2.4.0, and 2.7.1 and later versions

The key requirements are:

  • Installation and configuration of the SAS Plug-ins for Hadoop® on each node of the remote Hadoop cluster.
  • The OS user ID of the CAS session process that executes LoadTable (from an HDFS caslib) or Save (to an HDFS caslib) must have password-less SSH access from each CAS node to each Hadoop node.


Whereas the first requirement is nicely documented in the deployment guide (Appendix E: Hadoop Deployment: Configuring CAS SASHDAT Access to HDFS), I thought it could be useful to explain a little bit more about this "Password-less SSH" requirement.


Password-less SSH requirements

As we are not in a co-located deployment model, our CAS Workers will have to talk, across the network, with our Hadoop Data nodes. The communication channel used for that is SSH. This kind of connection requires authentication, and username/password authentication is generally the default.

 

As we need seamless communications, we cannot afford to be prompted for a password on each connection. To avoid that, we have to ensure that our CAS users can make a "password-less" SSH connection from any CAS node to any Hadoop node.

 

The first thing to ensure is that the CAS user account(s) exist in both the CAS and Hadoop clusters. Typically the two clusters are connected to the same LDAP. Otherwise, if you are using local accounts, you might have to create your CAS users in the Hadoop cluster.

 

Password-less SSH itself can then be achieved:

  • either with RSA public key authentication (generating a key pair for each account on the source machines, distributing the public keys to the target machines and adding them to the authorized keys)
  • or, for Secure mode Hadoop, through GSSAPI with Kerberos.
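
To make the first option more concrete, here is a minimal sketch of the key distribution for one account, using the standard OpenSSH tools ssh-keygen and ssh-copy-id. The host names, the key location and the DRY_RUN switch are assumptions made for this example; by default the function only prints the commands it would run.

```shell
# Sketch only: host names are hypothetical examples, adapt them to your site.
HADOOP_NODES="${HADOOP_NODES:-sashdp01 sashdp02 sashdp03 sashdp04}"
KEY_FILE="${KEY_FILE:-$HOME/.ssh/id_rsa}"

# Print each command instead of running it unless DRY_RUN=0 is set.
# (An empty argument such as the passphrase is not visible in the printout.)
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Run this as the account that needs password-less access, on each CAS node.
setup_keys() {
  # Create an RSA key pair with an empty passphrase if none exists yet.
  if [ ! -f "$KEY_FILE" ]; then
    run ssh-keygen -q -t rsa -b 4096 -N "" -f "$KEY_FILE"
  fi
  # ssh-copy-id appends the public key to ~/.ssh/authorized_keys on the
  # target host; it asks for the account password one last time.
  for h in $HADOOP_NODES; do
    run ssh-copy-id -i "$KEY_FILE.pub" "$h"
  done
}
```

Calling setup_keys shows the planned commands; DRY_RUN=0 setup_keys executes them. If the account's home directory is shared across the CAS nodes, one key pair may be enough; otherwise the function has to be run on every CAS node.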


Test it

OK, so whatever the password-less mechanism is, we need to ensure that we can "ssh" from any CAS node to any Hadoop node. For example, imagine our CAS node machines are sascas01, sascas02, sascas03 and sascas04, and the Hadoop machines are sashdp01, sashdp02, sashdp03 and sashdp04. The little script below can then be used to check that all communications work without asking for a password:

# Hop to each CAS node, then from there to each Hadoop node (the brace expansion assumes bash on both ends).
for c in sascas0{1..4}; do ssh -q -o StrictHostKeyChecking=no "$c" "
echo '--From:'; hostname
echo '--Password-less SSH as ' \`whoami\` 'to:' ; for h in sashdp0{1..4} ; do ssh -q -o StrictHostKeyChecking=no \$h 'echo \`hostname\` ' ;done"; done
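
If you would rather get an explicit pass/fail report than watch for a password prompt, a variant along the following lines might help. It is only a sketch: it is meant to be run on one CAS node as the account under test, the host list is the hypothetical one from the example above, and it relies on the standard OpenSSH option BatchMode=yes, which makes ssh fail instead of prompting.

```shell
# Sketch only: host names are hypothetical examples, adapt them to your site.
HADOOP_NODES="${HADOOP_NODES:-sashdp01 sashdp02 sashdp03 sashdp04}"

# Report OK or FAILED for each Hadoop node; return 1 if any node failed.
check_passwordless() {
  failed=0
  for h in $HADOOP_NODES; do
    # BatchMode=yes disables password prompts, so a missing key is an error.
    # stdin is redirected so that ssh cannot swallow the caller's input.
    if ssh -q -o BatchMode=yes -o StrictHostKeyChecking=no \
           -o ConnectTimeout=5 "$h" true </dev/null; then
      echo "OK     $h"
    else
      echo "FAILED $h"
      failed=1
    fi
  done
  return "$failed"
}
```

The return code makes it easy to use the check in a larger script, for example check_passwordless || echo "fix SSH access first".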

 

For which account?

If the script prints a report with all the “From/To” combinations without prompting for a password, then the test is successful... for the user account that you used to run it.

 

So now the question is: for which user account must this password-less authentication be set up and tested? The short answer is: “for the OS user ID that the CAS session runs under”.

 

So actually it depends. With the current version of SAS Viya, logins to the Visual interfaces go through OAuth, and the CAS session then runs under the “cas” account. For SAS Studio, the login uses a username and password, and sessions are started as the logged-in user.

 

So if you are working with SASHDAT tables from the Visual interfaces (VA, VDB, EV), then the account that needs password-less SSH access is “cas”.

But if you are using SAS Studio to perform your operations with SASHDAT, then you will need to enable password-less SSH access for the account used to log on to SAS Studio (for example, “sasdemo”).
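
The rule above can be summed up in a few lines of shell. This is just an illustration of the logic: the function name and the "visual" and "studio" labels are made up for this example and are not SAS options.

```shell
# Sketch only: the labels "visual" and "studio" are invented for this example.
# Prints the OS account that needs password-less SSH to the Hadoop nodes.
session_account() {
  interface="$1"   # "visual" (VA, VDB, EV) or "studio" (SAS Studio)
  login_user="$2"  # the user who logged on, for example sasdemo
  case "$interface" in
    visual) echo "cas" ;;           # OAuth login: the session runs as "cas"
    studio) echo "$login_user" ;;   # the session runs as the logged-in user
    *) echo "unknown interface: $interface" >&2; return 1 ;;
  esac
}
```

For example, session_account studio sasdemo prints sasdemo, which is then the account to use when you run the SSH test.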

 

Configuration

Once the password-less SSH mechanism is in place for the required accounts, the configuration is pretty straightforward. It is also documented in the installation and configuration guide. When deploying CAS on machines completely separate from the HDFS machines, we need to revise the variable values (vars.yml file) as follows:

CAS_CONFIGURATION:
   env:
     HADOOP_NAMENODE: namenode-host-name
     HADOOP_HOME: location-of-your-Hadoop-home-directory-on-the-HDFS-server
     CAS_ENABLE_REMOTE_SAVE: 1
   cfg:
     colocation: 'hdfs'
     mode: 'mpp'


Once this is done, run the SAS Viya Ansible playbook to put in place all the configuration needed to support remote SASHDAT access.
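
Before launching the playbook, a quick sanity check of vars.yml can catch a forgotten setting. The sketch below is not part of the SAS tooling; it simply greps for the settings listed above on non-commented lines, and the function name is made up for this example.

```shell
# Sketch only: checks that the settings from this article are present
# (and not commented out) in the given vars.yml file.
check_remote_hdfs_vars() {
  vars_file="${1:-vars.yml}"
  missing=0
  for key in HADOOP_NAMENODE HADOOP_HOME CAS_ENABLE_REMOTE_SAVE colocation; do
    # Match "key: value" at the start of a line, after optional indentation.
    if ! grep -Eq "^[[:space:]]*${key}:[[:space:]]*[^[:space:]#]" "$vars_file"; then
      echo "missing or empty: $key"
      missing=1
    fi
  done
  return "$missing"
}
```

A possible usage is check_remote_hdfs_vars vars.yml && ansible-playbook site.yml, assuming site.yml is the playbook name in your deployment; use the exact command given in your deployment guide.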

 

A few other things to know

  • You can store your SASHDAT files on HDFS on only one Hadoop cluster per CAS cluster instance.

    • All HDFS CASLibs in the server will rely on the same server environment variables which identify the HDFS namenode that we are connected to.
  • We don't need any HDFS-related ports open to CAS in this case (including our service ports 15343 and 15452), just the sshd port (22 by default).

  • Performance: it could be good. It should be roughly equivalent to SAS EP, as CAS nodes will receive remote blocks in parallel. The difference is more related to the speed of Hive/MapReduce vs HDFS. (This is beyond the scope of the current blog and is a topic I'll look at in later materials.)
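
Since the sshd port is the only one required, a firewall check from a CAS node can be reduced to something like the sketch below. It assumes that a netcat (nc) supporting the -z and -w options is installed, and it reuses the hypothetical host names from the SSH test.

```shell
# Sketch only: host names are hypothetical examples, adapt them to your site.
HADOOP_NODES="${HADOOP_NODES:-sashdp01 sashdp02 sashdp03 sashdp04}"
SSH_PORT="${SSH_PORT:-22}"

# Report whether the sshd port accepts connections on each Hadoop node;
# return 1 if at least one node is not reachable.
check_ssh_port() {
  unreachable=0
  for h in $HADOOP_NODES; do
    # nc -z only tests the connection; -w 3 limits the wait to 3 seconds.
    if nc -z -w 3 "$h" "$SSH_PORT" >/dev/null 2>&1; then
      echo "open    $h:$SSH_PORT"
    else
      echo "closed  $h:$SSH_PORT"
      unreachable=1
    fi
  done
  return "$unreachable"
}
```

An open port only tells you that the network path is fine; the password-less login itself still has to be tested per account.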

Words of conclusion

The fact that this mode of deployment works and is supported does NOT necessarily mean it is something you should implement in a “CAS MPP with remote Hadoop cluster” environment.

 

Hive is still the “de facto” standard for Hadoop data warehouses, so your customer will very likely ask you to work with Hive tables that can then be used by other industry-standard third-party tools.

Another important point is that using Hive generates MapReduce or Tez tasks that can be placed under the control of YARN (which is not the case when you perform SASHDAT save/load operations).

 

However, depending on future performance feedback, and assuming the customer accepts the extra configuration work (SSH keys), this scenario might address some specific customer use cases or be a complement to Hive data access.

 

Thank you for reading!
