The fourth maintenance release of SAS 9.4 includes the latest release of SAS Data Loader for Hadoop. The 3.1 release of SAS Data Loader for Hadoop brings some major changes in architecture and deployment, which can be summarized as providing greater integration of SAS Data Loader for Hadoop with the SAS 9.4 platform. In this article we'll look at what these changes mean for using SAS Data Loader for Hadoop 3.1 with a secured Hadoop cluster.
Why Kerberos?
So why do we need to think about Kerberos in relation to SAS Data Loader for Hadoop? The majority of organizations secure their Hadoop Clusters using Kerberos, which enforces strong authentication for any application connecting to the cluster. Such strong authentication is the foundation of any form of access control or permissions model. Therefore, if an organization wants to use SAS Data Loader for Hadoop to interact with its Hadoop Cluster, it needs to use Kerberos.
Using Kerberos authentication when accessing the Hadoop Cluster has implications for the configuration of the SAS Data Loader for Hadoop environment: the correct Kerberos credentials must be available to the SAS Data Loader for Hadoop application whenever it interacts with the Hadoop Cluster.
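To make this concrete, here is a minimal sketch of what a Kerberos-secured Hive connection looks like from a SAS session. The server name, realm, and paths are hypothetical placeholders; note the absence of USER= and PASSWORD= options, because the connection authenticates with the end-user's Kerberos ticket.

```sas
/* Point SAS at the Hadoop client JARs and cluster configuration files */
/* (hypothetical paths; adjust for your deployment).                   */
options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoop/jars";
options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoop/conf";

/* Connect to a Kerberos-secured HiveServer2. No USER= or PASSWORD=   */
/* is supplied; authentication uses the user's Kerberos ticket cache. */
libname hdp hadoop server="hive.example.com" port=10000
        hive_principal="hive/_HOST@EXAMPLE.COM";
```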
Impact of Architecture Changes
To understand how Kerberos impacts the configuration of SAS Data Loader for Hadoop we first need to examine the architecture of the solution. With SAS Data Loader for Hadoop 3.1 the web application is now integrated into the SAS 9.4 platform, which means it is deployed to the SAS Web Application Server along with the other SAS web applications. In most deployment plans it also sits behind the SAS Web Server, which operates as a proxy for the SAS Data Loader web application. The SAS Data Loader web application is deployed to SASServer15, so the configuration automatically includes multiple managed servers: a SAS Data Loader for Hadoop deployment will have SASServer1, SASServer2, and SASServer15.
This addresses the Middle-Tier components; on the SAS Compute Tier we have an additional PostgreSQL instance and the SAS Code Accelerator. Finally, the following components are deployed to the Hadoop Cluster: the SAS Embedded Process, the SAS Quality Knowledge Base, and the SAS Data Loader for Hadoop Spark Engine. The complete set of components is illustrated below:
From the diagram it is clear that, in addition to the SAS components, we also have the Hadoop client JARs and database client software deployed to both the Middle-Tier and the Compute Tier. The reason these components are needed on both tiers will become clear when we examine how some of the processing takes place. The diagram above also shows some optional additional components: SAS Data Integration Studio and SAS LASR Analytic Server can be integrated with SAS Data Loader for Hadoop if those SAS offerings have also been licensed.
The integration options between SAS Data Integration Studio and SAS Data Loader for Hadoop 3.1 are new with this release. End-users can now add Data Loader Directives to SAS Data Integration Studio flows and then submit these flows to the customer's chosen scheduling engine, so the Data Loader Directives can be built into the customer's standard DI scheduled activities. Equally, the integration with the SAS LASR Analytic Server enables end-users to use Data Loader Directives to load data from Hadoop into the SAS LASR Analytic Server; this integration has been available in previous releases. Combined, these two additional items help to provide a truly enterprise-class solution for managing big data.
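For orientation, the sketch below shows a hand-coded equivalent of loading a Hive table into a running SAS LASR Analytic Server; the actual directive generates its own code and may use a different (parallel) path. The host names, port, and table name are hypothetical.

```sas
/* Register the Kerberos-secured Hive schema (hypothetical host/realm). */
libname hdp hadoop server="hive.example.com" port=10000
        hive_principal="hive/_HOST@EXAMPLE.COM";

/* Load a Hive table into a LASR server listening on port 10010. */
proc lasr add data=hdp.customers port=10010;
   performance host="lasr.example.com" nodes=all;
run;
```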
To understand the placement of components and the requirements for Kerberos configuration, we identify two common use cases in which the SAS Data Loader interacts with the Hadoop Cluster. In the first use case, we start with a serial load of data. This data could be a SAS data set or anything we can access with SAS/ACCESS from the SAS Workspace Server. This serial data load is illustrated here:
With the following steps:
This is one of the simpler cases. We can see that if the Hadoop Cluster is secured by Kerberos, the SAS Workspace Server needs access to the end-user's Kerberos credentials; otherwise the SAS/ACCESS to Hadoop connections to HDFS and Hive will fail.
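As an illustration, here is what such a serial load might look like if hand-coded in the SAS Workspace Server session; the server names are hypothetical, and the code only succeeds if the session holds a valid Kerberos ticket for the end-user.

```sas
/* Serial load: copy a SAS data set into Hive via SAS/ACCESS to Hadoop. */
/* The connection authenticates with the end-user's Kerberos ticket;    */
/* without it, both the HDFS and Hive connections fail.                 */
libname hdp hadoop server="hive.example.com" port=10000
        hive_principal="hive/_HOST@EXAMPLE.COM";

data hdp.class;       /* Creates a Hive table from a SAS data set */
   set sashelp.class;
run;
```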
The second use case is the parallel load of data leveraging Oozie and Sqoop. This is illustrated here, with the following steps:
Now we can see the importance of having the components deployed to the Middle-Tier as well as the Compute Tier. The SAS Data Loader web application makes direct connections to the Hadoop Cluster, bypassing the Compute Tier altogether. So not only does the SAS Workspace Server need access to the end-user's Kerberos credentials; the Middle-Tier does as well.
These are just two relatively simple use cases. If you want to better understand the processing performed by SAS Data Loader for Hadoop 3.1, please consult the official documentation.
Requirements
Now that we have seen how the SAS Data Loader web application interacts with the Hadoop Cluster, we can start to understand the Kerberos requirements for the environment. We need:
- the end-user's Kerberos credentials to be available to the SAS Data Loader web application on the Middle-Tier;
- the end-user's Kerberos credentials to be available to the SAS Workspace Server on the Compute Tier;
- Kerberos delegation, so those credentials can flow between the tiers and on to the Hadoop Cluster.
Therefore, we need to configure the SAS environment where the SAS Data Loader for Hadoop is deployed for complete Kerberos authentication with delegation. This means that all of the SAS tiers need to be configured for Kerberos authentication.
We need to ensure the SAS Compute Tier is able to handle Kerberos authentication. This requires planning and input from the IT team, since there are various options for configuring Kerberos authentication depending on the operating system and the Kerberos technology being used. To keep things simple, for Linux-based deployments it is recommended to use the components provided with the operating system to integrate with Microsoft Active Directory, which can be configured to provide Kerberos authentication for users.
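Once the Compute Tier is configured, a quick way to check that a SAS session can reach the secured cluster is a small HDFS operation that relies purely on the user's Kerberos ticket. This is only an illustrative smoke test; the directory path is hypothetical, and the SAS_HADOOP_* environment variables are assumed to be set as sketched earlier.

```sas
/* Smoke test: if Kerberos is correctly configured on the Compute Tier, */
/* these HDFS operations succeed without USERNAME= or PASSWORD= options. */
proc hadoop verbose;
   hdfs mkdir="/user/sasdemo/kerberos_check";   /* create a scratch dir */
   hdfs delete="/user/sasdemo/kerberos_check";  /* and clean it up      */
run;
```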
When configuring the Middle-Tier it is recommended to use the SAS Fallback Authentication module (more details will be provided in a future article). Making use of the Fallback Module simplifies the files we need to edit to configure authentication.
As always with the configuration of Kerberos for the SAS environment, the majority of the work is in the prerequisites: creating the service accounts, registering Service Principal Names, and creating keytabs correctly is where the bulk of the effort is required. Since SAS Data Loader for Hadoop 3.1 requires Kerberos delegation not only from the Middle-Tier to the Compute Tier, but also from both tiers to the Hadoop Cluster, special attention should be paid to ensuring the service accounts are correctly configured.
More Information
If you would like to explore the configuration of SAS Data Loader for Hadoop further, please consult the official documentation.