Is your organization embracing Hadoop technology? Do you need to understand what Hadoop is and figure out how you can use SAS with Hadoop? If so, this 3-part article series offers key information to help you start working with Hadoop as a SAS user, a brief overview of the SAS technologies available for Hadoop, and a look at some of the training courses available for each.
A Quick Hadoop Primer
To begin working with Hadoop, it's helpful to consider what it has in common with, and how it differs from, a Microsoft Windows, UNIX, or Linux environment. Like those environments, Hadoop has a file system, with various applications that read and write files on that file system and process data in memory. As in other file systems, you can store all kinds of file formats in Hadoop: delimited text files, XML files, JSON files, and various binary formats designed for specific applications.
Unlike those environments, though, Hadoop does not run on a single machine. Hadoop consists of software installed on multiple machines that communicate across a network so that they act as a single unified system. The collection of machines running Hadoop is often referred to as the Hadoop cluster. Any given file you create in Hadoop is distributed, that is, spread across the machines in the cluster so that each machine holds a part of the file. When you use applications and run programs to process that distributed data, each machine does part of the work on its share of the data, and data moves among the machines as needed. Hadoop manages all of this automatically so that different parts of the data can be processed simultaneously (parallel processing), which lets you process large amounts of data in a reasonable amount of time.

Because the complexities of distributed storage and parallel processing are handled automatically, programmers and application users can continue to work in a familiar way: naming input and output files and defining the required data management and analytical operations with methods and logic similar to those they use in single-machine environments.
Hive is one key Hadoop application that will be of interest to many SAS users. With Hive you can use an SQL-like language to define, manage, and query data in Hadoop, with the data itself stored in the Hadoop distributed file system. Hive is widely adopted because it lets people use a familiar SQL-based language for managing and querying data stored in the Hadoop file system.
SAS and Hadoop
Base SAS and Hadoop: If you use SAS on Windows, Linux, or UNIX, you may use a DATA step, PROC IMPORT, PROC EXPORT, or other methods to read or write text files on the file system. With the same techniques, you can also read and write files in the Hadoop Distributed File System (HDFS). Compare the two programs below. The first reads a comma-delimited text file from a Windows file system and writes the output to a fixed-column text file on the same file system. The second program performs the identical read and write operations on files stored in HDFS.
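A minimal sketch of such a pair of programs follows; the file paths, input layout, variable names, and Hadoop configuration file are illustrative assumptions, not the article's original code:

```sas
/* Program 1: read a comma-delimited file from the Windows file system
   and write a fixed-column text file back to it (paths illustrative). */
data _null_;
   infile 'C:\data\sales.csv' dsd firstobs=2;
   input region :$10. product :$12. amount;
   file 'C:\data\sales_fixed.txt';
   put @1 region $10. @12 product $12. @25 amount 8.2;
run;

/* Program 2: the identical read and write against HDFS, using the
   FILENAME statement's HADOOP access method in Base SAS. The HDFS
   paths and configuration file are illustrative. */
filename in  hadoop '/user/sasdemo/sales.csv'       cfg='C:\hadoop\conf.xml';
filename out hadoop '/user/sasdemo/sales_fixed.txt' cfg='C:\hadoop\conf.xml';

data _null_;
   infile in dsd firstobs=2;
   input region :$10. product :$12. amount;
   file out;
   put @1 region $10. @12 product $12. @25 amount 8.2;
run;
```

The DATA step logic is unchanged; only the file references differ.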
Very little differs in the program statements when reading and writing HDFS files versus files local to the SAS server. Such read/write operations on files in Hadoop can be a useful technique in certain applications and, as part of Base SAS, require no extra software licensing. However, there are two limitations to these Base SAS methods. First, the data must move between Hadoop and SAS because the DATA step executes in SAS. Second, the DATA step does not perform parallel processing. With the large amounts of data in Hadoop, it is often necessary to design applications that process the data in parallel within Hadoop. But take heart and read on, as I will describe SAS methods that allow you to do just that.
SAS/ACCESS and Hadoop: SAS/ACCESS Interface software is available for a number of database management systems (for example, Oracle and Teradata). For each database, the SAS/ACCESS technologies provide two methods for SAS users to read and write data in database tables. The first method, SQL pass-through, allows users to embed native database SQL queries in the SAS SQL procedure, send those native SQL statements to the database for execution, and have the results returned to SAS. The second method defines a library (LIBNAME) connection to the database, which allows users to reference database tables as if they were SAS data sets in any SAS program step. With this method, SAS automatically generates native database SQL so that the tables are queried in a language the database understands.
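As a sketch, the two methods might look like this for Oracle; the connection options, credentials, and table names here are illustrative assumptions:

```sas
/* Method 1: SQL pass-through. The inner query is native Oracle SQL,
   executed by Oracle; only the result set returns to SAS. */
proc sql;
   connect to oracle (user=scott password=XXXX path=orapath);
   create table work.big_orders as
   select * from connection to oracle
      (select customer_id, sum(amount) as total
         from orders
        group by customer_id
       having sum(amount) > 10000);
   disconnect from oracle;
quit;

/* Method 2: LIBNAME. Oracle tables are referenced like SAS data sets,
   and SAS generates the Oracle SQL behind the scenes. */
libname ora oracle user=scott password=XXXX path=orapath;

proc means data=ora.orders sum;
   class customer_id;
   var amount;
run;
```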
Within Hadoop, Hive is an SQL-based application for managing and querying data, so Hive is the application that the SAS/ACCESS Interface to Hadoop interacts with, allowing users to apply these same well-established SQL-based SAS/ACCESS techniques to data in Hadoop. Compare the two programs below. The first uses SAS/ACCESS to Oracle to process an Oracle table. The second uses SAS/ACCESS to Hadoop to process a Hive table:
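A sketch of such a pair; the server names, credentials, schema, and table are illustrative assumptions:

```sas
/* SAS/ACCESS to Oracle: a LIBNAME pointing at an Oracle database. */
libname mydblib oracle user=scott password=XXXX path=orapath schema=sales;

proc print data=mydblib.orders(obs=10);
run;

/* SAS/ACCESS to Hadoop: the same program against a Hive table,
   changing only the LIBNAME engine and connection options. */
libname myhive hadoop server='hivenode.example.com' port=10000
                      user=sasdemo schema=sales;

proc print data=myhive.orders(obs=10);
run;
```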
The only difference is the connection options you specify on the LIBNAME statement.
SAS/ACCESS Interface to Hadoop, like all SAS/ACCESS technologies, is designed to push as much processing as possible into the Hadoop cluster to maximize efficiency. With SQL pass-through, the native query is sent to Hive for execution, so the data stays in Hadoop and is processed in parallel. When you use the LIBNAME method to process Hive tables, SAS generates Hive SQL (HiveQL) to request data from Hive. For some SAS language elements, SAS automatically converts those elements into HiveQL equivalents to maximize processing in Hadoop and to limit the volume of data returned to SAS. Examples include WHERE statements for subsetting, summary calculations performed by a handful of procedures including PROC MEANS and PROC FREQ, and KEEP or DROP data set options that limit the number of columns returned.
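For instance, in a LIBNAME-based program along these lines (connection options and table names are illustrative assumptions), the WHERE statement, the KEEP= option, and the PROC MEANS summarization can all be expressed in HiveQL, so the heavy lifting happens in Hadoop and only the summary rows return to SAS:

```sas
libname myhive hadoop server='hivenode.example.com' port=10000
                      user=sasdemo schema=sales;

/* The WHERE statement and KEEP= option are converted to HiveQL, and
   the summarization is requested from Hive, so the subsetting and
   aggregation run in Hadoop rather than after moving the data to SAS. */
proc means data=myhive.transactions(keep=region amount) sum mean;
   where region = 'WEST';
   var amount;
run;
```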
What if you want to read and process a Hive table in Hadoop with a DATA step? With the exception of a small set of language elements like WHERE statements and KEEP and DROP data set options, the DATA step language is not converted into HiveQL equivalents by the SAS/ACCESS Interface to Hadoop, so the data must be returned to SAS for further processing. Because of the unique processing capabilities of the DATA step, performing DATA step-like processing on Hive tables in Hadoop requires going beyond generated HiveQL. The solution is DS2 and the Code Accelerator for Hadoop, which we'll go over in more detail next week in Part 2 on DS2 programming with Hadoop.
Interested in learning more about HiveQL and Hadoop? Check out these resources to dive deeper.