Long ago in a cluster far, far away, users accessed Hadoop through various command line interfaces. These interfaces allowed users to create directories and files in HDFS, write complex procedural programs with MapReduce, or work with data through an extensible language known as Pig Latin (or Pig). These interfaces are still maintained in the various distributions of Hadoop, and many users still rely on them in their day-to-day roles. Pig was introduced specifically for its ease of programming and automatic optimization. If you're more of a HiveQL fan, I have another article you should read: SAS/ACCESS to Hadoop, not your parents' access engine.
The problem with these different interfaces has always been their integration with the larger world of data management and business intelligence. Native DDL (Data Definition Language) and DML (Data Manipulation Language) support through HiveQL, together with the Hive Metastore and HCatalog, was a great addition to Hadoop. However, the data scientist who owns and maintains programs written in the other native interfaces may have felt like Starkiller Base was on the horizon.
How do you run HDFS/MR/PIG commands remotely in tandem while accessing other Hadoop applications?
How can you seamlessly combine Hadoop native programming with other traditional data systems?
You don't need to lead a rebellion to maintain access to a variety of Hadoop programming interfaces. SAS offers a procedure called, simply, PROC HADOOP. Available since SAS 9.3M2 as a foundation procedure, PROC HADOOP allows users to incorporate native HDFS, MapReduce, and Pig commands into their SAS programs. From an operational standpoint, the configuration is the same as for other SAS Hadoop technologies, which use environment variables to provide the necessary connection paths and libraries. IT managers don't need to worry about a two-meter exhaust port; PROC HADOOP configuration works with the various Hadoop security protocols.
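As a minimal sketch of that configuration on a Linux SAS server, the environment variables below point SAS at the cluster's JAR and configuration files before the SAS session starts. The directory paths are assumptions; substitute the locations where your site staged the files copied from the Hadoop cluster:

```shell
# Assumed locations of the Hadoop JAR and config files copied from the cluster
export SAS_HADOOP_JAR_PATH=/opt/sas/hadoopjars
export SAS_HADOOP_CONFIG_PATH=/opt/sas/hadoopcfg
```

With these set, PROC HADOOP (and the other SAS Hadoop technologies) can locate the client libraries and connection settings without hard-coding them in each program.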
You'll find that the command syntax is easy to understand. First, the native Hadoop commands:
#hadoop fs -mkdir /user/sasabc/new_directory
#hadoop jar /share/hadoop/mapreduce/WordCount.jar wordcount /user/sasabc/architectdoc.txt /user/sasabc/outputtest
#pig id.pig
And the equivalent in SAS PROC HADOOP:
proc hadoop username='sasabc' password='sasabc' verbose;
   hdfs mkdir='/user/sasabc/new_directory';

   mapreduce input='/user/sasabc/architectdoc.txt'
      output='/user/sasabc/outputtest'
      jar='C:\Users\sasabc\Hadoop\jars\WordCount.jar'
      outputkey='org.apache.hadoop.io.Text'
      outputvalue='org.apache.hadoop.io.IntWritable'
      reduce='org.apache.hadoop.examples.WordCount$IntSumReducer'
      combine='org.apache.hadoop.examples.WordCount$IntSumReducer'
      map='org.apache.hadoop.examples.WordCount$TokenizerMapper';

   pig code=id.pig
      registerjar='C:\Users\sasabc\Hadoop\jars\myudf.jar';
run;
In these examples, notice that the MapReduce JAR and Pig code are stored in the user's Windows directory (SAS is supported on various operating systems). This is the operating system where the user runs SAS, which is remote to the Hadoop cluster.
In the SAS 9.4M3 release, another option for submitting PROC HADOOP MapReduce and Pig code is through the Apache Oozie RESTful API. This release also includes the PROPERTIES statement, which can replace a Hadoop configuration file or supplement one. For instance, you can direct execution to a specific job queue with the PROPERTIES statement: properties 'mapreduce.job.queuename'='mygroup';
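To show where the PROPERTIES statement fits, here is a minimal sketch that reuses the earlier connection setup; the queue name and file paths are illustrative, not prescribed:

```sas
proc hadoop username='sasabc' password='sasabc' verbose;
   /* Route this job to a specific YARN queue; 'mygroup' is an illustrative name */
   properties 'mapreduce.job.queuename'='mygroup';

   mapreduce input='/user/sasabc/architectdoc.txt'
      output='/user/sasabc/outputtest'
      jar='C:\Users\sasabc\Hadoop\jars\WordCount.jar';
run;
```

Because the property is set inside the procedure, the same SAS program can target different queues without touching the site-wide Hadoop configuration files.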
By providing an access path to HDFS, MapReduce, and Pig, PROC HADOOP gives users more tools to manage their data in Hadoop. So, before you use the Force to run native HDFS commands, or hide inside a tauntaun rather than run existing Hadoop programs like MapReduce and Pig… try running PROC HADOOP!
Here are some other resources that may be helpful:
Training: Introduction to SAS and Hadoop
And be sure to follow the Data Management section of the SAS Communities Library for more articles on how SAS Data Management works with Hadoop. Here are links to other posts in the series for reference: