SAS HADOOP procedure: Managing data in Hadoop is the first order

Long ago in a cluster far, far away, users accessed Hadoop through various command line interfaces. These interfaces allowed users to create directories and files in HDFS, write complex procedural programs with MapReduce, or work with data through an extensible language known as Pig Latin (or Pig). These interfaces are still maintained in the various Hadoop distributions, and many users still rely on them in their day-to-day roles. Pig was introduced specifically for its ease of programming and automatic optimization. If you’re more of a HiveQL fan, I have another article you should read: SAS/Access to Hadoop, not your parents' access engine.

The problem with these different interfaces has always been integrating them with the larger world of data management and business intelligence. Native DDL (Data Definition Language) and DML (Data Manipulation Language) tasks through HiveQL, along with integration with the Hive Metastore and HCatalog, were a great addition to Hadoop. However, the data scientist who owns and maintains programs in other native interfaces may have felt like the Starkiller Base was on the horizon.

How do you run HDFS/MR/PIG commands remotely in tandem while accessing other Hadoop applications?

How can you seamlessly combine Hadoop native programming with other traditional data systems?

You don't need to lead a rebellion to maintain access to a variety of Hadoop programming interfaces. SAS offers a procedure called, simply, PROC HADOOP. Available since 9.3M2 as a foundation procedure, PROC HADOOP allows users to incorporate native HDFS/MR/PIG commands into their SAS programs. From an operational standpoint, the configuration is the same as for other SAS Hadoop libraries, which use environment parameters to provide the necessary connection paths and libraries. IT managers don't need to worry about a two-meter exhaust port; PROC HADOOP configuration will work with various Hadoop security protocols.
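As a rough sketch of that configuration: the connection paths are typically supplied through two environment variables, SAS_HADOOP_JAR_PATH and SAS_HADOOP_CONFIG_PATH, set before PROC HADOOP runs. The directory names below are placeholders rather than paths from this article; point them at wherever your site keeps the Hadoop client JAR files and the cluster configuration files.

```sas
/* Assumed locations: replace both with the directories used at your site */
options set=SAS_HADOOP_JAR_PATH="C:\Hadoop\client\jars";    /* Hadoop client JAR files */
options set=SAS_HADOOP_CONFIG_PATH="C:\Hadoop\client\conf"; /* cluster configuration XML files */
```

Setting the variables in the SAS session keeps the example self-contained; many sites set them once in the SAS configuration file or at the operating system level instead.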

You'll find that command syntax is easy to understand:

Native Commands:

#hadoop fs -mkdir /user/sasabc/new_directory

#hadoop jar /share/hadoop/mapreduce/WordCount.jar wordcount /user/sasabc/architectdoc.txt /user/sasabc/outputtest

#pig id.pig

SAS PROC HADOOP:

proc hadoop username='sasabc' password='sasabc' verbose;
   hdfs mkdir='/user/sasabc/new_directory';

   mapreduce input='/user/sasabc/architectdoc.txt'
     output='/user/sasabc/outputtest'
     jar='C:\Users\sasabc\Hadoop\jars\WordCount.jar'
     outputkey='org.apache.hadoop.io.Text'
     outputvalue='org.apache.hadoop.io.IntWritable'
     reduce='org.apache.hadoop.examples.WordCount$IntSumReducer'
     combine='org.apache.hadoop.examples.WordCount$IntSumReducer'
     map='org.apache.hadoop.examples.WordCount$TokenizerMapper';

   pig code=id.pig registerjar='C:\Users\sasabc\Hadoop\jars\myudf.jar';
run;

In these examples, notice that the MapReduce and Pig code for execution is stored in this user's Windows directory (SAS is supported on various operating systems). This is the operating system where the user runs SAS, which is remote to the Hadoop cluster.
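Because the Pig script lives on the SAS client, one way to point PROC HADOOP at it is through a fileref. The sketch below assumes the script is saved as id.pig in a hypothetical local folder; the fileref name pigcode and that folder are illustrative and do not come from the example above.

```sas
/* Hypothetical local path to the Pig script on the SAS client machine */
filename pigcode 'C:\Users\sasabc\Hadoop\pig\id.pig';

proc hadoop username='sasabc' password='sasabc' verbose;
   /* CODE= takes the fileref; the Pig Latin it points to is submitted to the cluster */
   pig code=pigcode registerjar='C:\Users\sasabc\Hadoop\jars\myudf.jar';
run;
```

A fileref keeps the local path in one place, which helps when the same program has to run from SAS sessions on different operating systems.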

In the 9.4M3 release of SAS, another option for submitting PROC HADOOP MapReduce and Pig code is through the Apache Oozie RESTful API. This release also includes the PROPERTIES statement, which can replace a Hadoop configuration file or act as an enhancement to a configuration file. For instance, you can specify the execution to use a job queue with the PROPERTIES statement (prop 'mapreduce.job.queuename'='mygroup').
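To make the job queue example concrete, here is a minimal sketch that places the PROPERTIES statement (written with the prop form used above) ahead of the WordCount job from earlier. The queue name mygroup is only an example value, and the output directory outputtest_queue is a hypothetical name chosen so that the job does not write to a directory that already exists.

```sas
proc hadoop username='sasabc' password='sasabc' verbose;
   /* Route the job to a specific queue; 'mygroup' is an example queue name */
   prop 'mapreduce.job.queuename'='mygroup';

   /* Same WordCount job as before, written to a new (hypothetical) output directory */
   mapreduce input='/user/sasabc/architectdoc.txt'
     output='/user/sasabc/outputtest_queue'
     jar='C:\Users\sasabc\Hadoop\jars\WordCount.jar'
     outputkey='org.apache.hadoop.io.Text'
     outputvalue='org.apache.hadoop.io.IntWritable'
     reduce='org.apache.hadoop.examples.WordCount$IntSumReducer'
     combine='org.apache.hadoop.examples.WordCount$IntSumReducer'
     map='org.apache.hadoop.examples.WordCount$TokenizerMapper';
run;
```

This sketch does not switch on Oozie submission; it only shows a property being supplied in the program rather than in a configuration file.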

By providing an access path to HDFS/MR/PIG, PROC HADOOP gives users more tools to manage their data in Hadoop. So, before you use the force to run native HDFS commands, or hide inside a Tauntaun rather than run existing Hadoop programs like MapReduce and Pig… try running PROC HADOOP!

Here are some other resources that may be helpful:

Training: Introduction to SAS and Hadoop

Webinars: Getting Started with SAS and Hadoop, SAS Integration with Hadoop: Part II

And be sure to follow the Data Management section of the SAS Communities Library for more articles on how SAS Data Management works with Hadoop. Here are links to other posts in the series for reference:

AndreiMitiaev · ‎02-09-2016

Would it work in the same way with KNOX implemented on Hadoop?

Thanks,

Andrei Mitiaev

ACBradley · ‎02-11-2016

Hi Andrei,

In SAS 9.4M3 the "SAS_HADOOP_RESTFUL" option allows PROC HADOOP to communicate with the Hadoop REST API. HDFS commands go through WebHDFS and MR/PIG would be submitted with the Oozie REST API. This would require some setup to have the MR and/or PIG jobs accessible to the Oozie ShareLib.

For more information, you can find syntax and examples here:

https://support.sas.com/documentation/cdl/en/proc/68954/HTML/default/viewer.htm#titlepage.htm

ajain59 · ‎03-22-2017

Hi,

The link below is not working.

https://support.sas.com/documentation/cdl/en/proc/68954/HTML/default/viewer.htm

Can you please provide the correct URL?

ACBradley · ‎03-27-2017

Hi ajain59,

This link will take you to the SAS(r) 9.4 Base Procedures Guide main page:

http://support.sas.com/documentation/cdl/en/proc/70377/HTML/default/viewer.htm#titlepage.htm

On the left side of the screen, there is a listing of all the SAS procedures, including PROC HADOOP. I'm going to include a direct link to that procedure below. However, that link contains a hash value that will likely change with version and maintenance releases. You should always be able to navigate to PROC HADOOP through the above link.

http://support.sas.com/documentation/cdl/en/proc/70377/HTML/default/viewer.htm#p0esxx8qmpi2p8n1mdmwn...

Please let me know if you have any further questions and thanks for reading!

Clark
