Working with SAS and Hadoop: Part 2 - DS2 and the SAS Code Accelerator

by SAS Employee DavidGhan on ‎06-05-2017 10:57 AM - edited on ‎06-14-2017 11:15 AM by Community Manager (1,535 Views)

Welcome back! In my previous post, after a quick primer on Hadoop, we discussed how SAS and Hadoop work together. This second post focuses on DS2 and the SAS Code Accelerator for Hadoop.


DS2 and the Code Accelerator for Hadoop: The SAS DS2 procedure uses a language closely related to the DATA step. PROC DS2 is part of Base SAS and contains most of the language elements of the DATA step. Unlike a DATA step, a DS2 program can execute in parallel. With Base SAS alone, that parallel processing occurs on the machine where SAS executes. Better yet, if you license the SAS Code Accelerator for Hadoop and name Hive tables as the input datasets in your DS2 programs, you can send the DS2 code to execute in parallel within the Hadoop cluster. This works because the SAS Code Accelerator software is installed on each machine in the Hadoop cluster, so the DS2 code runs in parallel where the data resides. Output datasets can be returned to the SAS machine or stored as Hive tables in Hadoop.


There are some syntax differences between a regular DATA step and DS2, but there are many similarities as well. Once DATA step programmers learn the specifics of the DS2 language, they can apply their existing skills in DS2 using the same types of statements. Compare the two programs below. The first processes a SAS dataset with a DATA step on the local SAS machine. The second processes a Hive table in parallel inside Hadoop using PROC DS2. Similar statements are highlighted in blue to show that the core logic of the DATA step program appears in the DS2 program as identical code. In the DS2 program, that logic defines a thread called compute. The thread is a stored object that is then declared and called in a subsequent data program (highlighted in yellow). When the data program runs, the statements in the thread are sent to execute on each of the Hadoop machines where the Hive table (hivelib.customerorders) is stored.

[Image: image 3.PNG, a DATA step program and the equivalent PROC DS2 program]
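As a rough text stand-in for the comparison described above, here is a minimal sketch of the two styles. It is not the code from the original image. The thread name compute and the table hivelib.customerorders come from the post. The columns quantity, price, and total, the output table names, the input table work.customerorders, and the LIBNAME connection values are hypothetical placeholders. The DS2ACCEL=YES option is included on the assumption that in-database execution has to be requested explicitly at your site; check your release's default.

```sas
/* Hypothetical Hive libref: server and user are placeholders, not real values. */
libname hivelib hadoop server="hive.example.com" user=myuser;

/* 1. Ordinary DATA step: runs single-threaded on the SAS machine.      */
/*    work.customerorders is an assumed local copy of the data.         */
data work.order_totals;
  set work.customerorders;
  total = quantity * price;        /* hypothetical columns */
run;

/* 2. PROC DS2: the same core logic wrapped in a thread program.        */
/*    DS2ACCEL=YES asks SAS to run the thread inside Hadoop when the    */
/*    Code Accelerator is licensed and installed (assumption).          */
proc ds2 ds2accel=yes;

  /* Thread program: stored object holding the row-level logic. */
  thread compute / overwrite=yes;
    dcl double total;              /* DS2 expects variables to be declared */
    method run();
      set hivelib.customerorders;  /* Hive table named in the post */
      total = quantity * price;    /* same statement as in the DATA step */
    end;
  endthread;

  /* Data program: declares an instance of the thread and reads from it. */
  data hivelib.order_totals (overwrite=yes);
    dcl thread compute t;
    method run();
      set from t;
    end;
  enddata;

run;
quit;
```

In this sketch the SET and assignment statements carry over unchanged from the DATA step; only the surrounding thread, method, and data program wrappers are new. If the thread read a SAS dataset instead of a Hive table, the same structure would run in parallel on the SAS machine, for example with `set from t threads=4;` in the data program.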


Want to learn more? Dive deeper with these training opportunities and online resources:

And keep an eye out next week for a final post in this series where I will discuss using Hadoop with SAS In-Memory Analytics and SAS Viya.
