
SAS with Hadoop performance: Best practices and lessons learned


With more and more customers, across all industry sectors, implementing Hadoop clusters (no longer only for PoCs but now also in production), and with a growing number of our products integrating with Hadoop (SAS Data Loader for Hadoop, SPDE/S on Hadoop, SAS Grid Manager for Hadoop, High-Performance Data Mining, etc.), I thought it would be a good idea to write a paper gathering best practices and lessons learned from the field on the performance topic.

 

Moving away from a traditional SAS environment with high-performing SAN storage to a full "SAS on Hadoop" platform does not necessarily mean a performance gain (it might even mean the opposite).

However, Hadoop can bring multiple benefits to the customer, including lower cost, scalability, and resilience. The paper intends to give guidance to allow SAS to be as efficient as possible when integrating with a Hadoop ecosystem. The paper's objectives and content are outlined below. If you want to download the paper, it can be found on the SAS support website here.

 

Push it down in Hadoop!


 

A very general recommendation when SAS interacts with an external data store is to avoid downloading the remote data to the SAS compute server/client and to use "In-database" processing instead. Using SAS In-database processing, you can run scoring models, some SAS procedures, DS2 thread programs, and formatted SQL queries inside the data source.
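
To make this more concrete, here is a minimal sketch of the pattern, assuming a SAS/ACCESS Interface to Hadoop licence and a reachable HiveServer2; the server name, schema, and tables are hypothetical, and the exact connection options depend on your site configuration. Because both the source and the target live in the Hadoop library, the engine can translate the whole statement to HiveQL and keep the data inside the cluster.

/* Assign a library that points at a Hive schema through SAS/ACCESS       */
/* (hypothetical host, schema, and credentials)                           */
libname hdp hadoop server="hivenode.example.com" port=10000
        schema=sales_db user=sasdemo;

/* Source and target both reside in the HDP library, so the SAS/ACCESS    */
/* engine can pass the whole CREATE TABLE AS SELECT down to Hive          */
/* (implicit pass-through) instead of downloading the rows to SAS.        */
proc sql;
    create table hdp.sales_summary as
    select region,
           sum(amount) as total_amount
    from hdp.sales
    group by region;
quit;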

 

This same recommendation is even more important in the Hadoop world, as the volume of data stored in Hadoop can be massive (source and target tables often run to hundreds of gigabytes if not terabytes, and only a Hadoop platform with many, many worker nodes will be able to crunch the data in an acceptable time).

Bringing the data back to the SAS server will result in poor performance and, depending on the data volume, may cause the SAS host to run very low on computing resources.

 

So our main goal will be to push most of the data processing down into Hadoop; we will talk about "In-Hadoop" processing.

 

Simply by using the best practices and optimization techniques, you can make sure your data management operations (PROC SQL, basic PROCs, or DATA steps) will be successfully converted by the SAS/ACCESS engine to run inside Hadoop, even if you don't have the SAS Embedded Process deployed.
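
An easy way to check whether a given step was actually pushed down is to turn on the standard SAS/ACCESS tracing options and read the generated SQL in the log. The sketch below reuses the hypothetical HDP library and SALES table from the previous example; the option values are the usual tracing settings, not anything specific to this paper.

/* Write the SQL generated by the SAS/ACCESS engine to the SAS log        */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

/* If the WHERE clause and the aggregation are passed down, the log shows */
/* the HiveQL sent to the cluster and only the small summary result comes */
/* back to WORK; if you instead see rows fetched block by block, the      */
/* pushdown failed and the statement should be rewritten.                 */
proc sql;
    create table work.top_regions as
    select region, count(*) as nb_orders
    from hdp.sales
    where sales_year = 2016
    group by region
    having count(*) > 1000;
quit;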

 

However, products such as the SAS In-Database Code Accelerator (part of SAS Data Loader for Hadoop) and the SAS Scoring Accelerator for Hadoop (both relying on the SAS Embedded Process) bring new capabilities, not only to process and clean the data but also to run analytics directly inside Hadoop.
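
As an illustration of what the Embedded Process enables, here is a hedged sketch of the typical shape of an In-Database Code Accelerator job: a DS2 thread program whose row-level logic runs in parallel on the data nodes when DS2ACCEL=YES is in effect and both input and output are Hive tables. HDP.SALES, the output table, and the scoring formula are made-up placeholders.

proc ds2 ds2accel=yes;
    thread score_th / overwrite=yes;
        dcl double predicted;
        method run();
            set hdp.sales;
            /* row-level logic executed in parallel where the data lives */
            predicted = 0.8 * amount;
        end;
    endthread;

    data hdp.sales_scored (overwrite=yes);
        dcl thread score_th t;
        method run();
            set from t;
        end;
    enddata;
    run;
quit;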

 

Content of the paper

 

  • The first chapter of the document focuses on SAS/ACCESS best practices and tips, to make sure that the bulk of the data management operations that can be done by the Hadoop cluster is indeed done by the Hadoop processing framework (leveraging distributed processing across the Hadoop nodes). It also discusses some limitations in Hive, popular new file formats in Hadoop (Avro, Parquet, ORC, etc.), and partitioning (a short sketch of these techniques follows this list).
  • In the second chapter, we assume that the SAS Embedded Process has been deployed inside the Hadoop cluster and focus on ways to leverage it to run SAS analytics operations directly "In-Hadoop".
  • In the third chapter, lessons learned from several field experiences are provided, coming from various PoC ("Proof of Concept") and project experiences.
    Note: Big thanks to my PSD colleagues from various places (SAS Belgium, SAS Netherlands, SAS Portugal, SAS Philippines) for sharing their experience and knowledge from various PoCs and projects.
  • The last chapter presents the main options and tools to monitor (and, if needed, troubleshoot) SAS analytics processing in Hadoop.
  • Finally, the results of performance tests using alternative file types in HDFS/Hive (Hive tables stored as text, SPDE, Hive tables stored as ORC, Avro, HDMD files, and SASHDAT) are presented.
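
As a taste of the file-format and partitioning discussion from the first chapter, here is a hedged sketch (all object names invented, options as documented for SAS/ACCESS to Hadoop) of two common techniques: the DBCREATE_TABLE_OPTS= data set option to control the Hive storage format, and explicit SQL pass-through for DDL that the engine does not generate on its own, such as a partitioned table.

/* Store a new Hive table in the ORC columnar format                      */
data hdp.weblogs_orc (dbcreate_table_opts='stored as orc');
    set work.weblogs;
run;

/* Explicit pass-through: the DDL is sent verbatim to Hive, so any Hive   */
/* feature (here, partitioning) can be used                               */
proc sql;
    connect to hadoop (server="hivenode.example.com" user=sasdemo);
    execute (
        create table weblogs_part (url string, hits int)
        partitioned by (log_date string)
        stored as orc
    ) by hadoop;
    disconnect from hadoop;
quit;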

 

 

Conclusion

The paper is not an exhaustive guide to all possible optimizations, but rather a collection of tricks and best-practice reminders coming from field experience. It will hopefully allow quick gains for consultants on the ground when performance issues arise in a SAS with/on/in Hadoop environment. The paper is available on the SAS support website here.

