More and more customers, in all industry sectors, are implementing Hadoop clusters (no longer only for PoCs but now in production as well), and more and more of our products integrate with Hadoop (Data Loader for Hadoop, SPDE/S on Hadoop, SAS Grid Manager for Hadoop, High-Performance Data Mining, etc.). So I thought it would be a good idea to write a paper gathering some best practices and lessons learned from the field on the topic of performance.
Moving away from a traditional SAS environment with high-performance SAN storage to a full "SAS on Hadoop" platform does not necessarily mean a performance gain (it might even mean the opposite).
However, Hadoop can bring multiple benefits to the customer, including lower cost, scalability and resilience. The paper intends to give some guidance to allow SAS to be as efficient as possible when integrating with a Hadoop ecosystem. The paper’s objectives and content are outlined below. If you want to download the paper, it can be found on the SAS support web site: here
A very general recommendation when SAS interacts with an external data store is to avoid downloading remote data to the SAS compute server/client and to use “In-database” processing instead. Using SAS In-database processing, you can run scoring models, some SAS procedures, DS2 thread programs, and formatted SQL queries inside the data source.
This recommendation is even more important in the Hadoop world, as the volume of data stored in Hadoop can be massive: source and target tables are in the hundreds of gigabytes, if not terabytes, and only a Hadoop platform with many worker nodes will be able to crunch the data in an acceptable time.
Bringing the data back to the SAS server will result in poor performance and, depending on the volume of data, may cause the SAS host to run very low on computing resources.
So our main goal will be to push most of the data processing down to Hadoop. We will talk about “In-Hadoop” processing.
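To make the principle concrete, here is a minimal sketch in Python. It is not SAS code and does not involve Hadoop: it uses the standard library's `sqlite3` module as a stand-in for the remote data store, and the `sales` table and the three helper functions are hypothetical names chosen for illustration. It only shows how many rows have to travel to the client in each approach.

```python
import sqlite3


def build_store(n_rows=10_000):
    """Create an in-memory table that plays the role of the remote data store."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    rows = [
        ("EMEA" if i % 2 == 0 else "APAC", float(i % 100))
        for i in range(n_rows)
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()
    return conn


def totals_by_download(conn):
    """Pull every row to the client, then aggregate locally."""
    rows = conn.execute("SELECT region, amount FROM sales").fetchall()
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    return totals, len(rows)  # len(rows) = rows moved to the client


def totals_by_pushdown(conn):
    """Let the data store aggregate; only the summary rows come back."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
    return dict(rows), len(rows)  # one row per region


if __name__ == "__main__":
    conn = build_store()
    local_totals, moved_local = totals_by_download(conn)
    pushed_totals, moved_pushed = totals_by_pushdown(conn)
    print("rows moved (download):", moved_local)
    print("rows moved (push-down):", moved_pushed)
    print("same result:", local_totals == pushed_totals)
```

Both functions return the same totals, but the first one moves the whole table to the client while the second moves one row per region. With tables of hundreds of gigabytes, that difference is what separates an acceptable run time from an overloaded SAS host.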
Simply by applying best practices and optimization techniques, you can make sure your data management operations (PROC SQL, basic PROCs or DATA step) are successfully converted by the SAS/ACCESS engine to run inside Hadoop, even if you don’t have the “SAS Embedded Process” deployed.
However, products like SAS In-Database Code Accelerator (SAS Data Loader for Hadoop) and SAS Scoring Accelerator for Hadoop, both of which rely on the SAS Embedded Process, bring new capabilities, not only to process and clean the data but also to run analytics directly inside Hadoop.
The paper is not an exhaustive guide to all possible optimizations, but rather a collection of tricks and best-practice reminders drawn from field experience. It will hopefully allow quick gains for consultants on the ground when performance issues arise in a SAS with/on/in Hadoop environment. The paper is available on the SAS support web site: here