
SAS with Hadoop: Performance considerations and monitoring strategies

SAS Employee
Posts: 2

Hi there!

If you are working with Hadoop data (Hive, HDFS, SerDe files) in your SAS project and want to improve performance and benefit as much as possible from the Hadoop cluster's capabilities, then this new paper available on support.sas.com could be helpful for you.

It is not an exhaustive guide to all possible optimizations, but rather a collection of tricks and best-practice reminders drawn from field experience. It is meant to help the consultant on the ground when performance issues arise in a SAS with Hadoop environment.

The paper provides best practices, performance tricks, and guidance around monitoring for your SAS jobs in the cluster: https://support.sas.com/resources/thirdpartysupport/v94/hadoop/sas-hadoop-performance-strategies.pdf...

Valued Guide
Posts: 505

Re: SAS with Hadoop: Performance considerations and monitoring strategies

Very interesting 

 

I wonder if you could provide a download of the code and data (30 million rows) used to produce the benchmarks on page 16. Also, a rough idea of the hardware and software costs for the Hadoop site, the number of simultaneous users, and the systemwide CPU utilization during the benchmarks.

 

I don't have Hadoop, but I do have an off-lease Dell T7400 ($600, circa 2008) with dual Xeons, two RAID 0 SSD arrays, and 64 GB of RAM. I would like to set up SPDE and compare my timings with your benchmarks.

 

Also, it would be nice in the future if you could provide comparisons with inexpensive power workstations for cases where the data is less than 1 TB.

 

My experience is that an old, cheap SAS power workstation can be up to an order of magnitude faster than a server running at 90% CPU utilization. Servers are often tuned to run at 90% or more (average workday load); otherwise the company is wasting money.

Super User
Posts: 5,441

Re: SAS with Hadoop: Performance considerations and monitoring strategies

I think such a benchmark exercise is meaningless. As of today, SAS engines such as SPDE in an SMP architecture will outrun any Hadoop implementation unless the data size grows beyond tens of TB.
I would use this paper as a guideline on what to do in such large environments, not as a blueprint for small-scale Hadoop setups.
Data never sleeps
SAS Employee
Posts: 2

Re: SAS with Hadoop: Performance considerations and monitoring strategies

Thank you @rogerjdeangelis and @LinusH for your comments.

Unfortunately, the sample table is not publicly available at the moment. The SAS code used is very basic. For example, for PROC HPSUMMARY it is something like:

/* Run PROC HPSUMMARY across the nodes */
proc hpsummary data=hivelib.megacorp2;
   performance nodes=all details;
   var expenses;
   output out=work.expenses_by_products;
   class productbrand y;
run;

 

These tests are indicative only: they show that the choice of file format for Hadoop storage can matter, depending on your use case. They should not be considered a reference (unlike the official benchmarks that are frequently published by our EEC service). Regarding SPDE, please note that I used the SPDE format on HDFS (not the traditional SPDE format on a local file system) for the comparison.
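
For readers who want to set up a similar comparison, here is a minimal sketch of how the storage targets could be declared. It is not the code used for the paper. The paths, host name, and librefs (spdloc, spdhdfs) are hypothetical, and ORC is picked only as an example of a Hive file format, since the thread does not say which formats were compared. It assumes SAS/ACCESS Interface to Hadoop is licensed and the Hadoop client JAR and configuration files are available to the SAS session.

```sas
/* Sketch only: paths, host names and librefs below are hypothetical. */

/* Point SAS at the Hadoop client JARs and cluster configuration files
   (needed for SPDE on HDFS and for SAS/ACCESS to Hadoop). */
options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoop/jars";
options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoop/conf";

/* Traditional SPDE library on a local file system */
libname spdloc spde '/data/spde/local';

/* SPDE library stored on HDFS: same engine, plus HDFSHOST=DEFAULT */
libname spdhdfs spde '/user/sasdemo/spde' hdfshost=default;

/* Hive library through SAS/ACCESS Interface to Hadoop */
libname hivelib hadoop server="hive.example.com" port=10000 schema=default;

/* Copy the same source table to each SPDE target so timings can be compared */
data spdloc.megacorp2;
   set hivelib.megacorp2;
run;

data spdhdfs.megacorp2;
   set hivelib.megacorp2;
run;

/* Create a copy in another Hive file format inside the cluster,
   using explicit pass-through so the data does not travel through SAS */
proc sql;
   connect to hadoop (server="hive.example.com" port=10000 schema=default);
   execute (create table megacorp2_orc stored as orc
            as select * from megacorp2) by hadoop;
   disconnect from hadoop;
quit;
```

Running the same PROC HPSUMMARY step against each copy, and comparing the timings reported in the log, would then show how much the storage format matters for a given table and query.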

 

I agree with LinusH's final comments: Hadoop does not necessarily mean better performance (especially if your SAS server is well tuned, has good I/O, and the table is not that big). Hadoop was really designed to provide scalability on huge amounts of data that could not fit on a single machine, or could not be processed in time or efficiently using SMP.

 

Thanks

Raphael
