Hi there!
If you are working with Hadoop data (Hive, HDFS, SerDe files) in your SAS project and want to improve performance and benefit as much as possible from your Hadoop cluster's capabilities, then this new paper available on support.sas.com could be helpful for you.
It is not an exhaustive guide to every possible optimization, but rather a collection of tricks and best-practice reminders drawn from field experience. It is meant to help consultants on the ground when performance issues arise in a SAS with Hadoop environment.
The paper provides best practices, performance tricks, and guidance on monitoring your SAS jobs in the cluster: https://support.sas.com/resources/thirdpartysupport/v94/hadoop/sas-hadoop-performance-strategies.pdf
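For orientation, a typical way to reach Hive data from SAS is a SAS/ACCESS Interface to Hadoop LIBNAME. Here is a minimal sketch; the server name, port, user, and schema are placeholder assumptions, not values from the paper:

/* Connect to Hive through SAS/ACCESS Interface to Hadoop.           */
/* Server, port, user, and schema are placeholders for illustration. */
libname hivelib hadoop server="hive-node.example.com"
                       port=10000
                       user=sasdemo
                       schema=default;

/* Quick check: list the Hive tables visible through the libref. */
proc datasets lib=hivelib;
quit;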
Very interesting
I wonder if you could provide a download of the code and data (30 million rows) used to produce the benchmarks on page 16, plus a rough idea of the hardware and software costs for the Hadoop site, the number of simultaneous users, and the system-wide CPU utilization during the benchmarks.
I don't have Hadoop, but I do have an off-lease Dell T7400 ($600, circa 2008) with dual Xeons, two RAID 0 SSD arrays, and 64 GB of RAM. I would like to set up SPDE and compare my timings with your benchmarks.
It would also be nice if, in the future, you could provide comparisons against inexpensive power workstations when the data is less than 1 TB.
My experience is that an old, cheap, powerful SAS workstation can be up to an order of magnitude faster than servers at 90% CPU utilization. Servers are often tuned to run at 90% or more (average workday load); otherwise the company is wasting money.
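For anyone wanting to try a similar comparison, here is a minimal sketch of a traditional SPD Engine LIBNAME on a local file system, spreading the data partitions across two arrays; the paths and PARTSIZE value are illustrative assumptions, not settings from the paper:

/* SPD Engine library on a local file system (illustrative paths only). */
/* DATAPATH= spreads data partitions across the two RAID 0 arrays;      */
/* PARTSIZE= controls the partition size (the value is an assumption).  */
libname spdework spde '/raid1/spde/meta'
        datapath=('/raid1/spde/data' '/raid2/spde/data')
        indexpath=('/raid1/spde/index')
        partsize=256m;

/* Copy a local table into SPDE storage before timing any summaries. */
data spdework.megacorp2;
   set work.megacorp2;   /* assumes a local copy of the test table */
run;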
Thank you @rogerjdeangelis and @LinusH for your comments.
Unfortunately the sample table is not publicly available at the moment. The SAS code used is very basic; for example, for the PROC HPSUMMARY it is something like:
/* run PROC HPSUMMARY across the nodes */
proc hpsummary data=hivelib.megacorp2;
   performance nodes=all details;
   var expenses;
   output out=work.expenses_by_products;
   class productbrand y;
run;
Those tests are indicative only, a way to show that the choice of file format for Hadoop storage can matter depending on your use case. They should not be considered a reference (unlike the official benchmarks frequently published by our EEC service). Regarding SPDE, please note that I used the SPDE format on HDFS (not the traditional SPDE format on a local file system) for the comparison.
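For context, a minimal sketch of an SPD Engine library stored in HDFS looks something like the following; the HDFS path is an illustrative assumption, and it presumes the Hadoop client configuration (SAS_HADOOP_CONFIG_PATH, SAS_HADOOP_JAR_PATH) is already in place:

/* SPD Engine library stored in HDFS rather than on the local file system. */
/* HDFSHOST=DEFAULT picks up the cluster connection from the Hadoop        */
/* configuration files referenced by SAS_HADOOP_CONFIG_PATH.               */
/* The /user/sasdemo/spde path is an assumption for illustration.          */
libname spdehdfs spde '/user/sasdemo/spde' hdfshost=default;

/* The same HPSUMMARY step can then read the SPDE-on-HDFS table. */
proc hpsummary data=spdehdfs.megacorp2;
   performance nodes=all details;
   var expenses;
   output out=work.expenses_by_products;
   class productbrand y;
run;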
I agree with LinusH's final comments: Hadoop does not necessarily mean better performance (especially if your SAS server is well tuned, has good I/O, and the table is not that big). Hadoop has really been designed to provide scalability for huge amounts of data that could not fit on a single machine, or could not be processed in time or efficiently using SMP.
Thanks
Raphael