Has anyone done much benchmarking of SPDE on HDFS vs Hive tables?
I've done some preliminary investigation and I'm finding that SPDE is approx 2-3x slower than using Hive tables. Initial tests include querying data and writing it back to the SAS workspace server, writing data to HDFS, and joining tables of different sizes between Hadoop and SAS.
For Hive I'm using the ORCFile format with Cost Based Optimisation turned on and Tez as the execution engine, so performance is good.
I'm also guessing that the SPDE engine for HDFS will be using MapReduce rather than Tez? But I'm unsure how to confirm this when running a query via SAS.
However, should query performance via SPDE on HDFS really be significantly slower than Hive?
We're using SAS 9.4 M2, so we don't have the parallel write capability from M3. I'd expect that to slow things down a little when writing to HDFS, but I was hoping that SPDE on HDFS would be a little more.. speedy!?
Are there any easy performance improvements for SPDE other than the likes of: parallelread=yes parallelwrite=yes accelwhere=yes? Has anyone experimented with IOBLOCKSIZE on HDFS?
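For reference, the kind of setup I mean looks roughly like the sketch below. The paths, libref and table name are placeholders rather than our real configuration, and which options are available depends on the maintenance release, so treat it as an illustration and not a tested recipe.

```sas
/* Placeholder paths: point these at your own Hadoop config and JAR directories. */
options set=SAS_HADOOP_CONFIG_PATH="/path/to/hadoop/conf";
options set=SAS_HADOOP_JAR_PATH="/path/to/hadoop/jars";

/* Print detailed timing in the log so runs can be compared. */
options fullstimer;

/* Hypothetical libref and HDFS directory. */
libname spdehdfs spde '/user/sasdemo/spde'
        hdfshost=default
        parallelread=yes
        parallelwrite=yes   /* parallel write is the M3 feature mentioned above */
        accelwhere=yes;

/* Hypothetical table and column: a simple filtered read to time. */
data _null_;
    set spdehdfs.mytable(where=(x > 0));
run;
```

The FULLSTIMER output in the log gives real and CPU time for each step, which is what I've been comparing against the equivalent Hive query.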
On the plus side, compatibility with existing SAS data (from either SPD Server or the BASE engine) is better with SPDE on HDFS than with Hive. I just wonder if that's the trade-off: slower performance on SPDE but better compatibility?
Has anyone else experienced something similar?
Cheers,
David