03-16-2016 12:08 PM
Has anyone done much benchmarking of SPDE on HDFS vs Hive Tables.
I've done some preliminary investigate and I'm finding that SPDE is approx 2-3x slower than using Hive tables. Initial tests include query data and writing data back to the SAS workspace server, writing data to HDFS, and joining different tables (of different sizes) between Hadoop and SAS.
For Hive I'm using the ORCFile format with Cost Based Optimisation turned on and the execution engine is TEZ, so performance is good.
I'm also guessing that the SPDE engine for HDFS will be using MapReduce rather than Tez? But I'm unsure how to confirm this when running a query via SAS.
However, should querying performance for data via SPDE on HDFS be significantly slower than Hive?
We're using SAS 9.4 M2 so we don't have Parallel write capabilities from M3, so I'd expect that might slow things down a little when writing to HDFS, but I was hoping that SPDE on HDFS would be a little more.. speedy!?
Are there any easy performance improvments for SPDE other than the likes of: parallelread=yes parallelwrite=yes accelwhere=yes ?? Has anyone experimented IOBLOCKSIZE on HDFS?
On the plus side, data compatibility between SAS data in either (SPD Server and BASE engine) is better on SPDE for HDFS than Hive. I just wonder if that's the trade off. Slower performance on SPDE but better compatibility?
Has anyone else experienced something similar?
03-16-2016 01:55 PM
Thanks @moorsd for sharing your experience.
I'm pretty sure SPDE uses MapReduce, so that should explain the preformance.
Can you describe your data and the environment you've done the tests in?
Also, what do you refer to when you say "higher" compability (in-database)? My gut feeling that SAS/ACCES to Hive is pushing more logic to the cluster than the SAS counterparts, but I might have missed some recent enhancements.
03-16-2016 04:27 PM
Thanks @moorsd and @LinusH for your questions and comments. Unfortunately I do not have access to an environment to benchmark Hive verse SPD data formats on hdfs; the environments I have access to are simply too small. A few comments on SPD Engine leveraging the MapReduce framework:
The SAS In-Database Code Accelerator for Hadoop uses the SAS DS2 language to run DS2 code in a MapReduce framework. To accomplish this, we create a DS2 THREAD program that contains our source table(s) and the business rules we need to apply to that data. To execute that THREAD program in parallel on every data node in your Hadoop cluster, we declare it to the DS2 DATA program and then execute it using the SET statement of that DATA program. To validate the PROC DS2 code ran in a MapReduce framework we will set the SAS option MSGLEVEL to I:
103 +OPTIONS MSGLEVEL=I; 104 +PROC DS2; 105 +THREAD work.thread / OVERWRITE=YES; 106 +DCL DOUBLE count; 107 +METHOD RUN (); 108 + SET myspde.&source; 109 + BY joinvalue; 110 + IF FIRST.joinvalue THEN count = 0; 111 + count + 1; 112 + IF LAST.joinvalue; 113 +END; 114 +ENDTHREAD; 115 +DATA MYSPDE.&source.2 (OVERWRITE=YES); 116 +DECLARE THREAD work.thread thrd; 117 +METHOD RUN (); 118 + SET FROM thrd; 119 +END; 120 +ENDDATA; 121 +RUN; NOTE: Created thread thread in data set work.thread. NOTE: Running THREAD program in-database NOTE: Running DATA program in-database NOTE: Execution succeeded. No rows affected. 122 +QUIT; NOTE: PROCEDURE DS2 used (Total process time): real time 20:39.18 cpu time 2.59 seconds
Notice the two notes, in the above SAS log, NOTE: Running THREAD/DATA program in-database. These notes validate the DS2 code executed in the MapReduce framework via the SAS In-Database Code Accelerator for Hadoop.
A note on IOBLOCKSIZE. If you are doing full table scans than you want a larger value i.e. 64K IOBLOCKSIZE=65536. If you are subsetting your data with WHERE clauses then you will want a smaller value i..e. 8K IOBLOCKSIZE=8192.
03-18-2016 07:51 AM
The data we're using comprises of some narrow dimensional tables (with 18 vars), some transactional data (with 30 vars) and some snapshot tables (either 2500 or 5000 vars). All of these tables have approx 57m observations.
The better compatility for SPDE come in the form of Indexes, formats and variable lengths being copied across from SPDS on SAS to SPDE on HDFS by default.
With SPDE on HDFS you can also do BY-GROUP processing and the floating point digits are exactly replicated between SPDS and SPDE. Hive has a higher floating point representation for numeric data, as it supports different data types so this causes a difference.
Also when retrieveing data from SPDE you don't have to set length statements. Data is pushed and pulled across in the same format. With hive if you pull character data back variables in SAS can be set to 32767. And if you have a lot of them like we do for the snapshot tables that have 5000+ variables this suddenly fills your temporary disk space in SAS.
All of which would make transitioning from SPDS on Unix to Hadoop easier from a devloper and analysts point of view.
The trade off for us will be at what point do you accept slower performance via SPDE vs Developer/analyst re-development time.
Does this help?
03-18-2016 08:09 AM
Well, it's always nice to know what the scenario looks like.
First, snapshot transactional table with thousands of variables...should you really import them in that format to SAS/DW storage?
My immediate architectural feeling is to store this in flat file format, and then import the parts your need, and transform it to some usable data model.
If we're talking 57m records as a total, that doesn't sound like a Hadoop case to me.
Depends on how you calculate, but worst case given your information:
5K variables * 10 bytes/variable * 57M observations = approx 3TB raw data.
I have colleagues that swing around a 20TB DW on a single node.
03-18-2016 01:48 PM
Thanks for information. The 57m file is part of one monthly snapshot table. We may have over 1800+ tables (of differeing sizes) that we'll store in Hadoop. Currently the tables are stored in Dynamic Cluster tables in SPDS.
Therefore, we plan to offload say 10 years of historic data to Hadoop and keep the most recent 'hot' data in SPDS.
The data in Hadoop will get queried and merged with the data in SPDS data, so we need it to be as performant as possible. Hence my many questions re partioning, indexing & SPDE recently.
Once again for your quick responses and suggestions, they're much appreciated.
04-22-2016 03:52 PM
For the SAS 9.4 SPD Engine (maintenance 3), to optimize WHERE processing, you can request that data subsetting be performed in the Hadoop cluster, which might improve performance by taking advantage of the filtering and ordering capabilities of MapReduce. See this documentation:
Regarding I/O operation performance, consider setting a different SPD Engine I/O block size. See this brief topic with links to the IOBLOCKSIZE= options:
For the complete documentation for using the SPD Engine to store data in a Hadoop cluster through HDFS, see:
04-22-2016 04:58 PM
My apologies. You specifically said that you were running SAS 9.4 maintenance 2. WHERE processing optimization using MapReduce is available in maintenance 2. However, maintenance 3 expanded optimized WHERE processing to include more operators and compound expressions.