With the release of SAS® 9.4 M3, you can parallel load Hive tabular data as non-default file types to the SAS LASR Analytic Server. SAS Embedded Process for Hadoop now reads and processes Hive tables stored as ORC, Parquet, Sequence, Avro, and RC file types. In this post we discuss loading Hive tables to LASR stored as Optimized Row Columnar (ORC) files.
Hadoop Hive ORC file format
ORC is a file format designed for Hadoop workloads. It provides a highly efficient way to store Hive data in a Hadoop Distributed File System (HDFS). By using ORC, you can reduce the size of the original data by as much as 75% and improve Hive querying performance. For more detail about the ORC file format, refer to the following Apache sites: orc.apache.org and cwiki.apache.org
Data load to LASR
ORC Hive table data can be parallel loaded to LASR by using High-Performance Analytics (HPA) PROCs in conjunction with SAS® Embedded Process (in-database component) for Hadoop. SAS Embedded Process for Hadoop uses HCatalog to process complex non-delimited file formats. To access and process ORC Hive tables using SAS Embedded Process, you need a few additional Hadoop JAR files on the SAS client machine at the SAS_HADOOP_JAR_PATH location. The additional Hadoop JAR files are not included in a standard list of JAR files collected by SAS Embedded Process installation steps. You have to manually copy these JAR files from the Hadoop cluster to SAS Compute Server.
Sample code to load LASR:
libname HDFS hadoop user=sasdemo server="sascdh01.race.sas.com" port=10000 SUBPROTOCOL=hive2;
libname lasr_ep sasiola port=54321 tag='HDFS';
proc lasr add port=54321
Suggestion for processing ORC-Hive table
When you store data as ORC-Hive tables, sometimes data files in HDFS become so small that execution of the data processing becomes slower in comparison to standard Hive tables. In fact, the ORC-Hive table processing becomes so slow that your Map Reduce process times out. The reason for jobs timing out could be because the YARN scheduler did not schedule an appropriate number of mapper processes to process the data using default parameters. For better performance using ORC-Hive tables, you must change some of the default parameters in the MapRead configuration. You might need to increase the Map Reduce job time-out parameters as well as a few other parameters in order to schedule a greater number of Mapper processes. The following parameters help when tuning the data processing of ORC-Hive tables:
Before changing the above parameters, you must also consider the rest of the processes running in your Hadoop environment. A Hadoop administrator can assign appropriate values for these parameters.
To help illustrate this point, let’s consider an example. Suppose you have a customer information table with ~44 M rows and you have saved this table in Hive as both Text and ORC file formats. When you load to LASR from Text-Hive tables, the load script works without any issues. But when you load LASR from ORC-Hive tables, the load process stops responding and fails. When you look at the Hive metadata table size, you are surprised to see that the customer information table takes up ~8 GB as a Text-Hive table whereas the ORC-Hive table takes up ~100 MB in HDFS. You assume that ORC-Hive tables have smaller footprints in HDFS, so they should load faster and should not hang and fail.
On further investigation, you notice that Hadoop YARN schedules a single mapper process to process this ~100MB ORC-Hive file. In the default MapRead configuration, a single mapper process sounds okay, because the default block size in HDFS is 64 MB and a single mapper can process this ~100 MB file well within the allotted time and resources. However, this single mapper process has to read the ~100 MB file and regenerate ~8 GB of data while loading to LASR. This data re-creation takes time and resources, which makes the process run slower, and eventually it is unable to complete within the default job time-out limit.
To mitigate the issue of loading the ~100MB ORC-Hive table to LASR, you must modify the following parameters with these associated values in the mapred-site.xml file:
Below is metadata information from the ORC-Hive table.
Table Type: MANAGED_TABLE
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
Table Type: MANAGED_TABLE
SAS OS Name Linux
SAS Version 9.04.01M3P06242015
# Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat
Before MapRead parameters were changed for the LASR load job running with a single mapper process:
The mapper process failed with a time-out after 600 seconds (10 minutes):
After MapRead parameters were changed for the LASR load job running with more than one mapper process:
The successful completion of the LASR load job running with more than one mapper process:
Below is the SAS log extract from the LASR load process:
HADOOP_6: Executed: on connection 2
NOTE: The LASR procedure is executing in the distributed computing environment with 3 worker nodes.
NOTE: Data loading into the LASR Analytic Server required 1129.172116 seconds.
NOTE: The data set HDFS.MEAD_CUST_ORDER_ORC.DATA has been added to the LASR Analytic Server. NOTE: The PROCEDURE LASR printed page 1.
NOTE: PROCEDURE LASR used (Total process time):
real time 19:16.66
cpu time 3.48 seconds
Below is the SAS lst file extract from the LASR load process:
The LASR Procedure Performance Information
Host Node sasserver01
Execution Mode Distributed
Number of Compute Nodes 3
Data Access Information
Data Engine Role Path
HDFS.MEAD_CUST_ORDER_ORC HADOOP Input Parallel, Asymmetric
Reference documents SAS(R) 9.4 In-Database Products Administrator’s Guide , Seventh Edition