Loading Apache Hive tables stored as Apache ORC™ data to the SAS® LASR™ Analytic Server

With the release of SAS® 9.4 M3, you can parallel load Hive tabular data as non-default file types to the SAS LASR Analytic Server. SAS Embedded Process for Hadoop now reads and processes Hive tables stored as ORC, Parquet, Sequence, Avro, and RC file types. In this post we discuss loading Hive tables to LASR stored as Optimized Row Columnar (ORC) files.

Hadoop Hive ORC file format

ORC is a file format designed for Hadoop workloads. It provides a highly efficient way to store Hive data in a Hadoop Distributed File System (HDFS). By using ORC, you can reduce the size of the original data by as much as 75% and improve Hive querying performance. For more detail about the ORC file format, refer to the following Apache sites: orc.apache.org and cwiki.apache.org

Data load to LASR

ORC Hive table data can be parallel loaded to LASR by using High-Performance Analytics (HPA) PROCs in conjunction with SAS® Embedded Process (in-database component) for Hadoop. SAS Embedded Process for Hadoop uses HCatalog to process complex non-delimited file formats. To access and process ORC Hive tables using SAS Embedded Process, you need a few additional Hadoop JAR files on the SAS client machine at the SAS_HADOOP_JAR_PATH location. The additional Hadoop JAR files are not included in a standard list of JAR files collected by SAS Embedded Process installation steps. You have to manually copy these JAR files from the Hadoop cluster to SAS Compute Server.

Requirements

Hive 11 or later must be installed on all nodes of Hadoop
SAS® 9.4 M3
Additional Hadoop JAR files on the SAS client machine

hive-hcatalog-core-*.jar
hive-webhcat-java-client-*.jar
jdo-api*.jar

Sample code to load LASR:

OPTION SET=GRIDHOST="sasserver01";
OPTION SET=GRIDINSTALLLOC="/opt/sas/TKGrid_3.1/CDH545_REP/TKGrid_REP";
OPTION SET=SAS_HADOOP_JAR_PATH="/opt/sas/thirdparty/Hadoop_Jars/CDH545";
OPTION SET=SAS_HADOOP_CONFIG_PATH="/opt/sas/thirdparty/Hadoop_Config/CDH545";

libname HDFS hadoop user=sasdemo server="sascdh01.race.sas.com" port=10000 SUBPROTOCOL=hive2;
libname lasr_ep sasiola port=54321 tag='HDFS';

proc lasr add port=54321
data=HDFS.mead_cust_order_orc;
run;
……………
……………………..

Suggestion for processing ORC-Hive table

When you store data as ORC-Hive tables, sometimes data files in HDFS become so small that execution of the data processing becomes slower in comparison to standard Hive tables. In fact, the ORC-Hive table processing becomes so slow that your Map Reduce process times out. The reason for jobs timing out could be because the YARN scheduler did not schedule an appropriate number of mapper processes to process the data using default parameters. For better performance using ORC-Hive tables, you must change some of the default parameters in the MapRead configuration. You might need to increase the Map Reduce job time-out parameters as well as a few other parameters in order to schedule a greater number of Mapper processes. The following parameters help when tuning the data processing of ORC-Hive tables:

mapreduce.task.timeout
mapred.max.split.size
mapreduce.input.fileinputformat.split.maxsize
mapreduce.input.fileinputformat.split.minsize

Before changing the above parameters, you must also consider the rest of the processes running in your Hadoop environment. A Hadoop administrator can assign appropriate values for these parameters.

To help illustrate this point, let’s consider an example. Suppose you have a customer information table with ~44 M rows and you have saved this table in Hive as both Text and ORC file formats. When you load to LASR from Text-Hive tables, the load script works without any issues. But when you load LASR from ORC-Hive tables, the load process stops responding and fails. When you look at the Hive metadata table size, you are surprised to see that the customer information table takes up ~8 GB as a Text-Hive table whereas the ORC-Hive table takes up ~100 MB in HDFS. You assume that ORC-Hive tables have smaller footprints in HDFS, so they should load faster and should not hang and fail.

On further investigation, you notice that Hadoop YARN schedules a single mapper process to process this ~100MB ORC-Hive file. In the default MapRead configuration, a single mapper process sounds okay, because the default block size in HDFS is 64 MB and a single mapper can process this ~100 MB file well within the allotted time and resources. However, this single mapper process has to read the ~100 MB file and regenerate ~8 GB of data while loading to LASR. This data re-creation takes time and resources, which makes the process run slower, and eventually it is unable to complete within the default job time-out limit.

To mitigate the issue of loading the ~100MB ORC-Hive table to LASR, you must modify the following parameters with these associated values in the mapred-site.xml file:

mapreduce.task.timeout - 1800000 (30 Min)
mapred.max.split.size - 5000000 (5MB)
mapreduce.input.fileinputformat.split.maxsize - 4000000 (4MB)
mapreduce.input.fileinputformat.split.minsize - 2000000 (2MB)

Below is metadata information from the ORC-Hive table.

….

…..

Location: hdfs://sascdh01.race.sas.com:8020/user/hive/warehouse/mead_cust_order_orc

Table Type: MANAGED_TABLE

Table Parameters:

COLUMN_STATS_ACCURATE true

numFiles 1

numRows 43609840

rawDataSize 23636533280

totalSize 104872385

transient_lastDdlTime 1463500993

# Storage Information

SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde

InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat

Compressed: No

…..

…………….

………

……………..

Location: hdfs://sascdh01.race.sas.com:8020/user/hive/warehouse/mead_cust_order

Table Type: MANAGED_TABLE

Table Parameters:

COLUMN_STATS_ACCURATE true

SAS OS Name Linux

SAS Version 9.04.01M3P06242015

numFiles 1

numRows 43609840

rawDataSize 8363175360

totalSize 8406785200

transient_lastDdlTime 1463519886

# Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat

OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Compressed: No

………….

………………..

Before MapRead parameters were changed for the LASR load job running with a single mapper process:

The mapper process failed with a time-out after 600 seconds (10 minutes):

After MapRead parameters were changed for the LASR load job running with more than one mapper process:

The successful completion of the LASR load job running with more than one mapper process:

Below is the SAS log extract from the LASR load process:

………..

……..

HADOOP_6: Executed: on connection 2

USE `default`

NOTE: The LASR procedure is executing in the distributed computing environment with 3 worker nodes.

NOTE: Data loading into the LASR Analytic Server required 1129.172116 seconds.

NOTE: The data set HDFS.MEAD_CUST_ORDER_ORC.DATA has been added to the LASR Analytic Server. NOTE: The PROCEDURE LASR printed page 1.

NOTE: PROCEDURE LASR used (Total process time):

real time 19:16.66

cpu time 3.48 seconds

……

…………….

Below is the SAS lst file extract from the LASR load process:

The LASR Procedure Performance Information

Host Node sasserver01

Execution Mode Distributed

Number of Compute Nodes 3

Data Access Information

Data Engine Role Path

HDFS.MEAD_CUST_ORDER_ORC HADOOP Input Parallel, Asymmetric

Reference documents SAS(R) 9.4 In-Database Products Administrator’s Guide , Seventh Edition

Loading Apache Hive tables stored as Apache ORC™ data to the SAS® LASR™ Analytic Server

SAS Innovate 2025: Call for Content

Free course: Data Literacy Essentials

Get Started