
Writing CAS and SAS Data to S3 (Parquet, Avro, …) files via AWS EMR


Amazon S3 is one of the most widely used platforms for storing big data: a secure, reliable, scalable, and affordable environment for large data sets. Users can store data files in many formats on S3 from different applications. SAS users often ask whether SAS and Viya (CAS) applications can read and write Parquet, Avro, ORC, and similar data files stored on S3. This post describes how to write CAS and SAS data to S3 in various file formats using AWS EMR.

 

Amazon offers a service-based Hadoop environment, AWS EMR (Amazon Web Services Elastic MapReduce), where users can read and write data files in various formats to S3 locations. With access to an S3 bucket, a user can create an external Hive data warehouse (schema/database) whose data files reside on S3. Data saved to a Hive table under such an external database (schema) is written as a file at the S3 location, and while saving the data the user can specify a customized file format (Parquet, Avro, CSV, etc.) for each table. The external database must be created with the s3a:// protocol so that its Hive objects can be accessed and processed by a non-EMR process (read: the SAS Embedded Process, or SAS EP).
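Once a database is anchored at an S3 location, every table created in it writes its data file under that S3 path, and each table can declare its own storage format in its DDL. A minimal Hive sketch (the database, bucket, table, and columns below are hypothetical placeholders, not from a real deployment):

hive> create database mydb location 's3a://my-bucket/path/db/';
hive> create table mydb.events_avro (id int, msg string) stored as avro;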

 

With access to the external EMR Hive database, BASE SAS and CAS can write data as Hive tables (at S3) using the SAS/ACCESS Interface, SAS Data Connector, and SAS Data Connect Accelerator. In other words, a user can write a CAS table to an S3-backed Hive table using either the serial or the parallel method.

 

The following diagram describes the data access from BASE SAS to S3 via AWS EMR.

 

Read_S3_AWSEMR_0.png


 

The following diagram describes the data access from CAS to S3 via AWS EMR.

 

Read_S3_AWSEMR_0_1.png

 

The following steps describe how to write CAS and SAS data to S3 in the various supported file formats.

Prerequisites

  • SAS/ACCESS Interface to Hadoop installed on the Viya compute server.
  • SAS Data Connector to Hadoop installed on the CAS servers for serial read/write.
  • SAS Data Connect Accelerator for Hadoop installed on the CAS servers for parallel read/write.
  • SAS EP installed on the EMR Hadoop cluster for parallel read/write.
  • A valid AWS access key to access the S3 bucket.
  • Read/write permission on the S3 bucket.
  • The EMR Hadoop JARs and configuration files available on the Viya and CAS servers.

The following screenshot shows an example of the S3 bucket and folder where the CAS and SAS tables will be written in various file formats.

 

Destination S3 Bucket and folder:

 

Write_S3_AWSEMR_1.png

Steps

1. Create an external Hive database at an S3 location.

To write CAS and SAS table data to an S3 location, the user needs to create an external Hive database with its data files at S3. The following SQL statement can be used to create such a database. Notice that the location uses the 's3a://' protocol. The SQL statement can be executed either as an EMR Step or at the Hive prompt on the EMR master node.

 

SQL:

 

$ hive
hive> create database geltest location 's3a://sas-viyadeploymentworkshop/gel/Data/db/';
hive> describe database geltest;

 

Log:

 

hive> create database geltest location 's3a://sas-viyadeploymentworkshop/gel/Data/db/';
OK
Time taken: 2.878 seconds

hive> describe database geltest ; 
OK
geltest         s3a://sas-viyadeploymentworkshop/gel/Data/db    hadoop  USER
Time taken: 0.018 seconds, Fetched: 1 row(s)
hive>

2. Save a BASE SAS table to an EMR Hive table (S3).

With SAS/ACCESS Interface to Hadoop installed and valid Hadoop JARs and configuration files placed on the Viya compute servers, you can save a BASE SAS table to EMR Hive. The following statements save a SAS table to EMR Hive with the data file at S3. Notice the LIBNAME statement with the schema= and DBCREATE_TABLE_OPTS= options, which determine the S3 location and the data file format.

 

Code:

 

options set=SAS_HADOOP_JAR_PATH="/opt/sas/viya/config/data/EMRHadoopData/lib";
options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/viya/config/data/EMRHadoopData/conf";

libname hivelib1 hadoop user=hadoop server="emrhdp1" 
schema="geltest" DBCREATE_TABLE_OPTS="stored as parquet" ;

data hivelib1.cars_s3B;
  set sashelp.cars;
run;

 

Log:

 

74    OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
75    
76    options set=SAS_HADOOP_JAR_PATH="/opt/sas/viya/config/data/EMRHadoopData/lib";
77    options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/viya/config/data/EMRHadoopData/conf";
78    
79    libname hivelib1 hadoop user=hadoop server="emrhdp1" schema="geltest" DBCREATE_TABLE_OPTS="stored as parquet" ;
NOTE: Libref HIVELIB1 was successfully assigned as follows: 
      Engine:        HADOOP 
      Physical Name: jdbc:hive2://emrhdp1:10000/geltest
80    
81    data hivelib1.cars_s3B;
82      set sashelp.cars;
83    run;
NOTE: SAS variable labels, formats, and lengths are not written to DBMS tables.
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set HIVELIB1.CARS_S3B has 428 observations and 15 variables.
NOTE: DATA statement used (Total process time):

 

As a result of saving the new data table to the ‘geltest’ database, a new subfolder with the Hive table name is created under the ‘db’ folder to store the table’s data file.
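You can also verify the result from the Hive prompt without opening the S3 console: the Hive CLI’s dfs command can list the S3 folder directly. A quick sketch, assuming the same bucket path used above:

$ hive
hive> dfs -ls s3a://sas-viyadeploymentworkshop/gel/Data/db/ ;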

 

Destination S3 Bucket and folder with new file:

 

Write_S3_AWSEMR_2.png

3. Serial data save from CAS to an EMR Hive table (S3).

With the Data Connector installed and valid Hadoop JARs and configuration files placed on the CAS servers, you can save a CAS table to an EMR Hive table using the serial method. The following statements perform a serial data save from CAS to Hive with the data file stored at S3. Notice the CASLIB statement with the schema= and dbCreateTableOpts= options, which determine the data file location and format.

 

Code:

 

CAS mySession  SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);

caslib cashive_w datasource=(srctype="hadoop", server="emrhdp1",username="hadoop", 
schema="geltest", dbCreateTableOpts="stored as parquet", 
hadoopconfigdir="/opt/sas/viya/config/data/EMRHadoopData/conf",
hadoopjarpath="/opt/sas/viya/config/data/EMRHadoopData/lib");

/* load a sample table to CAS*/
proc casutil;
   load data=sashelp.cars casout="car_s3" outcaslib="cashive_w" replace ;
quit ;

/* Save a CAS table to EMR Hadoop   */ 
proc casutil  incaslib="cashive_w" outcaslib="cashive_w";
   save casdata="car_s3"  casout= "car_s3_new1" replace ;
quit;

proc casutil;
   list files incaslib="cashive_w";
   list tables incaslib="cashive_w"; 
quit ;

cas mysession terminate;

 

Log extract:

 

…………
……..
7    
88    /* Save a CAS table to EMR Hadoop   */
89    proc casutil  incaslib="cashive_w" outcaslib="cashive_w";
NOTE: The UUID '373e72c3-7ecf-8b48-ba5c-530ac008a6c4' is connected using session MYSESSION.
90       save casdata="car_s3"  casout= "car_s3_new1" replace ;
NOTE: Executing action 'table.save'.
NOTE: Performing serial SaveTable action using SAS Data Connector to Hadoop.
NOTE: Cloud Analytic Services saved the file car_s3_new1 in caslib CASHIVE_W.
NOTE: Action 'table.save' used (Total process time):
…..
……………..

 

As a result of saving the new data table to the ‘geltest’ database, a new subfolder with the Hive table name is created under the ‘db’ folder to store the table’s data file.

 

Destination S3 Bucket and folder with new sub-folder:

 

Write_S3_AWSEMR_3.png

 

Destination S3 Bucket and sub-folder with new file:

 

Write_S3_AWSEMR_4.png

4. Parallel data save from CAS to an EMR Hive table (S3).

The SAS EP process saves a CAS table to an EMR Hive table in parallel. The SAS EP is a non-EMR process and requires an AWS_ACCESS_KEY and AWS_SECRET_KEY to access the S3 data files using the s3a:// protocol. The SAS EP also requires the hive and hive-hcatalog binary folders on each EMR node to process the S3 data files. By default these folders are available on the EMR master node only, so to support the SAS EP process you need to manually copy the /usr/lib/hive and /usr/lib/hive-hcatalog folders from the master node to the rest of the EMR nodes.

 

Prerequisites

  • A valid AWS access key to access S3.
  • The /usr/lib/hive and /usr/lib/hive-hcatalog binary folders copied to all EMR nodes (see the sketch after this list).
  • The core-site.xml file updated with the AWS access keys on the CAS nodes.
  • A Hive database created with the s3a:// protocol to read/write S3 data.
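One way to stage the Hive binaries on the worker nodes is to copy them over SSH from the master node. The following is a minimal sketch, assuming passwordless SSH between the EMR nodes and sudo rights for the 'hadoop' user; the node name emrhdp2 is a hypothetical placeholder, and the copy must be repeated for each core/task node:

$ # Run on the EMR master node as user 'hadoop'; repeat per worker node.
$ scp -rp /usr/lib/hive /usr/lib/hive-hcatalog hadoop@emrhdp2:/tmp/
$ ssh hadoop@emrhdp2 'sudo mv /tmp/hive /tmp/hive-hcatalog /usr/lib/'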

The AWS access key is submitted to the SAS EP via the core-site.xml file located on the SAS client machines (the CAS servers). You need to add the following properties, with valid values, to the core-site.xml file on all CAS servers.

 

<!-- Add these properties to core-site.xml on every CAS server.
     The key, secret, and token values below are redacted examples:
     substitute your own credentials (without surrounding quotes). -->
<property>
  <name>fs.s3a.fast.upload</name>
  <value>false</value>
</property>
<property>
  <name>fs.s3a.buffer.dir</name>
  <value>/tmp/s3a</value>
</property>
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.us-east-1.amazonaws.com</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>ASIA2X5ZP3TPNUR2</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>Jm4tqLQv4n1X64RxtIELjBdljAt+onC</value>
</property>
<property>
  <name>fs.s3a.session.token</name>
  <value>FQoGZXIvYXdzEOD//////////wEaDAAMZonhOGERv+ /+YfBMw3N3GZ5VwjlFnR2XEfJGBX3ijQpprrBQ==</value>
</property>

 

With the Data Connect Accelerator installed and valid Hadoop JARs and configuration files placed on the CAS servers, you can save CAS data to an S3 EMR Hive table in parallel using the following statements. Notice the schema=, dbCreateTableOpts=, and dataTransferMode= options, which determine the file location, file format, and data transfer mode.

 

Code:

 

CAS mySession  SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);

caslib cashive_w datasource=(srctype="hadoop", server="emrhdp1",
username="hadoop",
schema="geltest", 
dbCreateTableOpts="stored as parquet", 
dataTransferMode="parallel",
hadoopconfigdir="/opt/sas/viya/config/data/EMRHadoopData/conf",
hadoopjarpath="/opt/sas/viya/config/data/EMRHadoopData/lib");

/* load a sample table to CAS*/
proc casutil;
   load data=sashelp.cars casout="car_s3" outcaslib="cashive_w" replace ;
quit ;

/* Save a CAS table to EMR Hadoop with data file at S3 location  */ 
proc casutil  incaslib="cashive_w" outcaslib="cashive_w";
   save casdata="car_s3"  casout= "car_s3_new2" replace ;
quit; 

proc casutil;
   list files incaslib="cashive_w";
   list tables incaslib="cashive_w"; 
quit ;

cas mysession terminate;

 

Log extract:

 

…..
…………..
90    
91    /* Save a CAS table to EMR Hadoop with data file at S3 location  */
92    proc casutil  incaslib="cashive_w" outcaslib="cashive_w";
NOTE: The UUID 'bdd90f8b-ea01-a74f-abe5-873990cda0ca' is connected using session MYSESSION.
93       save casdata="car_s3"  casout= "car_s3_new2" replace ;
NOTE: Executing action 'table.save'.
NOTE: Performing parallel SaveTable action using SAS Data Connect Accelerator for Hadoop.
NOTE: Cloud Analytic Services saved the file car_s3_new2 in caslib CASHIVE_W.
NOTE: Action 'table.save' used (Total process time):
….
………..

 

As a result of saving the new data table to the ‘geltest’ database, a new subfolder with the Hive table name is created under the ‘db’ folder to store the table’s data file.

 

Destination S3 Bucket and folder with new sub-folder:

 

Write_S3_AWSEMR_5.png

 

Destination S3 Bucket and sub-folder with new file:

 

Write_S3_AWSEMR_6.png

5. Verify the Hive table's S3 location and file format.

Using your EC2 key, log in to the EMR master node as user 'hadoop' and execute the following SQL at the Hive prompt to verify that the Hive table's data file is located at S3. Notice the data file location and the SerDe library attached to this table.

 

SQL:

 

$ hive
hive> use geltest;
hive> show tables;
hive> desc formatted car_s3_new2;

 

Log:

 

hive> use geltest;
hive> show tables;
OK
car_s3_new1
car_s3_new2
cars_s3b
Time taken: 0.159 seconds, Fetched: 3 row(s)
hive> desc formatted car_s3_new2 ;
OK
# col_name              data_type               comment

make                    char(13)
model                   char(40)
type                    char(6)
origin                  char(6)
drivetrain              char(5)
msrp                    double
invoice                 double
enginesize              double
cylinders               double
horsepower              double
mpg_city                double
mpg_highway             double
weight                  double
wheelbase               double
length                  double

# Detailed Table Information
Database:               geltest
Owner:                  hadoop
CreateTime:             Mon Sep 09 18:41:54 UTC 2019
LastAccessTime:         UNKNOWN
Retention:              0
Location:               s3a://sas-viyadeploymentworkshop/gel/Data/db/car_s3_new2
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        numFiles                1
        numRows                 428
        rawDataSize             6420
        totalSize               26634
        transient_lastDdlTime   1568054571

# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.107 seconds, Fetched: 44 row(s)
hive>
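Alternatively, the SHOW CREATE TABLE statement prints the table's full DDL in one shot, including the STORED AS clause and the s3a:// location; a quick sketch against the same table:

hive> show create table car_s3_new2;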

Summary

To write CAS and SAS data to an S3 location in the various file formats supported by Hadoop, the user must create an external Hive database at an S3 location. While saving a SAS or CAS data table to the S3-backed Hive table, the user can specify the file format (Parquet, ORC, etc.) in the LIBNAME or CASLIB statement.
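Only the table-creation option needs to change to land the data in a different format. The following is a minimal sketch reusing the connection settings from the earlier steps; the library names hivelib2 and cashive_a are hypothetical, and the same pattern applies to the other formats Hive supports:

/* BASE SAS: create new Hive tables as ORC instead of Parquet */
libname hivelib2 hadoop user=hadoop server="emrhdp1"
schema="geltest" DBCREATE_TABLE_OPTS="stored as orc" ;

/* CAS: create new Hive tables as Avro instead of Parquet */
caslib cashive_a datasource=(srctype="hadoop", server="emrhdp1", username="hadoop",
schema="geltest", dbCreateTableOpts="stored as avro",
hadoopconfigdir="/opt/sas/viya/config/data/EMRHadoopData/conf",
hadoopjarpath="/opt/sas/viya/config/data/EMRHadoopData/lib");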

 

Many thanks to David Ghazaleh for sharing information and knowledge on this topic.

