Amazon S3 is one of the most widely used platforms for storing big data. It is a secure, reliable, scalable, and affordable environment for storing huge amounts of data. Users can store data files in various formats at S3 locations from different applications. SAS users often ask whether SAS and Viya (CAS) applications can read and write Parquet, Avro, ORC, and other data files stored at S3 locations. This post describes how to read various data files stored at an S3 location into SAS and CAS using AWS EMR.
Amazon offers a service-based Hadoop environment, AWS EMR (Amazon Web Services Elastic MapReduce), where users can read and write data files in various formats at S3 locations. Users can create a Hive table with data files located on S3 and process them using the Hadoop infrastructure. A Hive table is created with the s3a:// protocol so that the S3 data files can be accessed from a non-EMR process (read: the SAS Embedded Process).
When Hive tables are available on AWS EMR, BASE SAS can access them using the SAS/ACCESS Interface to Hadoop component. CAS can access these Hive tables by using the Data Connector and Data Connect Accelerator. This means users can use either the serial or the parallel data load method to load a Hive table into CAS.
The following diagram describes the data access from BASE SAS to S3 via AWS EMR.
The following diagram describes the data access from CAS to S3 via AWS EMR.
The following sections describe a step-by-step process to access S3 files from CAS and BASE SAS.
Prerequisites:
The following screenshots show an S3 bucket and folders with CSV and Parquet files, which will be read into SAS and CAS in the subsequent steps.
S3 Bucket and folder with CSV file:
S3 Bucket and folder with Parquet file:
To read a data file stored on S3, you must know the file structure in order to formulate a CREATE TABLE statement. The following SQL statements can be used to create Hive tables for the CSV and Parquet files shown above. Notice that the data location protocol 's3a://' is used in the SQL statements. Save the script file on your local desktop (C: drive) so it can be uploaded to an S3 location.
CREATE EXTERNAL TABLE baseball_S3(
name varchar(18),
team varchar(14),
natbat double,
nhits double,
nhome double,
nruns double,
nrbi double,
nbb double,
yrmajor double,
cratbat double,
crhits double,
crhome double,
crruns double,
crrbi double,
crbb double,
league varchar(8),
division varchar(8),
position varchar(8),
nouts double,
nassts double,
nerror double,
salary double,
div_1 varchar(16),
logsalary double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
LINES TERMINATED BY '\n'
STORED AS parquet
LOCATION 's3a://sas-viyadeploymentworkshop/gel/Data/baseball_prqt/' ;
CREATE EXTERNAL TABLE car_S3(
make varchar(13),
model varchar(40),
type varchar(8),
origin varchar(6),
drivetrain varchar(5),
msrp double,
invoice double,
enginesize double,
cylinders double,
horsepower double,
mpg_city double,
mpg_highway double,
weight double,
wheelbase double,
length double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://sas-viyadeploymentworkshop/gel/Data/cars/';
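Formulating the CREATE TABLE statement by hand gets error-prone when a file has many columns. As an illustration only (this helper is not part of the original workflow, and the column list is a shortened placeholder), a small script can assemble the DDL from a column specification:

```python
# Illustrative helper (not from the original workflow): build a Hive
# CREATE EXTERNAL TABLE statement from a (name, type) column list.
def hive_ddl(table, columns, location, stored_as="TEXTFILE", field_delim=","):
    cols = ",\n".join(f"  {name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE {table}(\n{cols}\n)\n"
        f"ROW FORMAT DELIMITED\n"
        f"FIELDS TERMINATED BY '{field_delim}'\n"
        f"STORED AS {stored_as}\n"
        f"LOCATION '{location}';"
    )

# Shortened column list for illustration; the real table has more columns.
ddl = hive_ddl(
    "car_s3",
    [("make", "varchar(13)"), ("model", "varchar(40)"), ("msrp", "double")],
    "s3a://sas-viyadeploymentworkshop/gel/Data/cars/",
)
print(ddl)
```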
You need to upload the hsql file to an S3 location accessible from the EMR cluster. The script file will be executed in an "EMR step" to create the Hive tables. The following screenshot shows the saved hsql file at the S3 location.
To create the Hive tables from the saved hsql file, you need to add an EMR step. Navigate to the available EMR cluster and add a step with Step type=Hive Program. Provide the S3 location of the hsql file along with the input and output S3 locations. The following screenshot shows the creation of an EMR step using the hsql file saved at S3.
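The same step can be submitted programmatically instead of through the console. The sketch below (an assumption, not part of the original walkthrough: the script location and cluster ID are placeholders, and submitting via boto3 is left commented out) builds the step definition one would pass to the EMR API:

```python
# Sketch of an EMR Hive step definition. The S3 path is a hypothetical
# placeholder; command-runner.jar with "hive-script" is the usual way to
# run a Hive script as a step on recent EMR releases.
script_s3 = "s3://my-bucket/scripts/create_hive_tables.hsql"  # placeholder

step = {
    "Name": "Create Hive tables over S3 data",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["hive-script", "--run-hive-script", "--args", "-f", script_s3],
    },
}
print(step["Name"])

# To actually submit (requires boto3 and a real cluster ID):
# import boto3
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```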
The EMR step will execute in a few seconds, changing its status from pending to running to completed. If there is any error, you can view the details by expanding the step properties.
You can verify the Hive tables and their properties from the Hive prompt. Log in to the EMR master node as user "hadoop" and verify the table properties at the Hive prompt. You can also use beeline to verify the Hive tables. The following Hive SQL statements and output show the Hive table with an external data file stored at S3 and its file format. Notice the data file location and the SerDe attached to the table.
[hadoop@ip-10-249-20-173 ~]$ hive
hive> show tables;
OK
baseball_s3
car_s3
Time taken: 0.689 seconds, Fetched: 2 row(s)
hive> desc formatted baseball_s3 ;
OK
# col_name data_type comment
name varchar(18)
team varchar(14)
natbat double
nhits double
nhome double
nruns double
nrbi double
nbb double
yrmajor double
cratbat double
crhits double
crhome double
crruns double
crrbi double
crbb double
league varchar(8)
division varchar(8)
position varchar(8)
nouts double
nassts double
nerror double
salary double
div_1 varchar(16)
logsalary double
# Detailed Table Information
Database: default
Owner: hadoop
CreateTime: Wed Sep 04 17:53:52 UTC 2019
LastAccessTime: UNKNOWN
Retention: 0
Location: s3a://sas-viyadeploymentworkshop/gel/Data/baseball_prqt
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
numFiles 2
totalSize 77820
transient_lastDdlTime 1567619632
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \u0001
line.delim \n
serialization.format \u0001
Time taken: 0.132 seconds, Fetched: 53 row(s)
hive>
hive> select * from baseball_s3 limit 5 ;
OK
Allanson, Andy Cleveland 293.0 66.0 1.0 30.0 29.0 14.0 1.0 293.0 66.0 1.0 30.0 29.0 14.0 American East C 446.0 33.0 20.0 NULL NULL NULL
Ashby, Alan Houston 315.0 81.0 7.0 24.0 38.0 39.0 14.0 3449.0 835.0 69.0 321.0 414.0 375.0 National West C 632.0 43.0 10.0 475.0 NULL 6.163314804034641
Davis, Alan Seattle 479.0 130.0 18.0 66.0 72.0 76.0 3.0 1624.0 457.0 63.0 224.0 266.0 263.0 American West 1B 880.0 82.0 14.0 480.0 NULL 6.173786103901937
Dawson, Andre Montreal 496.0 141.0 20.0 65.0 78.0 37.0 11.0 5628.0 1575.0 225.0 828.0 838.0 354.0 National East RF 200.0 11.0 3.0 500.0 NULL 6.214608098422191
Galarraga, Andres Montreal 321.0 87.0 10.0 39.0 42.0 30.0 2.0 396.0 101.0 12.0 48.0 46.0 33.0 National East 1B 805.0 40.0 4.0 91.5 NULL 4.516338972281476
Time taken: 0.182 seconds, Fetched: 5 row(s)
hive>
With SAS/ACCESS installed and valid Hadoop JARs and configuration files placed on the Viya servers, you can access the Hive tables using the following statements. Notice the Hadoop JAR path and configuration location. The server="emrhdp1" used in the LIBNAME statement is an alias for the EMR cluster master node.
Code:
options set=SAS_HADOOP_JAR_PATH="/opt/sas/viya/config/data/EMRHadoopData/lib";
options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/viya/config/data/EMRHadoopData/conf";
libname hivelib1 hadoop user=hadoop server="emrhdp1";
Log:
76 options set=SAS_HADOOP_JAR_PATH="/opt/sas/viya/config/data/EMRHadoopData/lib";
77 options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/viya/config/data/EMRHadoopData/conf";
78
79 libname hivelib1 hadoop user=hadoop server="emrhdp1";
NOTE: Libref HIVELIB1 was successfully assigned as follows:
Engine: HADOOP
Physical Name: jdbc:hive2://emrhdp1:10000/default
80
Data can be viewed by opening the table in the SAS Studio editor.
With the Data Connector installed and valid Hadoop JARs and configuration files placed on the CAS servers, you can serially load an EMR Hive table into CAS using the following statements. Notice that the baseball Hive table's underlying data is a Parquet file in an S3 bucket.
CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);
caslib cashive datasource=(srctype="hadoop", server="emrhdp1",
username="hadoop",
hadoopconfigdir="/opt/sas/viya/config/data/EMRHadoopData/conf",
hadoopjarpath="/opt/sas/viya/config/data/EMRHadoopData/lib");
proc casutil;
list files incaslib="cashive";
quit ;
/* load a Hive table to CAS*/
proc casutil;
load casdata="BASEBALL_S3" casout="BASEBALL_S3" outcaslib="cashive" replace ;
quit ;
proc casutil;
list tables incaslib="cashive";
quit ;
cas mysession terminate;
Log extract:
…………
……..
87 /* load a Hive table to CAS*/
88 proc casutil;
NOTE: The UUID '85e0b553-4927-db48-bd2b-83dcc61d1e2b' is connected using session MYSESSION.
89 load casdata="BASEBALL_S3" casout="BASEBALL_S3" outcaslib="cashive" replace ;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing serial LoadTable action using SAS Data Connector to Hadoop.
NOTE: Cloud Analytic Services made the external data from BASEBALL_S3 available as table BASEBALL_S3 in caslib cashive.
NOTE: Action 'table.loadTable' used (Total process time):
…..
……………..
Result extract:
The SAS EP process loads EMR Hive table data into CAS in parallel. The SAS EP is a non-EMR process and requires the AWS_ACCESS_KEY and AWS_SECRET_KEY to access the S3 data files using the s3a:// protocol. The SAS EP also requires the hive and hive-hcatalog binary folders on each EMR node to process the S3 data files. By default, the hive and hive-hcatalog binary folders are available on the EMR master node only. To support the SAS EP process, you need to manually copy the /usr/lib/hive and /usr/lib/hive-hcatalog folders from the master node to the rest of the EMR nodes.
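The manual copy above can be scripted. The sketch below is illustrative only: the node hostnames are hypothetical placeholders, and the rsync commands are printed for review rather than executed, so you can adapt them to your cluster before running anything.

```python
# Sketch only: assemble the commands for distributing the Hive binaries
# from the EMR master node to the other nodes. Hostnames are placeholders;
# commands are printed, not executed, so they can be reviewed first.
core_nodes = ["ip-10-249-20-174", "ip-10-249-20-175"]  # hypothetical nodes
folders = ["/usr/lib/hive", "/usr/lib/hive-hcatalog"]

commands = [
    f"rsync -a {folder}/ hadoop@{node}:{folder}/"
    for node in core_nodes
    for folder in folders
]
for cmd in commands:
    print(cmd)
```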
Prerequisites:
The AWS access key information is submitted to the SAS EP via the core-site.xml file located on the SAS client machines (CAS servers). You need to add the following properties, with valid values, to the core-site.xml file on all CAS servers.
<property>
  <name>fs.s3a.fast.upload</name>
  <value>false</value>
</property>
<property>
  <name>fs.s3a.buffer.dir</name>
  <value>/tmp/s3a</value>
</property>
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.us-east-1.amazonaws.com</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>"ASIA2X5ZP3TPNUR2"</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>"Jm4tqLQv4n1X64RxtIELjBdljAt+onC"</value>
</property>
<property>
  <name>fs.s3a.session.token</name>
  <value>"FQoGZXIvYXdzEOD//////////wEaDAAMZonhOGERv+ /+YfBMw3N3GZ5VwjlFnR2XEfJGBX3ijQpprrBQ=="</value>
</property>
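Since the same properties must land on every CAS server, it can help to generate the XML fragment rather than hand-edit each file. The sketch below is illustrative only: the credential values are placeholders to be replaced with your own keys and token.

```python
# Illustrative generator for the s3a properties to add to core-site.xml
# on each CAS server; credential values below are placeholders.
import xml.etree.ElementTree as ET

props = {
    "fs.s3a.fast.upload": "false",
    "fs.s3a.buffer.dir": "/tmp/s3a",
    "fs.s3a.aws.credentials.provider":
        "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
    "fs.s3a.endpoint": "s3.us-east-1.amazonaws.com",
    "fs.s3a.access.key": "YOUR_ACCESS_KEY",
    "fs.s3a.secret.key": "YOUR_SECRET_KEY",
    "fs.s3a.session.token": "YOUR_SESSION_TOKEN",
}

config = ET.Element("configuration")
for name, value in props.items():
    prop = ET.SubElement(config, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value

print(ET.tostring(config, encoding="unicode"))
```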
With the Data Connect Accelerator installed and valid Hadoop JARs and configuration files placed on the CAS servers, you can parallel load an S3-based Hive table into CAS using the following statements.
CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);
caslib cashive datasource=(srctype="hadoop", server="emrhdp1",
username="hadoop", dataTransferMode="parallel",
hadoopconfigdir="/opt/sas/viya/config/data/EMRHadoopData/conf",
hadoopjarpath="/opt/sas/viya/config/data/EMRHadoopData/lib");
proc casutil;
list files incaslib="cashive";
quit ;
/* load a S3-Based Hive table to CAS*/
proc casutil;
load casdata="BASEBALL_S3" casout="BASEBALL_S3" incaslib="cashive" outcaslib="cashive" replace ;
quit ;
proc casutil;
list tables incaslib="cashive";
quit ;
cas mysession terminate;
Log extract:
…..
…………..
86
87 /* load a S3-Based Hive table to CAS*/
88 proc casutil;
NOTE: The UUID '52638f51-9578-0240-9149-ac7322de9e77' is connected using session MYSESSION.
89 load casdata="BASEBALL_S3" casout="BASEBALL_S3" incaslib="cashive" outcaslib="cashive" replace ;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing parallel LoadTable action using SAS Data Connect Accelerator for Hadoop.
NOTE: Cloud Analytic Services made the external data from BASEBALL_S3 available as table BASEBALL_S3 in caslib cashive.
….
………..
Result extract:
To access S3 data files in the various file formats supported by Hadoop, you must know the structure of the file and the SerDe to use to read it. If you can create a Hive table on top of the S3 data files and read records using the native Hive engine, the same Hive table can be read into BASE SAS and loaded into CAS using the serial or parallel data load method.
Stay tuned for my next post on writing SAS and CAS data to S3 (Parquet, Avro, ORC, …) using AWS EMR.
Many thanks to David Ghazaleh for sharing information and knowledge on this topic.