Amazon S3 is one of the most widely used platforms for storing big data. It is a secure, reliable, scalable, and affordable environment for storing huge amounts of data. Users can store data files in various formats at S3 locations from different applications. SAS users often ask whether SAS and Viya (CAS) applications can read and write Parquet, Avro, ORC, and other data files stored at S3 locations. This post describes how to read various data files stored at an S3 location into SAS and CAS using AWS EMR.
Amazon offers a service-based Hadoop environment, AWS EMR (Amazon Web Services Elastic MapReduce), where users can read and write data files in various formats at S3 locations. Users can create a Hive table with data files located on S3 and process them using the Hadoop infrastructure. A Hive table is created with the s3a:// protocol so that the S3 data files can be accessed from a non-EMR process (read: the SAS Embedded Process).
When Hive tables are available on AWS EMR, BASE SAS can access them using the SAS/ACCESS Interface to Hadoop component. CAS can access these Hive tables by using the Data Connector and Data Connect Accelerator. This means users can use either the serial or the parallel data load method to load a Hive table into CAS.
The following diagram describes the data access from BASE SAS to S3 via AWS EMR.
The following diagram describes the data access from CAS to S3 via AWS EMR.
The following sections describe a step-by-step process to access S3 files from CAS and BASE SAS.
Prerequisites:
The following screenshots show an S3 bucket and folders with CSV and Parquet files, which will be read into SAS and CAS in the subsequent steps.
S3 Bucket and folder with CSV file:
S3 Bucket and folder with Parquet file:
To read a data file stored on S3, you must know the file structure in order to formulate a CREATE TABLE statement. The following SQL statements can be used to create Hive tables for the CSV and Parquet files shown above. Notice that the data location protocol 's3a://' is used in the SQL statements. Save the script file on your local desktop (C: drive) so it can be uploaded to an S3 location.
CREATE EXTERNAL TABLE baseball_S3(
name varchar(18),
team varchar(14),
natbat double,
nhits double,
nhome double,
nruns double,
nrbi double,
nbb double,
yrmajor double,
cratbat double,
crhits double,
crhome double,
crruns double,
crrbi double,
crbb double,
league varchar(8),
division varchar(8),
position varchar(8),
nouts double,
nassts double,
nerror double,
salary double,
div_1 varchar(16),
logsalary double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
LINES TERMINATED BY '\n'
STORED AS parquet
LOCATION 's3a://sas-viyadeploymentworkshop/gel/Data/baseball_prqt/' ;
CREATE EXTERNAL TABLE car_S3(
make varchar(13),
model varchar(40),
type varchar(8),
origin varchar(6),
drivetrain varchar(5),
msrp double,
invoice double,
enginesize double,
cylinders double,
horsepower double,
mpg_city double,
mpg_highway double,
weight double,
wheelbase double,
length double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://sas-viyadeploymentworkshop/gel/Data/cars/';
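Formulating the CREATE TABLE statement by hand gets error-prone when a file has many columns. As an illustration only (this helper is not part of the original workflow, and the column list is a shortened placeholder), a small script can assemble the DDL from a column specification:

```python
# Illustrative helper (not from the original workflow): build a Hive
# CREATE EXTERNAL TABLE statement from a (name, type) column list.
def hive_ddl(table, columns, location, stored_as="TEXTFILE", field_delim=","):
    cols = ",\n".join(f"  {name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE {table}(\n{cols}\n)\n"
        f"ROW FORMAT DELIMITED\n"
        f"FIELDS TERMINATED BY '{field_delim}'\n"
        f"STORED AS {stored_as}\n"
        f"LOCATION '{location}';"
    )

# Shortened column list for illustration; the real table has more columns.
ddl = hive_ddl(
    "car_s3",
    [("make", "varchar(13)"), ("model", "varchar(40)"), ("msrp", "double")],
    "s3a://sas-viyadeploymentworkshop/gel/Data/cars/",
)
print(ddl)
```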
You need to upload the hsql file to an S3 location accessible from the EMR cluster. The script file will be executed in an "EMR step" to create the Hive tables. The following screenshot shows the saved hsql file at the S3 location.
To create the Hive tables from the saved hsql file, you need to add an EMR step. Navigate to the available EMR cluster and add a step with Step type=Hive Program. Provide the S3 location of the hsql file along with the input and output S3 locations. The following screenshot shows the creation of an EMR step using the hsql file saved at S3.
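The same step can be submitted programmatically instead of through the console. The sketch below (an assumption, not part of the original walkthrough: the script location and cluster ID are placeholders, and submitting via boto3 is left commented out) builds the step definition one would pass to the EMR API:

```python
# Sketch of an EMR Hive step definition. The S3 path is a hypothetical
# placeholder; command-runner.jar with "hive-script" is the usual way to
# run a Hive script as a step on recent EMR releases.
script_s3 = "s3://my-bucket/scripts/create_hive_tables.hsql"  # placeholder

step = {
    "Name": "Create Hive tables over S3 data",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["hive-script", "--run-hive-script", "--args", "-f", script_s3],
    },
}
print(step["Name"])

# To actually submit (requires boto3 and a real cluster ID):
# import boto3
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```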
The EMR step will execute in a few seconds, changing its status from pending to running to completed. If there is any error, you can view the details by expanding the step properties.
You can verify the Hive tables and their properties from the Hive prompt. Log in to the EMR master node as user "hadoop" and verify the table properties at the Hive prompt. You can also use beeline to verify the Hive tables. The following Hive SQL statements and output show the Hive table with an external data file stored at S3 and its file format. Notice the data file location and the SerDe attached to the table.
[hadoop@ip-10-249-20-173 ~]$ hive
hive> show tables;
OK
baseball_s3
car_s3
Time taken: 0.689 seconds, Fetched: 2 row(s)
hive> desc formatted baseball_s3 ;
OK
# col_name data_type comment
name varchar(18)
team varchar(14)
natbat double
nhits double
nhome double
nruns double
nrbi double
nbb double
yrmajor double
cratbat double
crhits double
crhome double
crruns double
crrbi double
crbb double
league varchar(8)
division varchar(8)
position varchar(8)
nouts double
nassts double
nerror double
salary double
div_1 varchar(16)
logsalary double
# Detailed Table Information
Database: default
Owner: hadoop
CreateTime: Wed Sep 04 17:53:52 UTC 2019
LastAccessTime: UNKNOWN
Retention: 0
Location: s3a://sas-viyadeploymentworkshop/gel/Data/baseball_prqt
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
numFiles 2
totalSize 77820
transient_lastDdlTime 1567619632
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \u0001
line.delim \n
serialization.format \u0001
Time taken: 0.132 seconds, Fetched: 53 row(s)
hive>
hive> select * from baseball_s3 limit 5 ;
OK
Allanson, Andy Cleveland 293.0 66.0 1.0 30.0 29.0 14.0 1.0 293.0 66.0 1.0 30.0 29.0 14.0 American East C 446.0 33.0 20.0 NULL NULL NULL
Ashby, Alan Houston 315.0 81.0 7.0 24.0 38.0 39.0 14.0 3449.0 835.0 69.0 321.0 414.0 375.0 National West C 632.0 43.0 10.0 475.0 NULL 6.163314804034641
Davis, Alan Seattle 479.0 130.0 18.0 66.0 72.0 76.0 3.0 1624.0 457.0 63.0 224.0 266.0 263.0 American West 1B 880.0 82.0 14.0 480.0 NULL 6.173786103901937
Dawson, Andre Montreal 496.0 141.0 20.0 65.0 78.0 37.0 11.0 5628.0 1575.0 225.0 828.0 838.0 354.0 National East RF 200.0 11.0 3.0 500.0 NULL 6.214608098422191
Galarraga, Andres Montreal 321.0 87.0 10.0 39.0 42.0 30.0 2.0 396.0 101.0 12.0 48.0 46.0 33.0 National East 1B 805.0 40.0 4.0 91.5 NULL 4.516338972281476
Time taken: 0.182 seconds, Fetched: 5 row(s)
hive>
With SAS/ACCESS installed and valid Hadoop JARs and configuration files placed on the Viya servers, you can access the Hive tables using the following statements. Notice the Hadoop JAR path and configuration location. The server="emrhdp1" used in the LIBNAME statement is an alias for the EMR cluster master node.
Code:
options set=SAS_HADOOP_JAR_PATH="/opt/sas/viya/config/data/EMRHadoopData/lib";
options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/viya/config/data/EMRHadoopData/conf";
libname hivelib1 hadoop user=hadoop server="emrhdp1";
Log:
76 options set=SAS_HADOOP_JAR_PATH="/opt/sas/viya/config/data/EMRHadoopData/lib";
77 options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/viya/config/data/EMRHadoopData/conf";
78
79 libname hivelib1 hadoop user=hadoop server="emrhdp1";
NOTE: Libref HIVELIB1 was successfully assigned as follows:
Engine: HADOOP
Physical Name: jdbc:hive2://emrhdp1:10000/default
80
Data can be viewed by opening the table in the SAS Studio editor.
With the Data Connector installed and valid Hadoop JARs and configuration files placed on the CAS servers, you can serially load an EMR Hive table into CAS using the following statements. Notice that the baseball Hive table's underlying data is a Parquet file in an S3 bucket.
CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);
caslib cashive datasource=(srctype="hadoop", server="emrhdp1",
username="hadoop",
hadoopconfigdir="/opt/sas/viya/config/data/EMRHadoopData/conf",
hadoopjarpath="/opt/sas/viya/config/data/EMRHadoopData/lib");
proc casutil;
list files incaslib="cashive";
quit ;
/* load a Hive table to CAS*/
proc casutil;
load casdata="BASEBALL_S3" casout="BASEBALL_S3" outcaslib="cashive" replace ;
quit ;
proc casutil;
list tables incaslib="cashive";
quit ;
cas mysession terminate;
Log extract:
…………
……..
87 /* load a Hive table to CAS*/
88 proc casutil;
NOTE: The UUID '85e0b553-4927-db48-bd2b-83dcc61d1e2b' is connected using session MYSESSION.
89 load casdata="BASEBALL_S3" casout="BASEBALL_S3" outcaslib="cashive" replace ;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing serial LoadTable action using SAS Data Connector to Hadoop.
NOTE: Cloud Analytic Services made the external data from BASEBALL_S3 available as table BASEBALL_S3 in caslib cashive.
NOTE: Action 'table.loadTable' used (Total process time):
…..
……………..
Result extract:
The SAS EP process loads EMR Hive table data into CAS in parallel. The SAS EP is a non-EMR process and requires the AWS_ACCESS_KEY and AWS_SECRET_KEY to access the S3 data files using the s3a:// protocol. The SAS EP also requires the hive and hive-hcatalog binary folders on each EMR node to process the S3 data files. By default, the hive and hive-hcatalog binary folders are available on the EMR master node only. To support the SAS EP process, you need to manually copy the /usr/lib/hive and /usr/lib/hive-hcatalog folders from the master node to the rest of the EMR nodes.
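The manual copy above can be scripted. The sketch below is illustrative only: the node hostnames are hypothetical placeholders, and the rsync commands are printed for review rather than executed, so you can adapt them to your cluster before running anything.

```python
# Sketch only: assemble the commands for distributing the Hive binaries
# from the EMR master node to the other nodes. Hostnames are placeholders;
# commands are printed, not executed, so they can be reviewed first.
core_nodes = ["ip-10-249-20-174", "ip-10-249-20-175"]  # hypothetical nodes
folders = ["/usr/lib/hive", "/usr/lib/hive-hcatalog"]

commands = [
    f"rsync -a {folder}/ hadoop@{node}:{folder}/"
    for node in core_nodes
    for folder in folders
]
for cmd in commands:
    print(cmd)
```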
Prerequisites:
The AWS access key information is submitted to the SAS EP via the core-site.xml file located on the SAS client machines (CAS servers). You need to add the following properties, with valid values, to the core-site.xml file on all CAS servers.
<property>
  <name>fs.s3a.fast.upload</name>
  <value>false</value>
</property>
<property>
  <name>fs.s3a.buffer.dir</name>
  <value>/tmp/s3a</value>
</property>
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.us-east-1.amazonaws.com</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>"ASIA2X5ZP3TPNUR2"</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>"Jm4tqLQv4n1X64RxtIELjBdljAt+onC"</value>
</property>
<property>
  <name>fs.s3a.session.token</name>
  <value>"FQoGZXIvYXdzEOD//////////wEaDAAMZonhOGERv+ /+YfBMw3N3GZ5VwjlFnR2XEfJGBX3ijQpprrBQ=="</value>
</property>
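Since the same properties must land on every CAS server, it can help to generate the XML fragment rather than hand-edit each file. The sketch below is illustrative only: the credential values are placeholders to be replaced with your own keys and token.

```python
# Illustrative generator for the s3a properties to add to core-site.xml
# on each CAS server; credential values below are placeholders.
import xml.etree.ElementTree as ET

props = {
    "fs.s3a.fast.upload": "false",
    "fs.s3a.buffer.dir": "/tmp/s3a",
    "fs.s3a.aws.credentials.provider":
        "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
    "fs.s3a.endpoint": "s3.us-east-1.amazonaws.com",
    "fs.s3a.access.key": "YOUR_ACCESS_KEY",
    "fs.s3a.secret.key": "YOUR_SECRET_KEY",
    "fs.s3a.session.token": "YOUR_SESSION_TOKEN",
}

config = ET.Element("configuration")
for name, value in props.items():
    prop = ET.SubElement(config, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value

print(ET.tostring(config, encoding="unicode"))
```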
With the Data Connect Accelerator installed and valid Hadoop JARs and configuration files placed on the CAS servers, you can parallel load an S3-based Hive table into CAS using the following statements.
CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);
caslib cashive datasource=(srctype="hadoop", server="emrhdp1",
username="hadoop", dataTransferMode="parallel",
hadoopconfigdir="/opt/sas/viya/config/data/EMRHadoopData/conf",
hadoopjarpath="/opt/sas/viya/config/data/EMRHadoopData/lib");
proc casutil;
list files incaslib="cashive";
quit ;
/* load a S3-Based Hive table to CAS*/
proc casutil;
load casdata="BASEBALL_S3" casout="BASEBALL_S3" incaslib="cashive" outcaslib="cashive" replace ;
quit ;
proc casutil;
list tables incaslib="cashive";
quit ;
cas mysession terminate;
Log extract:
…..
…………..
86
87 /* load a S3-Based Hive table to CAS*/
88 proc casutil;
NOTE: The UUID '52638f51-9578-0240-9149-ac7322de9e77' is connected using session MYSESSION.
89 load casdata="BASEBALL_S3" casout="BASEBALL_S3" incaslib="cashive" outcaslib="cashive" replace ;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing parallel LoadTable action using SAS Data Connect Accelerator for Hadoop.
NOTE: Cloud Analytic Services made the external data from BASEBALL_S3 available as table BASEBALL_S3 in caslib cashive.
….
………..
Result extract:
To access S3 data files in the various file formats supported by Hadoop, you must know the structure of the file and the SerDe to use to read it. If you can create a Hive table on top of the S3 data files and read records using the native Hive engine, the same Hive table can be read into BASE SAS and loaded into CAS using the serial or parallel data load method.
Stay tuned for my next post on writing SAS and CAS data to S3 (Parquet, Avro, ORC, …) using AWS EMR.
Many thanks to David Ghazaleh for sharing information and knowledge on this topic.