Hadoop is the most commonly used data storage and processing environment in the Big Data world. Apart from storing data in plain text formats, users may choose to store it in serialized formats with complex data types. One of the most widely used data serialization formats is Avro, which lets users store nested data structures with complex data types such as STRUCT, ARRAY, and MAP. An Avro data file with complex data types can be created by various applications such as Hive, Spark, Impala, Tez, and Pig.
Now, as SAS users, we ask the question, "Out of the box, can SAS directly read an Avro binary data file with complex data types and load it into CAS?" The answer is no! SAS cannot directly read an Avro binary data file, even if you try using PROC HDMD to generate a schema for the file (without something like a custom reader).
SAS can, however, read an Avro file via Hive (when the Avro data file schema is registered in the Hive catalog). For Avro files with complex data types, the nested data needs to be parsed into separate columns in Hive before it is read into SAS. To do this, you can create a Hive view with parsed columns on top of the Avro Hive table, so that each data element of the nested complex data is read as a separate column.
The following steps describe loading CAS from an Avro data file with a complex data type (STRUCT) using a Hive view.
The following UNIX command extracts the schema from the sample.avro data file and writes it to the sample.avsc file.
$ avro-tools getschema sample.avro > sample.avsc
If you don’t have ‘avro-tools’ in your Hadoop environment, you can download a compatible ‘avro-tools-X.X.X.jar’ file from the Apache site and extract the schema as follows.
$ java -jar ~/avro-tools-X.X.X.jar getschema sample.avro > sample.avsc
The schema and data files are then copied to HDFS, each into its own directory.
$ hadoop fs -put sample.avsc /user/cloud-user/avro_schema/
$ hadoop fs -put sample.avro /user/cloud-user/avro_data/
$ hadoop fs -ls -R /user/cloud-user
drwxr-xr-x - cloud-user hive 0 2018-05-07 16:26 /user/cloud-user/avro_data
-rw-r--r-- 3 cloud-user hive 754261 2018-05-07 16:26 /user/cloud-user/avro_data/sample.avro
drwxr-xr-x - cloud-user hive 0 2018-05-07 16:25 /user/cloud-user/avro_schema
-rw-r--r-- 3 cloud-user hive 10933 2018-05-07 16:25 /user/cloud-user/avro_schema/sample.avsc
With the files in HDFS, register an external Hive table over the Avro data directory, pointing it at the schema file (run the DDL at the hive> prompt, or save it to a file and execute it with hive -f, as is done for the view below).
CREATE EXTERNAL TABLE avro_sample_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/cloud-user/avro_data'
TBLPROPERTIES('avro.schema.url'='hdfs://servername.xxx.xxx:8020/user/cloud-user/avro_schema/sample.avsc');
Describing the table confirms that Hive derived the columns, including the complex STRUCT column, from the Avro schema.
hive> desc avro_sample_table ;
OK
key string
value struct<siteean:string,siteplgkey:string,countrycode:string,
aclgeskey:string,sitemaincode:string,sitefinkey:string,sitedesc:string,address1:string....
hive> select * from avro_sample_table limit 1;
OK
3020180258104
{"siteean":"3020180258104","siteplgkey":"","countrycode":"xx"..........}
Time taken: 0.144 seconds, Fetched: 1 row
hive>
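Individual elements of the STRUCT can already be addressed in Hive with dot notation, which is exactly what the view created below does for every field. For example, using the countrycode element shown in the table description above:
hive> select key, value.countrycode from avro_sample_table limit 1;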
Arrays and other repeating structures (e.g. an entity with multiple phone numbers) will need to be parsed into a flat structure (e.g. Phone1, Phone2, Phone3, …).
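As a sketch only (the phones ARRAY and its phonenumber element are hypothetical and not part of the sample schema), a repeating structure can be flattened in the view by indexing into the array; an out-of-range index simply returns NULL:
create view avro_phone_view as
select
key,
value.phones[0].phonenumber as Phone1,
value.phones[1].phonenumber as Phone2,
value.phones[2].phonenumber as Phone3
from avro_sample_table;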
create view avro_sample_view as
select
key,
value.siteEan,
value.sitePlgKey,
value.countryCode,
value.aclGesKey,
...
value.geoLocation.geoLocType,
value.geoLocation.geoX,
value.geoLocation.geoY,
value.createDate,
value.deleteDate,
...
value.siteIntDesc
from avro_sample_table;
Here we store the view code in a file, ‘hv_view.sql’, and invoke the Hive CLI to create the view in Hive.
$ hive -f hv_view.sql
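From SAS, the view can then be loaded into CAS. The following is a minimal sketch of the load code, assuming an active CAS session and a Hive caslib named HiveEP defined with the SAS Data Connect Accelerator for Hadoop (the caslib definition itself is not shown here):
/* Load the Hive view into CAS through the HiveEP caslib */
proc casutil incaslib="HiveEP" outcaslib="HiveEP";
load casdata="avro_sample_view" casout="avro_sample_view"
outcaslib="HiveEP" incaslib="HiveEP" ;
quit;
The resulting SAS log follows.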
75
75 ! load casdata="avro_sample_view" casout="avro_sample_view"
76 outcaslib="HiveEP" incaslib="HiveEP" ;
NOTE: Executing action 'table.loadTable'.
NOTE: Performing parallel LoadTable action using SAS Data Connect Accelerator for Hadoop.
NOTE: Cloud Analytic Services made the external data from avro_sample_view available as table AVRO_SAMPLE_VIEW in caslib HiveEP.
NOTE: Action 'table.loadTable' used (Total process time):
Table structure of the CAS table loaded from the Hive view avro_sample_view:
Note 1: Separate columns for each data element from the complex STRUCT data type.
On the Hadoop cluster, the data load to CAS executes in two steps: first the data is staged in a temporary location, and then the SAS Embedded Process (EP) feeds it to CAS.