SAS works with Hadoop in several ways, through specific products and processes. And the intersections of SAS and Hadoop are growing each day. So, my SAS colleagues and I will post a series of articles on the Data Management Community devoted to various areas of SAS and Hadoop integration. This initial article outlines techniques used to persist native SAS data sets (sas7bdat) to Hadoop (Hive) and store them in a variety of formats.
Storing SAS data sets as text files
Writing native SAS data sets to Hive and storing them in a user-defined text format can be accomplished by using the COLUMN_DELIMITER= option. This option is valid in DATA and PROC steps when accessing Hadoop through the SAS/ACCESS to Hadoop engine. The following examples illustrate how to write a SAS data set to Hive as a pipe-delimited text file using a DATA step and PROC SQL. The ‘hivelib’ libref references a location in Hive that has been established with your Hadoop LIBNAME statement. The ‘locallib’ libref references a location on your local SAS server.
data hivelib.cars (COLUMN_DELIMITER = '|');
   set locallib.cars;
run;

proc sql;
   create table hivelib.cars (COLUMN_DELIMITER = '|') as
   select * from locallib.cars;
quit;
Viewing the data in Hive reveals that the columns are now separated by a pipe character.
Acura| 3.5 RL w/Navigation 4dr|Sedan|Asia|Front|46100|41100|3.5|6|225|18|24|3893|115|197|
Acura| NSX coupe 2dr manual S|Sports|Asia|Rear|89765|79978|3.200000000000000177636|6|290|17|24|3153|100|174|
Audi| A4 1.8T 4dr|Sedan|Europe|Front|25940|23508|1.800000000000000044409|4|170|22|31|3252|104|179|
Audi| A4 1.8T convertible 2dr|Sedan|Europe|Front|35940|32506|1.800000000000000044409|4|170|23|30|3638|105|180|
The delimiter must be a single character or a three-digit ASCII value between 001 and 127. If the COLUMN_DELIMITER= option is not specified, the default delimiter used by SAS is Ctrl-A. When viewing the data in Hive, that separator is displayed as ^A.
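Following that syntax, a non-printable delimiter such as a tab can be specified by its three-digit ASCII value. This is a minimal sketch, assuming the same ‘hivelib’ and ‘locallib’ librefs defined earlier:

```sas
/* Write a tab-delimited text file to Hive: tab is ASCII value 009 */
data hivelib.cars_tab (COLUMN_DELIMITER = '009');
   set locallib.cars;
run;
```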
Storing SAS data sets as other file types
Writing native SAS data sets to Hive and storing them as different file types can be accomplished by using the DBCREATE_TABLE_OPTS= option along with an appropriate DBMS-SQL-clause. This option is valid on the LIBNAME statement or in DATA and PROC steps when accessing Hadoop using the SAS/ACCESS to Hadoop engine. The following examples illustrate how to use this option.
Include the DBCREATE_TABLE_OPTS= option on the LIBNAME statement. Execute a DATA step or PROC SQL to write data to Hive and store it as a sequence file.
libname hivelib hadoop user=xxxxxx database=default
   DBCREATE_TABLE_OPTS='STORED AS SEQUENCEFILE';

proc sql;
   create table hivelib.cars_seq as
   select * from locallib.cars;
quit;
Alternatively, the option can be placed on the DATA step or the PROC SQL instead of the LIBNAME statement. For example:
data hivelib.cars (DBCREATE_TABLE_OPTS='STORED AS SEQUENCEFILE');
   set locallib.cars;
run;

proc sql;
   create table hivelib.cars (DBCREATE_TABLE_OPTS='STORED AS SEQUENCEFILE') as
   select * from locallib.cars;
quit;
Either of these methods can be used to write SAS data sets to Hive and store them as one of the following file types: TEXTFILE (default Ctrl-A file type), SEQUENCEFILE, ORC, and RCFILE.
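For example, an ORC-stored table can be created the same way. This sketch again assumes the ‘hivelib’ and ‘locallib’ librefs defined earlier:

```sas
/* Store the table in Hive using the ORC file format */
data hivelib.cars_orc (DBCREATE_TABLE_OPTS='STORED AS ORC');
   set locallib.cars;
run;
```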
If you need to write a SAS data set to Hive and store it as a PARQUET file type, use the methods described above, noting that you must be running Hive version 0.14. If you are running an older version of Hive, three steps are required. First, issue an explicit PROC SQL with a CREATE TABLE statement to create the table structure in Hive, setting the data types for all character variables to STRING. Second, in the same PROC SQL, load the table with the data from the native SAS data set using an EXECUTE (INSERT OVERWRITE TABLE) statement. Third, issue a second explicit PROC SQL with an ALTER TABLE statement to change the data types of the character variables from STRING to CHAR(n). This ensures that the appropriate field length is assigned to each character variable when the table is read from Hive; otherwise, SAS applies a default length of 32K to each character variable, resulting in an unnecessarily large output table.
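Those three steps can be sketched with explicit SQL pass-through. This is an illustration only: the table name, column names, and CHAR length are hypothetical, and the connection options must match your own Hive environment.

```sas
proc sql;
   /* Connect explicitly to Hive (connection options are site-specific) */
   connect to hadoop (user=xxxxxx database=default);

   /* Step 1: create the Parquet table, declaring character columns as STRING */
   execute (create table cars_parquet (make string, invoice double)
            stored as parquet) by hadoop;

   /* Step 2: load the table from data already in Hive */
   execute (insert overwrite table cars_parquet
            select make, invoice from cars) by hadoop;
quit;

/* Step 3: change STRING columns to CHAR(n) so SAS assigns the right lengths */
proc sql;
   connect to hadoop (user=xxxxxx database=default);
   execute (alter table cars_parquet
            change make make char(13)) by hadoop;
quit;
```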
When writing a SAS data set to Hive and storing it as an AVRO file type, use the methods described above, and again note that you must be running Hive version 0.14. Older versions of Hive do not support the 'STORED AS AVRO' clause.
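Assuming Hive 0.14 is available, an AVRO-stored table follows the same pattern as the earlier examples (librefs as defined above):

```sas
/* Requires Hive 0.14: store the table in the Avro file format */
data hivelib.cars_avro (DBCREATE_TABLE_OPTS='STORED AS AVRO');
   set locallib.cars;
run;
```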
For further information about the COLUMN_DELIMITER= data set option and the DBCREATE_TABLE_OPTS= LIBNAME and data set option, refer to the SAS/ACCESS 9.4 for Relational Databases documentation.
And don't forget to follow the Data Management section of the SAS Communities Library for more articles on how SAS Data Management works with Hadoop. Here are links to other posts in the series for reference: