SAS works with Hadoop in several ways, through specific products and processes. And the intersections of SAS and Hadoop are growing each day. So, my SAS colleagues and I will post a series of articles on the Data Management Community devoted to various areas of SAS and Hadoop integration. This initial article outlines techniques used to persist native SAS data sets (sas7bdat) to Hadoop (Hive) and store them in a variety of formats.
Storing SAS data sets as text files
Writing native SAS data sets to Hive and storing them in a user-defined text format can be accomplished by using the COLUMN_DELIMITER= option. This option is valid in DATA and PROC steps when accessing Hadoop using the SAS/ACCESS to Hadoop engine. The following examples illustrate how to write a SAS data set to Hive as a pipe-delimited text file using a DATA step and PROC SQL. The ‘hivelib’ libref references a location in Hive that has been established with your HADOOP LIBNAME statement. The ‘locallib’ libref references a location on your local SAS server.
data hivelib.cars (COLUMN_DELIMITER = '|');
set locallib.cars;
run;
OR
proc sql;
create table hivelib.cars (COLUMN_DELIMITER = '|') as
select * from locallib.cars;
quit;
Viewing the data in Hive reveals the columns, which are now separated by a pipe (|).
Acura| 3.5 RL w/Navigation 4dr|Sedan|Asia|Front|46100|41100|3.5|6|225|18|24|3893|115|197|
Acura| NSX coupe 2dr manual S|Sports|Asia|Rear|89765|79978|3.200000000000000177636|6|290|17|24|3153|100|174|
Audi| A4 1.8T 4dr|Sedan|Europe|Front|25940|23508|1.800000000000000044409|4|170|22|31|3252|104|179|
Audi| A4 1.8T convertible 2dr|Sedan|Europe|Front|35940|32506|1.800000000000000044409|4|170|23|30|3638|105|180|
The delimiter must be a single character or a three-digit ASCII value between 001 and 127. If the COLUMN_DELIMITER= option is not specified, the default delimiter used by SAS is Ctrl-A. When viewing the data in Hive, the separator will be shown as ^A.
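For example, a tab delimiter can be specified with its three-digit ASCII value. The sketch below assumes the same librefs as above; the table name ‘cars_tab’ is an illustrative assumption.
/* Write the data set to Hive as a tab-delimited text table (ASCII 009 = tab). */
data hivelib.cars_tab (COLUMN_DELIMITER = '009');
set locallib.cars;
run;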
Storing SAS data sets as other file types
Writing native SAS data sets to Hive and storing them as different file types can be accomplished by using the DBCREATE_TABLE_OPTS= option along with an appropriate DBMS-SQL-clause. This option is valid on the LIBNAME statement or in DATA and PROC steps when accessing Hadoop using the SAS/ACCESS to Hadoop engine. The following examples illustrate how to use this option.
Include the DBCREATE_TABLE_OPTS= option on the LIBNAME statement. Execute a DATA step or PROC SQL to write data to Hive and store it as a sequence file.
libname hivelib hadoop user=xxxxxx database=default
server="server.yourcompany.com"
subprotocol=hive2
config="/merged_xml_file.xml"
DBCREATE_TABLE_OPTS='STORED AS SEQUENCEFILE';
data hivelib.cars_seq;
set locallib.cars;
run;
OR
proc sql;
create table hivelib.cars_seq as
select * from locallib.cars;
quit;
Alternatively, the option can be placed on the DATA step or the PROC SQL instead of the LIBNAME statement. For example:
data hivelib.cars (DBCREATE_TABLE_OPTS='STORED AS SEQUENCEFILE');
set locallib.cars;
run;
OR
proc sql;
create table hivelib.cars (DBCREATE_TABLE_OPTS='STORED AS SEQUENCEFILE') as select * from locallib.cars;
quit;
Either of these methods can be used to write SAS data sets to Hive and store them as one of the following file types: TEXTFILE (the default, delimited with Ctrl-A), SEQUENCEFILE, ORC, or RCFILE.
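For instance, the ORC case follows the same pattern as the SEQUENCEFILE example above; the table name ‘cars_orc’ is an assumed example.
/* Store the table in Hive using the ORC file format. */
data hivelib.cars_orc (DBCREATE_TABLE_OPTS='STORED AS ORC');
set locallib.cars;
run;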
If you need to write a SAS data set to Hive and store it as a PARQUET file type, the methods described above work as long as you are running Hive version 0.14. If you are running an older version of Hive, a three-step workaround is required. First, issue an explicit PROC SQL with a CREATE TABLE statement to create the table structure in Hive, setting the data types for all character variables to STRING. Second, in the same PROC SQL, load the table with the data from the native SAS data set using an EXECUTE (INSERT OVERWRITE TABLE) statement. Finally, issue a second explicit PROC SQL with an ALTER TABLE statement to change the character variables from STRING to CHAR(n). This ensures the appropriate field length is assigned to each character variable when reading the table back from Hive; otherwise SAS applies a default length of 32k to each character variable, resulting in an unnecessarily large output table.
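The following is a minimal sketch of that workaround. The staging table ‘cars_stage’, the target table ‘cars_parquet’, the column list (make, model, msrp), and the CHAR lengths are illustrative assumptions; substitute the columns, lengths, and storage clause appropriate to your own data set and Hive release.
/* Step 1: land the SAS data set in Hive as a staging table (default text format). */
data hivelib.cars_stage;
set locallib.cars;
run;
/* Steps 2 and 3: explicit pass-through SQL to create and load the Parquet table. */
proc sql;
connect to hadoop (user=xxxxxx server="server.yourcompany.com" subprotocol=hive2);
execute (create table cars_parquet (make string, model string, msrp double)
         stored as parquet) by hadoop;
execute (insert overwrite table cars_parquet
         select make, model, msrp from cars_stage) by hadoop;
disconnect from hadoop;
quit;
/* Step 4: a second explicit PROC SQL changes the character columns from STRING */
/* to CHAR(n) so SAS does not apply the 32k default length when reading back.   */
proc sql;
connect to hadoop (user=xxxxxx server="server.yourcompany.com" subprotocol=hive2);
execute (alter table cars_parquet change make make char(13)) by hadoop;
execute (alter table cars_parquet change model model char(40)) by hadoop;
disconnect from hadoop;
quit;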
When writing a SAS data set to Hive and storing it as an AVRO file type, use the methods described above and, again, note that you must be running Hive version 0.14. Older versions of Hive do not support the 'STORED AS AVRO' statement.
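Assuming Hive 0.14 or later, the Avro case is a one-line variation on the earlier examples; the table name ‘cars_avro’ is an assumed example.
/* Store the table in Hive using the Avro file format (requires Hive 0.14+). */
data hivelib.cars_avro (DBCREATE_TABLE_OPTS='STORED AS AVRO');
set locallib.cars;
run;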
For further information regarding the COLUMN_DELIMITER= option and the DBCREATE_TABLE_OPTS= LIBNAME or data set option, refer to the SAS/ACCESS 9.4 for Relational Databases documentation.
And don't forget to follow the Data Management section of the SAS Communities Library for more articles on how SAS Data Management works with Hadoop. Here are links to other posts in the series for reference:
Nice post, even if I miss some elaboration on the differences between the suggested file types. What are the pros/cons of using your own delimiter?
And what differs between Text File, Sequence File, ORC, RCFile, Parquet and AVRO?
What to choose when...
Linus,
Thank you for your comments and questions. This is the first in a series of articles around SAS and Hadoop to be published by me and my colleagues. Future articles will elaborate on your comments but I have provided some feedback below.
In addition to using Hadoop as a vehicle to perform data management and analytics with SAS, many of our customers are moving away from expensive SAN storage and utilizing Hadoop as a data store. Customers often prefer to preserve the source format when storing data in Hadoop. Being able to store SAS data sets in Hadoop as a certain file type or as a text file with a user defined delimiter allows customers to maintain consistency among file structures and to minimize the number of file types and structures in Hadoop.
Regarding your question about the differences between the various file types, please refer to the following link. This is an excellent article which provides valuable information on the similarities/differences.
http://inquidia.com/news-and-info/hadoop-file-formats-its-not-just-csv-anymore
Best Regards,
Brian
Hi Linus,
You may find the following link useful. It discusses the different file types available in Hadoop. The choice typically boils down to how you will process the data, for example with Hive, Impala, etc. Some of the formats implement columnar stores, others are row based. In general, text is the poorest performer in Hadoop, especially with Impala.
"Hadoop File Formats: It's not just CSV Anymore," by Chris Deptula
For more information you can take a look at the sites for the specific open source projects. You will also want to take a look at the SAS/ACCESS documentation for a specific release to see which file formats are supported by the engines.
Enjoyed reading the article above! Because of it, I now clearly understand how to persist native SAS data sets to Hadoop. Thank you for the great information you shared. Can you please share information about the prerequisites for learning Hadoop? Thank you.