SAS has a long history of accessing data, from mainframe data to RDBMS data to PC file data. You name it, and SAS can pretty much access it, so of course we have an access engine for Hadoop. If you are familiar with SAS/Access engines, don't let the name mislead you: SAS/Access to Hadoop is NOT your average SAS/Access engine, mostly because Hadoop is not your average data environment. To help you understand why, I've pointed out a few things to consider below that tie into the SAS Data Management for Hadoop article series.
Hadoop is OPEN SOURCE
At its origin, Hadoop was an open source project from Apache. The main idea behind the project was to enable high-speed searching across a huge variety of files: it grew out of the Apache Nutch web-search effort, drew on papers published by Google, and matured at Yahoo. Through inexpensive storage and distributed processing, high-performance search became possible. Since the boom of Big Data, several companies have come to the forefront and established their own distributions of Hadoop, including now-familiar names like Cloudera, Hortonworks, IBM BigInsights, MapR, and Pivotal. Companies like Teradata and Oracle have partnered with these organizations to offer a Hadoop solution option to their customers as well.
Hadoop is a distributed FILE SYSTEM
Traditional RDBMSs and the like are focused on DATA, not files. SAS/Access technology has a long history of partnering with these technology vendors to understand their data types, flavors of SQL, and database-specific utilities. In this way, SAS technology can take advantage of efficiencies like SQL pushdown, bulk loading, and even pushing SAS procedures to the database for processing, thus limiting the impact of data movement on performance.
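As a sketch of what pushdown looks like in practice (the library name, connection details, and table here are hypothetical placeholders), the SASTRACE system option reveals which SQL the engine hands off to the database:

```sas
/* Hypothetical Oracle connection; adjust the engine and credentials for your site */
libname mydb oracle path=orapath user=scott password=tiger;

/* Trace the SQL that the SAS/Access engine generates and passes to the database */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

/* This summarization can be pushed down as a single native SQL query,
   so only the aggregated rows travel back to SAS */
proc sql;
  select region, count(*) as order_count
  from mydb.sales
  group by region;
quit;
```

The SAS log then shows the generated SQL, confirming whether the GROUP BY ran in the database or the rows were pulled back for SAS to summarize.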
Hadoop has JARs…lots and lots of JARs
JARs: get used to them, and they aren't the kind your mom or grandmother used for canning fruits and veggies. Unlike an RDBMS access engine, where SAS needs the database client installed in order to communicate with the database, Hadoop uses .jar files and .xml files. SAS requires a set of .jar files and configuration .xml files in order to communicate with Hadoop, enabling things like HiveQL pushdown and Base SAS procedure pushdown. These files can change or move with each release of a distribution, and SAS/Access to Hadoop needs to stay in sync with them.
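As a rough sketch (all paths, the host name, and the port below are placeholders for your environment), pointing SAS at those files and assigning a Hadoop library looks something like this:

```sas
/* Tell SAS where the Hadoop client JARs and the cluster's config XML files live */
options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoopjars";
options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoopcfg";

/* Assign a library through the Hadoop engine, connecting to HiveServer2 */
libname hdp hadoop server="hive.example.com" port=10000 user=sasdemo;
```

Once the library is assigned, Hive tables appear to SAS programs like any other data sets, and eligible processing is translated into HiveQL and run on the cluster.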
Hadoop is YOUNG
If you look around, you can easily find seasoned, experienced DBAs as well as mature, stable database systems. RDBMSs have gone through their growing pains and have been supporting organizations for decades. Hadoop is YOUNG, making its debut in 2005. As such, things are changing fast.
One final note: before you upgrade your Hadoop environment, be sure to double-check any file location or content changes with your distribution!
Follow the Data Management section of the SAS Communities Library (Click Subscribe in the pink-shaded bar of the section) for more articles on how SAS Data Management works with Hadoop. Here are links to other posts in the series for reference: