When we talk about Data Lake, Hadoop is synonymous with the medium of implementation. Data in Hadoop can be accessed from SAS using SAS/ACCESS to Hadoop and SAS/ACCESS to ODBC. SAS/ACCESS to Hadoop and SAS/ACCESS to ODBC, each have their own place in a data lake, but using the former has inherent advantages over the later, ODBC. With the introduction of SAS/ACCESS to Spark (comes with SAS/ACCESS to Hadoop), it is even more appealing to use SAS/ACCESS to Hadoop. Here are 10 benefits of using SAS/ACCESS to Hadoop vs SAS/ACCESS to ODBC.
- SAS/ACCESS Interface to Hadoop reads data directly from the Hadoop Distributed File System (HDFS) when possible to improve performance. This differs from the traditional SAS/ACCESS engine behavior (ODBC), which exclusively uses database SQL to read and write data.
- There are more CREATE TABLE options available for ACCESS to Hadoop. A paper on it:
http://support.sas.com/resources/papers/proceedings17/SAS0409-2017.pdf
ACCESS to ODBC only supports: DBCREATE_TABLE_OPTS=
ACCESS to Hadoop has many options that assist in creating tables, such as
DBIDIRECTEXEC
|
- · DB CRE A TE _TA B LE _OP TS =
- · POST_STMT_OPTS=
- · POST_TABLE_OPTS=
- · PRE_STMT_OPTS=
- · PRE_TABLE_OPTS=
|
|
The most important “secret” that you need to know about the SAS create table options is that they enable you to specify
text strings that are placed into CREATE TABLE and CREATE TABLE AS SELECT (CTAS) statements that SAS generates.
Each option is responsible for placing the text string in a different location in the SQL statement. POST_STMT_OPTS= is
interesting because its behavior changes base on whether DBIDIRECTEXEC has been enabled.
- This covers ACCESS to Vertica vs ACCESS to ODBC but there are some good points in there (ignore INSERT/UPDATE/DELETE):
https://communities.sas.com/t5/SAS-Data-Management/SAS-Access-to-Vertica/td-p/130915
- Connections are easier, because they are DSN-less (SERVER=, PORT=, etc.)
- Advantages of using SAS/ACCESS to Hadoop, with respect to data transfer
- Streaming read/write is supported. This means moving data is much faster than ODBC.
- More HiveQL functions are eligible for passing to Hive.
- Better internationalization support with ACCESS to Hadoop.
- In the SAS Viya space you get Multi-node support with the CAS Data Connector
- Better integration with Hadoop High Availability technology. For example, the libname to connect to a Hive HA enabled system can be as simple as “libname x hadoop”.
PROC SQOOP procedure enables users to access an Apache Sqoop utility from a SAS session to transfer data
between a database and HDFS. Using SAS PROC SQOOP lets you submit Sqoop commands from within your SAS
application to your Hadoop cluster.
PROC SQOOP is licensed with SAS/ACCESS® Interface for Hadoop, it’s not part of the Base SAS® license. PROC
SQOOP is supported in UNIX and Windows SAS.
- Bulk loading is better using SAS/ACCESS to Hadoop Vs ODBC, one of the main reason is because, when BULKLOAD=YES parameter is set, the engine uses HDFS to load data into Hive
- Advantages of SAS/ACCESS to Hadoop with respect to In-Database processing, SAS In-database Procedure support (SUMMARY/MEANS, REPORT, SORT, TABULATE, FREQ). Notice, RANK is not supported.
- Most importantly ACCESS to Hadoop supports In-Database Technologies for Hadoop, really vital to avoid data movement to do Data Management activities. SAS Embedded Process in Hadoop plays a key role here, which ACCESS to Hadoop makes use of. Some areas of importance here are, (***Following are not supported with SAS/ACCESS to ODBC***)
- If you are planning to run SAS EP based process at Hadoop Cluster, SAS/ACCESS interface to Hadoop is must. The SAS EP based process let you use the Hadoop cluster resources to run data calculation at source and improve the performance. It also let you run selective SAS PROCS at Hadoop cluster.
- If you are planning to run SAS Data cleansing and Standardization operation at Hadoop cluster using QKB, these are SAS EP based process and you have to have the SAS/ACCESS interface to Hadoop to execute it.
- If you are planning to use SAS Data-loader for Hadoop (DL4H) software against a Hadoop cluster, you have to have SAS/ACCESS interface for Hadoop and SAS EP to support the DLH operations.
- If you are planning to read parallel data from Hadoop cluster to SAS LASR/CAS, you have to have the SAS/ACCESS interface for Hadoop with SAS EP.
- Hadoop optimized SQL pushdown using PROC SQL
- Support for PROC HDMD to create the metadata file in Hive library, stored in HDFS. This is useful for faster direct data load from HDFS and running In-Database scoring.
- Code Accelerator support for Hadoop. Makes use of Hive and HDMD. Again, this is really important when data movement is of concern, SAS code (DS2) can be pushed to Hadoop for processing.
- Beginning in 9.4M6 and Viya 3.4, In-Database Technologies for Hadoop includes In-Database Technologies for Spark.
- SAS Data Connect Accelerator for Spark
- SAS Scoring Accelerator for Spark
- SAS Embedded Process for Spark
Requirements for SAS In-Database Technologies for Spark
SAS In-Database Technologies for Spark requires SAS In-Database Technologies for Hadoop, which is separately licensed.
The SAS Embedded Process for Spark is included with SAS In-Database Technologies for Spark. The SAS Embedded
Process must also be installed on the Spark cluster to enable SAS Data Connect Accelerator for Spark. SAS recommends
installing the latest version of the SAS Embedded Process.
For information about supported versions and requirements: Click here
- The following applies to In-Database Technologies for Hadoop:
- For Viya 3.4: The platform option in the Hadoop caslib can be used to control whether the EP job is run using MapReduce (default) or Spark. This applies to both the Data Connect Accelerator and Scoring Accelerator. If platform=sparkis used, Scala code is generated by the application which is then submitted to the cluster to run as a Spark job.
- For SAS 9.4M6: The global option, HADOOPPLATFORM, is used to control how EP jobs are run. Setting this option equal to Spark causes the application to generate Scala code that describes the task to be run and submits that to the cluster as a Spark job. This isn’t described in external documentation yet but will be once M6 is available to customers.
In both cases, YARN Resource Manager UI can be used to examine how the job was run within the Spark execution environment if detailed information is needed. The SAS/CAS logs will show information about the results of the job.
In-Database Technologies for Spark is available in Viya 3.4 as a limited availability option.
Essentially SAS/ACCESS to ODBC connection has generic data access support. There are no Hadoop specific capabilities added or optimized.