Cloudera Impala tables can also serve as a data source for loading data into SAS® Cloud Analytic Services (CAS) for analysis. This blog post describes the steps required to configure and connect SAS Viya with Cloudera Impala.
SAS Viya connects to Cloudera Impala through a separate database engine, the SAS Data Connector to Impala, which is included in SAS/ACCESS® Interface to Impala (on SAS Viya). Unlike the SAS Data Connector to Hadoop, which uses Hadoop JAR files, the Data Connector to Impala works through an ODBC driver.
To connect SAS Viya with Cloudera Impala, you configure an ODBC connection on the CAS server controller using the Impala ODBC driver with the unixODBC driver manager. To install, configure, and validate the Impala ODBC connection, follow these steps:
1. Download the unixODBC driver manager and the Impala ODBC driver.
Refer to the Cloudera ODBC Driver Installation Guide for the compatible version of the unixODBC driver manager. unixODBC is free software and can be downloaded from the unixODBC download site. If your CAS server controller (UNIX/Linux) environment already has an ODBC driver manager, you can use it and append the rest of the configuration to the existing config files. To configure the Impala ODBC connection, you also need the latest version of the Impala ODBC driver, which can be downloaded from the Cloudera site.
2. Install the unixODBC driver manager.
On the CAS server controller, extract and configure the unixODBC software as the 'root' user. You can untar the software in any location; in this example, it is extracted under the /opt folder.
$ gunzip unixODBC-2.3.4.tar.gz
$ tar -xvf unixODBC-2.3.4.tar
Configure unixODBC using the following UNIX statements:
$ cd /opt/unixODBC-2.3.4
$ ./configure --prefix=/opt/unixODBC-2.3.4 --disable-gui --disable-drivers
$ make
3. Install the Impala ODBC driver.
Using yum, install the Impala ODBC driver on the CAS server controller. In the following example, the .rpm file is located under the /opt folder. By default, the Impala ODBC driver is installed under the /opt/cloudera/impalaodbc folder.
$ cd /opt
$ yum --nogpgcheck localinstall ClouderaImpalaODBC-2.5.35.1006-1.el7.x86
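To confirm that the driver installed where expected, you can query the RPM database and list the default install location (the paths below assume the default RPM layout; adjust them if you installed elsewhere):
$ rpm -qa | grep -i ClouderaImpalaODBC
$ ls /opt/cloudera/impalaodbc/lib/64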
4. Configure the Impala driver with unixODBC.
Once the unixODBC driver manager and the Impala ODBC driver are installed in the environment, you must re-configure unixODBC to include the Impala ODBC driver in its search path. The following 'configure' and 'make' statements accomplish this.
$ cd /opt/unixODBC-2.3.4
$ export LD_LIBRARY_PATH=/opt/unixODBC-2.3.4/lib
$ ./configure --prefix=/opt/cloudera/impalaodbc --with-unixodbc=/opt/unixODBC-2.3.4
$ make
$ make install
5. Update the odbc.ini and odbcinst.ini files.
The Cloudera ODBC software provides sample .ini files under the cloudera/impalaodbc/Setup/ folder. You can update these files in place, or copy them to a location of your choice and update odbc.ini with the server name, port, and database schema of the Hadoop Impala environment. In the following example, the config files are updated in place in the /opt/cloudera/impalaodbc/Setup/ directory.
Most of the files remain unchanged except for the odbc.ini file, where you need to specify the host name and port number where the Impala daemon is running. Check your Hadoop cluster for the Impala daemon process; it might be running on nodes other than the name node. In the following example of a three-node Hadoop cluster (sascdh01 (name node), sascdh02, sascdh03), the Impala daemon runs on all three servers.
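To verify which nodes are running the Impala daemon, you can check the process list on each node (or use the Cloudera Manager UI); for example:
$ ps -ef | grep impalad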
Update the following section in the ODBC.ini file with the host name and port number:
# Values for HOST, PORT, KrbFQDN, and KrbServiceName should be set here.
# They can also be specified on the connection string.
HOST=sascdh01.race.sas.com
PORT=21050
Database=default
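For reference, a complete DSN entry in odbc.ini might look like the following sketch. The DSN name is illustrative, and the driver library path is an assumption based on the default RPM install location; verify the .so name under /opt/cloudera/impalaodbc/lib/64 on your system.
[ODBC Data Sources]
impala_dsn=Cloudera ODBC Driver for Impala
[impala_dsn]
# Driver path assumes the default install location
Driver=/opt/cloudera/impalaodbc/lib/64/libclouderaimpalaodbc64.so
HOST=sascdh01.race.sas.com
PORT=21050
Database=default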
If you have additional ODBC database connections defined, you can add them to the same odbc.ini file. For example, I have a Postgres database connection defined, and I appended that information to the "/opt/cloudera/impalaodbc/Setup/odbc.ini" file.
sasclient_dvdrental=sasclient_dvdrental
[sasclient_dvdrental]
Driver=/opt/sas/viya/home/lib64/psqlodbcw.so
ServerName=sasclient.race.sas.com
username=postgres
password=#######
database=dvdrental
port=5432
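Before involving CAS, you can smoke-test a DSN directly from the CAS server controller with unixODBC's isql utility. The DSN name below is illustrative; substitute the name you defined in your odbc.ini, and adjust the paths to match your install locations.
$ export ODBCINI=/opt/cloudera/impalaodbc/Setup/odbc.ini
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/unixODBC-2.3.4/lib:/opt/cloudera/impalaodbc/lib/64
$ isql -v impala_dsn
A successful connection drops you at a SQL prompt, where a simple query (for example, show tables;) confirms the driver, driver manager, and Impala daemon are all wired together.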
6. Edit the vars.yml file.
While preparing for the SAS® Viya™ installation, update the vars.yml file and include the following lines under the file's CAS_SETTINGS section to configure the Impala ODBC shared library path. Enter the statements in vars.yml exactly as listed, including the spaces and numerical prefixes. Depending on how you configured your Impala ODBC driver, you might need to specify the odbc.ini file, the odbcinst.ini file, or both. The following example includes both files:
1: ODBCINI=/opt/cloudera/impalaodbc/Setup/odbc.ini
2: ODBCINST=/opt/cloudera/impalaodbc/Setup/odbcinst.ini
3: CLOUDERAIMPALAODBC=/opt/cloudera/impalaodbc/Setup/odbc.ini
4: LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/unixODBC-2.3.4:/opt/cloudera/impalaodbc/lib/64
With this information in vars.yml, the cas.settings file reflects these parameters when the SAS Viya software is installed.
7. Update the cas.settings file.
If you are configuring the Cloudera Impala ODBC driver after the CAS (Viya) installation, you need to manually update the ~/sas/viya/home/SASFoundation/cas.settings file on the CAS server controller to include the following environment variables.
export ODBCINI=/opt/cloudera/impalaodbc/Setup/odbc.ini
export ODBCINST=/opt/cloudera/impalaodbc/Setup/odbcinst.ini
export CLOUDERAIMPALAODBC=/opt/cloudera/impalaodbc/Setup/odbc.ini
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/unixODBC-2.3.4:/opt/cloudera/impalaodbc/lib/64
export ODBCSYSINI=/opt/cloudera/impalaodbc/Setup/
Note: ODBCSYSINI is also included because it is required for the environment; it points to the folder where odbcinst.ini resides.
8. Validate the SAS Viya connection to Impala.
After you have set up the environment following these steps, you can run a SAS program from SAS® Studio that creates a CASLIB with source type "impala" and loads a table into CAS from that source. The following code creates such a CASLIB and loads data from an Impala table into CAS. The SAS log shows that, while loading a table from the defined source, CAS performed a serial LoadTable action using the SAS Data Connector to Impala.
cas mySession sessopts=(messagelevel=all);
caslib implib datasource=(srctype="impala",
                          username="hadoop",
                          server="sascdh01.race.sas.com",
                          database="default");
proc casutil;
   load casdata="s_heart" incaslib="implib" outcaslib="implib"
        casout="s_heart" replace;
   list tables incaslib="implib";
quit;
cas mySession terminate;
SAS log extract:
57 proc casutil ;
NOTE: The UUID '55b2c571-4656-034c-8001-e8a7b6ba55bd' is connected using session MYSESSION.
58 load casdata="s_heart" incaslib="implib" outcaslib="implib" casout="s_heart" replace ;
NOTE: Performing serial LoadTable action using SAS Data Connector to Impala.
NOTE: Cloud Analytic Services made the external data from s_heart available as table S_HEART in caslib implib.
NOTE: The Cloud Analytic Services server processed the request in 0.974435 seconds.
59 list tables incaslib="implib";
NOTE: Cloud Analytic Services processed the combined requests in 0.016342 seconds.
60 quit ;
For more information about this topic:
SAS® Viya™ 3.2: Deployment Guide