
SAS Data Connector/SAS Data Connect Accelerator Logging


The SAS option I use the most when accessing a third-party database from SAS is the very intuitive SASTRACE option.

 

Who has not used:

   options sastrace=",,,d" sastraceloc=saslog nostsuffix ;

to monitor which SQL query SAS is sending to the database?

 

Yes, you are a “comma comma comma d” fanatic!

 

But in SAS Viya, as you know, a database is accessed differently: not only from SAS with the traditional SAS/ACCESS engine, but also from SAS Cloud Analytic Services (CAS) with the Data Connectors or Data Connect Accelerators. Does this SASTRACE option work for CAS as well, to identify what CAS is sending to the database? Nope. We need something else.

 

The first thing to note is that there is no real documentation (that I’m aware of) of the SAS Data Connector/SAS Data Connect Accelerator logging options.

 

The main source of information is this wonderful paper from SAS Global Forum 2017: Introduction to SAS® Data Connectors and SAS® Data Connect Accelerators on SAS® Viya™.

 

I encourage you to read this document for more details on the Data Connectors and Data Connect Accelerators in general and their tracing options in particular.

 

In this article, I’ll walk through two examples of how to set the Data Connector/Data Connect Accelerator logging options.

 

SAS Data Connector Tracing

There are three options for tracing SAS Data Connector communication:

  • DRIVER_TRACE to set the tracing levels
  • DRIVER_TRACEFILE to specify the trace file
  • DRIVER_TRACEOPTIONS to control the properties of the trace file

You can set these options in the CASLIB definition, or for a single load or other specific operation using the dataSourceOptions parameter (a per-load sketch appears at the end of this section).

 

Here is an example:

 

caslib PG datasource=(srctype="postgres",authenticationdomain="PGAuth",server="mydb.example.com",
   database="dvdrental",schema="public",numreadnodes=10,numwritenodes=10,
   DRIVER_TRACE="SQL",
   DRIVER_TRACEFILE="/tmp/sasdcpg.log",
   DRIVER_TRACEOPTIONS="TIMESTAMP|APPEND") libref=PG ;

 

In this example:

  • DRIVER_TRACE: All SQL statements that SAS sends to the database are traced (“SQL”). You can instead trace driver-specific information (“DRIVER”), API method calls (“API”), or everything (“ALL”)
  • DRIVER_TRACEFILE: The target file for tracing information, here /tmp/sasdcpg.log. Because of the massively parallel nature of CAS, this file is created (automatically, when tracing starts) on each CAS node involved: not only on the CAS controller but also on the CAS workers. This is particularly useful in multi-node loading scenarios, where each CAS worker sends a different SQL query to the database. There is no way to pipe the tracing information directly into the SAS log (as the traditional “sastraceloc=saslog” option does); you have to view the trace file on the operating system (a small workaround sketch follows this list)
  • DRIVER_TRACEOPTIONS: The trace file is opened in update mode (“APPEND”), and each trace record is prefixed with a timestamp (“TIMESTAMP”). You can also add thread identification (“THREADSTAMP”)
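
Since the trace lands in a server-side file, one possible workaround for viewing it without leaving SAS is to read the file back with a DATA step. This is only a sketch: it assumes your SAS session can read /tmp on the CAS controller, which is not guaranteed in a distributed deployment, and it only shows the controller’s file, not the workers’.

/* Sketch: echo the CAS controller's trace file into the SAS log.        */
/* Assumes the SAS compute session sees the same /tmp as the controller. */
data _null_;
   infile "/tmp/sasdcpg.log" lrecl=32767 truncover;
   input;
   putlog _infile_;
run;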

The CASLIB assignment doesn’t trigger the creation of the trace file yet, since no operation on the database has been done.

 

Now, let’s run a multi-node data load statement from this CASLIB data source (this is an MPP CAS environment, and we have set numreadnodes appropriately):

 

proc casutil incaslib="PG" outcaslib="PG" ;
   load casdata="film" casout="film" replace ;
quit ;

 

As a result, on a 5-node CAS environment (1 CAS controller + 4 CAS workers), we get a sasdcpg.log file created in /tmp on every CAS node. Here are the contents:

 

CAS Controller:

 

10.01.04.36:          DEFAULT: SELECT * FROM "public"."film" WHERE 0=1
10.01.04.36:          DRIVER SQL: "SELECT * FROM "public"."film" WHERE 0=1" on connection 0x000000009c0d1480

 

CAS Worker 1:

 

10.01.04.37:          DEFAULT: SELECT * FROM (SELECT * FROM "public"."film") sliceSQL WHERE ABS(MOD("film_id", 4)) = 0 OR "film_id" IS NULL
10.01.04.37:          DRIVER SQL: "SELECT * FROM (SELECT * FROM "public"."film") sliceSQL WHERE ABS(MOD("film_id", 4)) = 0 OR "film_id" IS NULL" on connection 0x0000000024051520

 

CAS Worker 2:

 

10.01.04.37:          DEFAULT: SELECT * FROM (SELECT * FROM "public"."film") sliceSQL WHERE ABS(MOD("film_id", 4)) = 1
10.01.04.37:          DRIVER SQL: "SELECT * FROM (SELECT * FROM "public"."film") sliceSQL WHERE ABS(MOD("film_id", 4)) = 1" on connection 0x000000008cff6520

 

CAS Worker 3:

 

10.01.04.37:          DEFAULT: SELECT * FROM (SELECT * FROM "public"."film") sliceSQL WHERE ABS(MOD("film_id", 4)) = 2
10.01.04.37:          DRIVER SQL: "SELECT * FROM (SELECT * FROM "public"."film") sliceSQL WHERE ABS(MOD("film_id", 4)) = 2" on connection 0x00000000900a4520

 

CAS Worker 4:

 

10.01.04.37:          DEFAULT: SELECT * FROM (SELECT * FROM "public"."film") sliceSQL WHERE ABS(MOD("film_id", 4)) = 3
10.01.04.37:          DRIVER SQL: "SELECT * FROM (SELECT * FROM "public"."film") sliceSQL WHERE ABS(MOD("film_id", 4)) = 3" on connection 0x00000000a4274520

 

You can now observe and understand how a multi-node load works:

  • The CAS controller sends a dummy query (WHERE 0=1) to get the table structure and identify the first suitable numeric column
  • The CAS controller instructs the N CAS workers to each send a specific query built with the MOD function: worker i retrieves the slice of rows WHERE ABS(MOD("film_id", N)) = i (worker 0 also picks up the NULL values), so each worker loads a different portion of the source table into CAS
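
As mentioned earlier, you don’t have to bake these options into the CASLIB: you can pass them for a single load through dataSourceOptions. Here is a minimal sketch using the table.loadTable action against the same PG caslib (the separate trace file name is my own choice, for illustration):

proc cas;
   /* Same load as above, but with tracing enabled for this load only */
   table.loadTable /
      caslib="PG"
      path="film"
      casOut={caslib="PG", name="film", replace=true}
      dataSourceOptions={driver_trace="SQL",
                         driver_tracefile="/tmp/sasdcpg_load.log",
                         driver_traceoptions="TIMESTAMP|APPEND"};
quit;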

NB: These three options do not work with the SAS Data Connector for Hadoop (dataTransferMode="serial"), for some reason. For Hadoop, use the SAS Data Connect Accelerator options (TRACEFLAGS and TRACEFILE) instead. See below.

 

SAS Data Connect Accelerator Tracing

There are three options for tracing SAS Data Connect Accelerator communication:

  • TRACEFLAGS to control the properties of the trace file
  • TRACEFILE to specify the trace file
  • DFDEBUG to get additional information back from the EP in the SAS Log (not in the tracing file)

You can set these options in the CASLIB definition, or for a single load or other specific operation using the dataSourceOptions parameter (a per-load sketch follows the example below).

 

Here is an example:

 

caslib HIVE datasource=(srctype="hadoop",server="myhdp.example.com",authenticationdomain="HiveCon",
   dataTransferMode="serial",hadoopconfigdir="/opt/sas/viya/config/data/hadoop/conf",
   hadoopjarpath="/opt/sas/viya/config/data/hadoop/lib",schema="cashive",numreadnodes=10,numwritenodes=10,
   TRACEFLAGS="TIMESTAMP|APPEND",
   TRACEFILE="/tmp/sasdcaHadoop.log",
   DFDEBUG="EPALL") libref=HIVE ;

 

In this example:

  • TRACEFLAGS: The trace file is opened in update mode (“APPEND”), and each trace record is prefixed with a timestamp (“TIMESTAMP”).
  • TRACEFILE: The target file for tracing information, here /tmp/sasdcaHadoop.log.
  • DFDEBUG: All debugging information from the SAS Embedded Process (“EPALL”) is sent back to the SAS log. “EPALL” works only for Hadoop. The other possible value, “SQLDETAILS”, doesn’t add any information for Hadoop, so I guess it’s useful only with Teradata. In short: DFDEBUG=”EPALL” for Hadoop, DFDEBUG=”SQLDETAILS” for Teradata.
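
As with the Data Connector, these options can also be passed for a single load through dataSourceOptions. A minimal sketch with table.loadTable (again, the alternate trace file name is just for illustration):

proc cas;
   /* Per-load tracing and EP debugging, instead of CASLIB-level options */
   table.loadTable /
      caslib="HIVE"
      path="test_table"
      casOut={caslib="HIVE", name="test_table", replace=true}
      dataSourceOptions={traceflags="TIMESTAMP|APPEND",
                         tracefile="/tmp/sasdcaHadoop_load.log",
                         dfdebug="EPALL"};
quit;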

Here is a sample output of the DFDEBUG=”EPALL” option:

 

 73         proc casutil incaslib="HIVE" outcaslib="HIVE" ;
 NOTE: The UUID 'c4cdb27f-4e4b-4340-94d7-43e9dfd1628b' is connected using session MYSESSION.
 74            load casdata="test_table" casout="test_table" replace ;
 NOTE: Performing parallel LoadTable action using SAS Data Connect Accelerator for Hadoop.
 NOTE: SAS Embedded Process tracking URL: http://myhdp.example.com:8088/proxy/application_1534533938504_0016/
 NOTE: Job Status ......: SUCCEEDED
 NOTE: Job ID ..........: 1534533938504_16
 NOTE: Job Name ........: SASEP SuperReader /apps/hive/warehouse/cashive.db/test_table
 NOTE: File splits..... : 4
 NOTE: Input records ...: 16000
 NOTE: Input bytes .....: 590288
 NOTE: Output records ..: 0
 NOTE: Output bytes ....: 0
 NOTE: Transcode errors : 0
 NOTE: Truncations .....: 0
 NOTE: Map Progress ....: 100.00%
 NOTE: Cloud Analytic Services made the external data from test_table available as table TEST_TABLE in caslib HIVE.
 NOTE: The Cloud Analytic Services server processed the request in 23.648689 seconds.
 75         quit ;
 
 NOTE: PROCEDURE CASUTIL used (Total process time):
       real time           23.65 seconds
       cpu time            0.00 seconds

 

When used in a parallel context (dataTransferMode="parallel"), the trace file will be located on the CAS controller only.

 

When used in a serial context (dataTransferMode="serial"), the trace file can be located on the CAS controller only (no multi-node load) or on all CAS nodes (multi-node load). You can check the multi-node SQL queries by using this command:

 

$ more sasdcaHadoop.log | grep -i sql
    In:        HandleType=TKTS_HANDLE_DBC=2, Handle=0x7fa1524ddb60, RecNumber=1, Sqlstate=0x7fa159a77fe0, NativeErrorPtr=0x7fa159a77fd4, MessageText=0x7fa159a777d0, BufferLength=513, TextLengthPtr=0x7fa159a777d0
    Out:       *Sqlstate=`01S02`,*NativeErrorPtr=-2130708291,MessageText=`Current catalog set to HIVE`,TextLengthPtr=27
    In:        HandleType=TKTS_HANDLE_DBC=2, Handle=0x7fa1524ddb60, RecNumber=2, Sqlstate=0x7fa159a77fe0, NativeErrorPtr=0x7fa159a77fd4, MessageText=0x7fa159a777d0, BufferLength=513, TextLengthPtr=0x7fa159a777d0
    In:        StatementHandle=0x7fa1524d6fe0, Sql=`SELECT * FROM (SELECT * FROM `cashive`.`test_table`) sliceSQL WHERE ABS(PMOD(`index_pk`, 4)) = 0 OR `index_pk` IS NULL`, SqlL=118
$

 

Hope this helps. Thanks for reading.
