Cloudera Impala tables can also serve as a data source for loading data into SAS® Cloud Analytic Services (CAS) for analysis. This blog post describes the steps required to configure and connect SAS Viya with Cloudera Impala.
SAS Viya connects to Cloudera Impala through a separate database engine, the SAS Data Connector to Impala, which is included in SAS/ACCESS® Interface to Impala (on SAS Viya). Unlike the SAS Data Connector to Hadoop, which uses Hadoop JAR files, the Data Connector to Impala works through an ODBC driver.
To connect SAS Viya with Cloudera Impala, you configure an ODBC connection on the CAS server controller using the Impala ODBC driver with the unixODBC driver manager. To install, configure, and validate the Impala ODBC connection, follow these steps:
1. Download the unixODBC driver manager and the Impala ODBC driver.
Refer to the Cloudera ODBC Driver Installation Guide for the compatible version of the unixODBC driver manager. unixODBC is free software and can be downloaded from the unixODBC download site. If your CAS server controller (UNIX/Linux) environment already has an ODBC driver manager, you can use it and append the rest of the configuration to the existing config files. To configure the Impala ODBC connection, you also need the latest version of the Impala ODBC driver, which can be downloaded from the Cloudera site.
2. Install the unixODBC driver manager.
On the CAS server controller, extract and configure the unixODBC software as the 'root' user. You can untar the software in any location; in this example, it is extracted under the /opt folder.
$ gunzip unixODBC-2.3.4.tar.gz
$ tar -xvf unixODBC-2.3.4.tar
Configure unixODBC using the following UNIX statements:
$ cd /opt/unixODBC-2.3.4
$ ./configure --prefix=/opt/unixODBC-2.3.4 --disable-gui --disable-drivers
$ make
3. Install the Impala ODBC driver.
Using yum, install the Impala ODBC driver on the CAS server controller. In the following example, the .rpm file is located under the /opt folder. By default, the Impala ODBC driver is installed under the /opt/cloudera/impalaodbc folder.
$ cd /opt
$ yum --nogpgcheck localinstall ClouderaImpalaODBC-2.5.35.1006-1.el7.x86
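To confirm that the driver installed where expected, you can query the RPM database and list the default install location (the paths below assume the default RPM layout; adjust them if you installed elsewhere):
$ rpm -qa | grep -i ClouderaImpalaODBC
$ ls /opt/cloudera/impalaodbc/lib/64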
4. Configure the Impala driver with unixODBC.
Once the unixODBC driver manager and the Impala ODBC driver are installed in the environment, you must re-configure unixODBC to include the Impala ODBC driver in its search path. The following 'configure' and 'make' statements accomplish this.
$ cd /opt/unixODBC-2.3.4
$ export LD_LIBRARY_PATH=/opt/unixODBC-2.3.4/lib
$ ./configure --prefix=/opt/cloudera/impalaodbc --with-unixodbc=/opt/unixODBC-2.3.4
$ make
$ make install
5. Update the odbc.ini and odbcinst.ini files.
The Cloudera ODBC software provides sample .ini files under the cloudera/impalaodbc/Setup/ folder. You can update these files in place, or copy them to a location of your choice and update odbc.ini with the server name, port, and database schema of the Hadoop Impala environment. In the following example, the config files are updated in place in the /opt/cloudera/impalaodbc/Setup/ directory.
Most of the files remain unchanged except for the odbc.ini file, where you need to specify the host name and port number where the Impala daemon is running. Check your Hadoop cluster for the Impala daemon process; it might be running on nodes other than the name node. In the following example of a three-node Hadoop cluster (sascdh01 (name node), sascdh02, sascdh03), the Impala daemon runs on all three servers.
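To verify which nodes are running the Impala daemon, you can check the process list on each node (or use the Cloudera Manager UI); for example:
$ ps -ef | grep impalad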
Update the following section in the ODBC.ini file with the host name and port number:
# Values for HOST, PORT, KrbFQDN, and KrbServiceName should be set here.
# They can also be specified on the connection string.
HOST=sascdh01.race.sas.com
PORT=21050
Database=default
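For reference, a complete DSN entry in odbc.ini might look like the following sketch. The DSN name is illustrative, and the driver library path is an assumption based on the default RPM install location; verify the .so name under /opt/cloudera/impalaodbc/lib/64 on your system.
[ODBC Data Sources]
impala_dsn=Cloudera ODBC Driver for Impala
[impala_dsn]
# Driver path assumes the default install location
Driver=/opt/cloudera/impalaodbc/lib/64/libclouderaimpalaodbc64.so
HOST=sascdh01.race.sas.com
PORT=21050
Database=default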
If you have additional ODBC database connections defined, you can add them to the same odbc.ini file. For example, I have a Postgres database connection defined, and I appended that information to the "/opt/cloudera/impalaodbc/Setup/odbc.ini" file.
sasclient_dvdrental=sasclient_dvdrental
[sasclient_dvdrental]
Driver=/opt/sas/viya/home/lib64/psqlodbcw.so
ServerName=sasclient.race.sas.com
username=postgres
password=#######
database=dvdrental
port=5432
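Before involving CAS, you can smoke-test a DSN directly from the CAS server controller with unixODBC's isql utility. The DSN name below is illustrative; substitute the name you defined in your odbc.ini, and adjust the paths to match your install locations.
$ export ODBCINI=/opt/cloudera/impalaodbc/Setup/odbc.ini
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/unixODBC-2.3.4/lib:/opt/cloudera/impalaodbc/lib/64
$ isql -v impala_dsn
A successful connection drops you at a SQL prompt, where a simple query (for example, show tables;) confirms the driver, driver manager, and Impala daemon are all wired together.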
6. Edit the vars.yml file.
While preparing for the SAS® Viya™ installation, update the vars.yml file and include the following lines under the file's CAS_SETTINGS section to configure the Impala ODBC shared library path. Enter the statements in vars.yml exactly as listed, including the spaces and numerical prefixes. Depending on how you configured your Impala ODBC driver, you might need to specify the odbc.ini file, the odbcinst.ini file, or both. The following example includes both files:
1: ODBCINI=/opt/cloudera/impalaodbc/Setup/odbc.ini
2: ODBCINST=/opt/cloudera/impalaodbc/Setup/odbcinst.ini
3: CLOUDERAIMPALAODBC=/opt/cloudera/impalaodbc/Setup/odbc.ini
4: LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/unixODBC-2.3.4:/opt/cloudera/impalaodbc/lib/64
With this information in vars.yml, the cas.settings file reflects these parameters when the SAS Viya software is installed.
7. Update the cas.settings file.
If you are configuring the Cloudera Impala ODBC driver after the CAS (Viya) installation, you need to manually update the ~/sas/viya/home/SASFoundation/cas.settings file on the CAS server controller to include the following environment variables.
export ODBCINI=/opt/cloudera/impalaodbc/Setup/odbc.ini
export ODBCINST=/opt/cloudera/impalaodbc/Setup/odbcinst.ini
export CLOUDERAIMPALAODBC=/opt/cloudera/impalaodbc/Setup/odbc.ini
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/unixODBC-2.3.4:/opt/cloudera/impalaodbc/lib/64
export ODBCSYSINI=/opt/cloudera/impalaodbc/Setup/
Note: ODBCSYSINI is also included because it is required for the environment; it points to the folder where odbcinst.ini resides.
8. Validate the SAS Viya connection to Impala.
After you have set up the environment following these steps, you can run a SAS program from SAS® Studio that creates a CASLIB with source type "impala" and loads a table into CAS from that source. The following code creates such a CASLIB and loads data from an Impala table into CAS. The SAS log shows that, while loading a table from the defined source, CAS performed a serial LoadTable action using the SAS Data Connector to Impala.
cas mySession sessopts=(messagelevel=all);
caslib implib datasource=(srctype="impala",
                          username="hadoop",
                          server="sascdh01.race.sas.com",
                          database="default");
proc casutil;
   load casdata="s_heart" incaslib="implib" outcaslib="implib"
        casout="s_heart" replace;
   list tables incaslib="implib";
quit;
cas mySession terminate;
SAS log extract:
57 proc casutil ;
NOTE: The UUID '55b2c571-4656-034c-8001-e8a7b6ba55bd' is connected using session MYSESSION.
58 load casdata="s_heart" incaslib="implib" outcaslib="implib" casout="s_heart" replace ;
NOTE: Performing serial LoadTable action using SAS Data Connector to Impala.
NOTE: Cloud Analytic Services made the external data from s_heart available as table S_HEART in caslib implib.
NOTE: The Cloud Analytic Services server processed the request in 0.974435 seconds.
59 list tables incaslib="implib";
NOTE: Cloud Analytic Services processed the combined requests in 0.016342 seconds.
60 quit ;
For more information about this topic:
SAS® Viya™ 3.2: Deployment Guide