SAS language capabilities to efficiently process data residing in CAS

3 Likes

With SAS Viya, you have the possibility, depending on the SAS solution you are using, to access visual interfaces or a programming environment in order to interact with the new in-memory analytics engine called Cloud Analytic Services (CAS). Among the programming languages that can work with Viya and CAS, one can choose Python, Lua, Java, etc. but one can still use the SAS Language to manipulate huge volumes of data, in-memory and in parallel.

From a data management standpoint, the SAS Language provides different ways to manipulate data residing in CAS.

The 4 main (data processing) language components that leverage CAS efficiently are:

DATA step
DS2
FedSQL
Transpose

Let’s walk through those language components.

DATA Step

DATA step is the core of SAS Language. And good news, it is CAS-enabled, and provides much more features than DATA step in the LASR or Hadoop worlds.

“People thought that it couldn't be done and we've finally parallelized the DATA step across all of your data.” -- Paul Kent at Strata+Hadoop 2016 (YouTube)

In SAS Viya, the DATA step runs in one of two environments:

In a SAS client session, where it runs in a single thread using the SAS V9 engine.
In a CAS server session, where it runs in a single thread or in multiple threads using the CAS engine.

When the DATA step runs in CAS, it runs either in a single thread or in multiple threads:

Single thread – the DATA step runs as one process on an entire, unpartitioned table. Running the DATA step in a single thread is not recommended for very large tables.
Multiple threads – the DATA step runs in multiple threads, in parallel. Each thread operates on a different part of the data. The DATA step can run in multiple threads in a single-machine CAS server or in a distributed CAS server.

For a DATA step to run in CAS, the following must be true:

Both the input and output data must be CAS tables. CAS tables are named using a CAS engine libref.
The output CAS table must contain at least 1 variable: CAS does not support zero-column tables.
All language elements must be supported in the CAS DATA step.

Not all SAS language elements are supported in CAS. Language elements that are supported in CAS are a subset of the SAS language. If you use a language element that is not supported in the CAS DATA step, then the DATA step will automatically run in SAS rather than generate an error.

In addition, some DATA step language elements that rely on inter-row dependent operations might return unexpected results when they are used in a multi-threaded context (RETAIN, LAG, DIF, etc.). Some others are not supported (POINT=, OBS= for example).

A DATA step can be submitted in CAS using one of the following methods:

A SAS MVA client running a DATA step (a hidden CAS action called “dataStep.runBinary” is then run in CAS)
- Data sources and targets are identified using CAS engine librefs
A CAS action “dataStep.runCode” called from the SAS CASL language specification or Python or Lua, etc.
- Data sources and targets are identified using CASLIB names

Link to the documentation: Data Step in CAS

Example:

libname libnir cas caslib=casnir ;
data libnir.result(copies=0 replace=yes) ;
   merge libnir.megacorp_500m(in=a) libnir.right2(rename=(idr=id)) ;
   by id ;
   if a ;
run ;

This DATA step will generate the following message in the log indicating that it runs in CAS:

NOTE: Running DATA step in Cloud Analytic Services.

DS2

DS2 is a SAS proprietary programming language that is appropriate for advanced data manipulation. It also includes additional data types, ANSI SQL types, programming structure elements, and user-defined methods and packages. The DS2 language and the DATA step share core features. In addition, DS2 is intended to run in SAS as well as in other third-parties that support the SAS Embedded Process like Hadoop, Teradata and some others.

CAS is a perfect environment for DS2 since the THREAD / DATA program paradigm elegantly maps two stage map/reduce type logic for you on the grid. THREAD program logic runs wide on all workers while any logic in the DATA program runs on one worker gathering data from all the THREAD programs.

A DS2 program is classified as either a serial program, parallel program, or a parallel-serial program based on the type of programs and type of data manipulation operations that it possesses.

DS2 Serial Program - program that contains only serial operations.
- Has only a data program (no thread program).
- Data program contains serial data manipulation operations.
DS2 Parallel Program - program that contains only parallel operations.
- Has a thread program and a data program.
- Thread program contains parallel data manipulation operations.
- Data program contains no data manipulation operations. That is, the data program does not contain any statement besides SET FROM and OUTPUT.
DS2 Parallel-Serial Program - program contains both parallel and serial operations.
- Has a thread program (parallel stage) and a data program (serial stage).
- Thread program contains parallel data manipulation operations.
- Data program contains serial data manipulation operations. That is, the data program contains at least one statement besides SET FROM and OUTPUT.

Unlike the DATA step, you don’t really need to care about what language element is supported or not in CAS. With DS2, you only specify if it runs entirely in CAS or not by using the SESSREF (or SESSUUID) option, and almost all DS2 language elements are supported in CAS, with only a few exceptions (some random functions like RANUNI, RANCAU, etc. that are not supported in DATA step either). Consequently, if you run DS2 in CAS, you only deal with CAS tables. There is no possibility to combine CAS tables with SAS tables.

DS2 can be submitted in CAS using one of the following methods:

A SAS MVA client running a DS2 procedure (a hidden CAS action called “ds2.run” is then run in CAS)
- Data sources and targets are identified using CASLIB names
A CAS action “ds2.runDS2” called from the SAS CASL language specification or Python or Lua, etc.
- Data sources and targets are identified using CASLIB names

Link to documentation: DS2 in CAS

Example:

caslib casnir PATH="/opt/sas/data/" TYPE=path SESSREF=session1 ;

/* load tables in CAS before using DS2... */

proc ds2 sessref=session1 ;
   thread join_th / overwrite=yes ;
      method init() ;
         put 'thread:' _nthreads_= ;
      end ;
      method run() ;
         merge casnir.megacorp_500m(in=a) casnir.right2(in=b rename=(idr=id)) ;
         by id ;
         if a and b ;
      end ;
   endthread ;
   data casnir.result(replication=0) / overwrite=yes ;
      dcl thread join_th t ;
      method run() ;
         set from t threads=2 ;
      end ;
   enddata ;
run ;
quit ;

This DS2 procedure will generate the following message in the log indicating that it runs in CAS:

NOTE: DS2: Running on CAS due to "sessref".

FedSQL

SAS FedSQL is a SAS proprietary implementation of ANSI SQL:1999 core standard. It is the only way for running SQL in CAS.

The FEDSQL procedure in CAS only supports 3 SQL statements:

CREATE TABLE AS
DROP TABLE
SELECT

Like the DS2 language, you specify if FedSQL is running entirely in CAS or not by using the SESSREF (or SESSUUID) option. Consequently, if you run FedSQL in CAS, you only deal with CAS tables. There is no possibility to join CAS tables with SAS tables. In addition, the current version of FedSQL working in CAS only supports CAS tables and no other data source.

FedSQL can be submitted in CAS using one of the following methods:

A SAS MVA client running a FEDSQL procedure (a hidden CAS action called “fedsql.execDirect” is then run in CAS)
- Data sources and targets are identified using CASLIB names
A CAS action “fedsql.execDirect” called from the SAS CASL language specification or Python or Lua, etc.
- Data sources and targets are identified using CASLIB names

Link to documentation: FedSQL in CAS

Example:

caslib casnir PATH="/opt/sas/data/" TYPE=path SESSREF=session1 ;

/* load tables in CAS before using FedSQL... */

proc fedsql sessref=session1 _method _cost ;
   create table casnir.joinResult {options replication=0 replace=true} as
   select * from casnir.megacorp_500m as a left join casnir.right2 as b on a.id=b.idr ;
quit ;

This FEDSQL procedure will generate the following message in the log indicating that it runs in CAS:

NOTE: FEDSQL: Running on CAS due to "sessref".

Regarding join operations, FEDSQL provides 3 types of algorithms:

Hash join
Merge join
Nested-loop join

One can know the join algorithm chosen by FEDSQL by specifying the _METHOD option like in the example above.

Usually, hash or merge joins are chosen by the FEDSQL optimizer when equality joins are requested. The hash/merge decision depends generally on the cardinality of the key variables, on the join type (inner, left, etc.) and/or on the table size.

Nested loop is a last resort when hash or merge cannot be done. For example if the join condition is an inequality or perhaps contains a complex SQL expression.

Transpose

Transpose is one of the favorites operations used by data scientists to prepare their data before designing predictive models using SAS Analytics. Historically, transpose has been one of the most difficult operation to push outside of SAS, using in-database capabilities. It is now possible with 9.4M4 (pre-production in 9.4M3) in Teradata and Hadoop.

That said, the transpose capability was a “must-have” operation in CAS. And fortunately the TRANSPOSE procedure is CAS-enabled. To run a transpose operation in CAS, source and target tables must be CAS tables. Also some statements are required and some others are not supported:

COPY statement is not supported
ID statement is required
KEEP data set option is not supported

Transpose can be submitted in CAS using one of the following methods:

A SAS MVA client running a TRANSPOSE procedure (a hidden CAS action called “transpose.transpose” is then run in CAS)
- Data sources and targets are identified using CAS engine librefs
A CAS action “transpose.transpose” called from the SAS CASL language specification or Python or Lua, etc.
- Data sources and targets are identified using CASLIB names

Link to documentation: Transpose in CAS

Example:

libname libnir cas caslib=casnir ;

proc transpose data=libnir.megacorp_500m out=libnir.megacorp_500m_tr(copies=0) ;
    by id ;
    id facilityregion facilitystate ;
    var revenue ;
run ;

This TRANSPOSE procedure will generate the following message in the log indicating that it runs in CAS:

NOTE: The Cloud Analytic Services server processed the request in 10.614462 seconds.

Conclusion

CAS is not only the new in-memory analytics engine for SAS, it also provides great in-memory and parallel data manipulation features.

There is also a new procedure, and a new language extension, that we haven’t talked about in this blog, and that allow users to perform data manipulation in CAS. It is CASL and the CAS procedure. CASL is the language specification that enables you to access the CAS server. CASL is an integral part of the CAS procedure.

Basically, CASL and the CAS procedure enable users to run CAS actions. Instead of running a DATA step, a DS2 procedure, a FEDSQL procedure or a TRANSPOSE procedure, one can run a CAS procedure to call CAS actions that perform the same operation (dataStep.runCode for DATA step, ds2.runDS2 for DS2, fedSql.execDirect for FedSQL and transpose.transpose for transposing data).

SAS language capabilities to efficiently process data residing in CAS

Free course: Data Literacy Essentials

Get Started