With SAS Viya, depending on the SAS solution you are using, you can work through visual interfaces or a programming environment to interact with the new in-memory analytics engine called Cloud Analytic Services (CAS). Several programming languages can work with Viya and CAS, including Python, Lua, and Java, but you can still use the SAS language to manipulate huge volumes of data, in memory and in parallel.
From a data management standpoint, the SAS Language provides different ways to manipulate data residing in CAS.
The four main data processing language components that leverage CAS efficiently are the DATA step, DS2, FedSQL, and the TRANSPOSE procedure.
Let’s walk through those language components.
DATA Step
The DATA step is the core of the SAS language. The good news is that it is CAS-enabled and provides many more features than the DATA step did in the LASR or Hadoop worlds.
“People thought that it couldn't be done and we've finally parallelized the DATA step across all of your data.” -- Paul Kent at Strata+Hadoop 2016 (YouTube)
In SAS Viya, the DATA step runs in one of two environments: in the SAS session itself or in CAS.
When the DATA step runs in CAS, it runs either in a single thread or in multiple threads.
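A minimal sketch of the two threading modes, assuming an active CAS session and the libnir libref and megacorp_500m table used in the example further down. The output table names (thread_counts, single_count) are hypothetical. The first step runs with the default multi-threaded behavior and writes one row per thread; the second uses the SINGLE=YES option of the DATA statement to force a single thread.

```sas
/* Default behavior: each thread reads its own share of the rows.  */
/* END= is set per thread, so every thread writes one summary row. */
data libnir.thread_counts(keep=threadid nrows) ;
  set libnir.megacorp_500m end=last ;
  nrows + 1 ;                   /* rows seen by this thread only */
  if last then do ;
    threadid = _threadid_ ;     /* automatic variable: id of the current thread */
    output ;
  end ;
run ;

/* SINGLE=YES: the whole table is processed by one thread, */
/* so the result holds a single row with the full count.   */
data libnir.single_count(keep=nrows) / single=yes ;
  set libnir.megacorp_500m end=last ;
  nrows + 1 ;
  if last then output ;
run ;
```

Comparing the two output tables is a quick way to see how many threads took part and how the rows were distributed among them.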
For a DATA step to run in CAS, the following must be true: all source and target tables must be CAS tables, and every language element used must be supported in the CAS DATA step.
Not all SAS language elements are supported in CAS; the supported elements are a subset of the SAS language. If you use a language element that is not supported in the CAS DATA step, the DATA step automatically runs in SAS rather than generating an error.
In addition, some DATA step language elements that rely on inter-row dependent operations (RETAIN, LAG, DIF, etc.) might return unexpected results in a multi-threaded context, because each thread processes only its own portion of the rows. Others, such as POINT= and OBS=, are not supported.
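One common way to keep retained values meaningful in a multi-threaded step is BY-group processing, because all rows of a BY group are handled by the same thread. The sketch below assumes the libnir libref and megacorp_500m table from the example further down, with a numeric revenue column. The output table name (revenue_by_id) is hypothetical.

```sas
/* Accumulate revenue per id. The sum statement retains total_revenue */
/* across rows, which is safe here because a whole BY group stays on  */
/* one thread and FIRST./LAST. mark the group boundaries.             */
data libnir.revenue_by_id(keep=id total_revenue) ;
  set libnir.megacorp_500m ;
  by id ;
  if first.id then total_revenue = 0 ;   /* reset at the start of each group */
  total_revenue + revenue ;              /* implicit RETAIN */
  if last.id then output ;               /* one row per id */
run ;
```

The same accumulator without the BY statement would produce one independent running total per thread, which is the kind of unexpected result described above.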
A DATA step can be submitted in CAS using one of the following methods:
Link to the documentation: Data Step in CAS
Example:
libname libnir cas caslib=casnir ;
data libnir.result(copies=0 replace=yes) ;
merge libnir.megacorp_500m(in=a) libnir.right2(rename=(idr=id)) ;
by id ;
if a ;
run ;
This DATA step will generate the following message in the log indicating that it runs in CAS:
DS2
DS2 is a SAS proprietary programming language that is appropriate for advanced data manipulation. It includes additional data types, ANSI SQL types, programming structure elements, and user-defined methods and packages. The DS2 language and the DATA step share core features. In addition, DS2 is intended to run in SAS as well as in third-party platforms that support the SAS Embedded Process, such as Hadoop, Teradata and some others.
CAS is a perfect environment for DS2, since the THREAD / DATA program paradigm elegantly maps two-stage, map/reduce-type logic onto the grid for you. THREAD program logic runs wide on all workers, while any logic in the DATA program runs on one worker, gathering data from all the THREAD programs.
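As an illustration of that two-stage pattern, here is a minimal sketch that computes a grand total: each thread sums its own rows, then the DATA program adds up the per-thread partial sums. It assumes the session1 session, the casnir caslib and the megacorp_500m table from the example below, with a numeric revenue column. The thread name (sum_th) and output table (revenue_total) are hypothetical.

```sas
proc ds2 sessref=session1 ;
  /* "Map" stage: runs on all workers, one partial sum per thread */
  thread sum_th / overwrite=yes ;
    dcl double partial ;
    retain partial ;
    keep partial ;
    method init() ;
      partial = 0 ;
    end ;
    method run() ;
      set casnir.megacorp_500m ;
      partial = sum(partial, revenue) ;   /* SUM ignores missing values */
    end ;
    method term() ;
      output ;                            /* one row per thread */
    end ;
  endthread ;

  /* "Reduce" stage: gathers the per-thread rows and combines them */
  data casnir.revenue_total / overwrite=yes ;
    dcl thread sum_th t ;
    dcl double total ;
    retain total ;
    keep total ;
    method init() ;
      total = 0 ;
    end ;
    method run() ;
      set from t ;
      total = sum(total, partial) ;
    end ;
    method term() ;
      output ;                            /* single result row */
    end ;
  enddata ;
run ;
quit ;
```

The explicit OUTPUT statements in the term() methods replace the implicit output at the end of run(), so only the final accumulated values are written.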
A DS2 program is classified as either a serial program, parallel program, or a parallel-serial program based on the type of programs and type of data manipulation operations that it possesses.
Unlike the DATA step, you don’t really need to care about what language element is supported or not in CAS. With DS2, you only specify if it runs entirely in CAS or not by using the SESSREF (or SESSUUID) option, and almost all DS2 language elements are supported in CAS, with only a few exceptions (some random functions like RANUNI, RANCAU, etc. that are not supported in DATA step either). Consequently, if you run DS2 in CAS, you only deal with CAS tables. There is no possibility to combine CAS tables with SAS tables.
DS2 can be submitted in CAS using one of the following methods:
Link to documentation: DS2 in CAS
Example:
caslib casnir PATH="/opt/sas/data/" TYPE=path SESSREF=session1 ;
/* load tables in CAS before using DS2... */
proc ds2 sessref=session1 ;
  /* THREAD program: runs in parallel on all workers */
  thread join_th / overwrite=yes ;
    method init() ;
      put 'thread:' _nthreads_= ;
    end ;
    method run() ;
      merge casnir.megacorp_500m(in=a) casnir.right2(in=b rename=(idr=id)) ;
      by id ;
      if a and b ;
    end ;
  endthread ;
  /* DATA program: gathers the rows produced by the threads */
  data casnir.result(replication=0) / overwrite=yes ;
    dcl thread join_th t ;
    method run() ;
      set from t threads=2 ;
    end ;
  enddata ;
run ;
quit ;
This DS2 procedure will generate the following message in the log indicating that it runs in CAS:
FedSQL
SAS FedSQL is a SAS proprietary implementation of the ANSI SQL:1999 core standard. It is the only way to run SQL in CAS.
The FEDSQL procedure in CAS only supports 3 SQL statements:
Like the DS2 language, you specify if FedSQL is running entirely in CAS or not by using the SESSREF (or SESSUUID) option. Consequently, if you run FedSQL in CAS, you only deal with CAS tables. There is no possibility to join CAS tables with SAS tables. In addition, the current version of FedSQL working in CAS only supports CAS tables and no other data source.
FedSQL can be submitted in CAS using one of the following methods:
Link to documentation: FedSQL in CAS
Example:
caslib casnir PATH="/opt/sas/data/" TYPE=path SESSREF=session1 ;
/* load tables in CAS before using FedSQL... */
proc fedsql sessref=session1 _method _cost ;
create table casnir.joinResult {options replication=0 replace=true} as
select * from casnir.megacorp_500m as a left join casnir.right2 as b on a.id=b.idr ;
quit ;
This FEDSQL procedure will generate the following message in the log indicating that it runs in CAS:
Regarding join operations, FedSQL provides three types of join algorithms: hash join, merge join, and nested loop join.
You can see which join algorithm FedSQL chose by specifying the _METHOD option, as in the example above.
Usually, the FedSQL optimizer chooses a hash or merge join when an equality join is requested. The hash/merge decision generally depends on the cardinality of the key variables, on the join type (inner, left, etc.) and/or on the table size.
Nested loop is a last resort when a hash or merge join cannot be done, for example if the join condition is an inequality or contains a complex SQL expression.
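To compare the plans for the two kinds of join condition, a sketch like the following can be used with the _METHOD option. The first query reuses the tables from the example above. The second uses hypothetical small tables (small_events, small_ranges) and hypothetical columns (amount, lower_bound, upper_bound), since a join without an equality condition is best tried on small data first. Which algorithm is reported depends on the optimizer, so no particular output is assumed here.

```sas
proc fedsql sessref=session1 _method ;
  /* Equality join: a candidate for a hash or merge join */
  select count(*) as n
    from casnir.megacorp_500m as a
    inner join casnir.right2 as b
    on a.id = b.idr ;

  /* Range condition only, no equality: a candidate for a nested loop. */
  /* Table and column names below are hypothetical.                    */
  select count(*) as n
    from casnir.small_events as e
    inner join casnir.small_ranges as r
    on e.amount >= r.lower_bound and e.amount < r.upper_bound ;
quit ;
```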
Transpose
Transpose is one of the favorite operations of data scientists preparing their data before designing predictive models with SAS Analytics. Historically, transpose has been one of the most difficult operations to push outside of SAS using in-database capabilities. This is now possible in Teradata and Hadoop with 9.4M4 (pre-production in 9.4M3).
That said, the transpose capability was a “must-have” operation in CAS. And fortunately the TRANSPOSE procedure is CAS-enabled. To run a transpose operation in CAS, source and target tables must be CAS tables. Also some statements are required and some others are not supported:
Transpose can be submitted in CAS using one of the following methods:
Link to documentation: Transpose in CAS
Example:
libname libnir cas caslib=casnir ;
proc transpose data=libnir.megacorp_500m out=libnir.megacorp_500m_tr(copies=0) ;
by id ;
id facilityregion facilitystate ;
var revenue ;
run ;
This TRANSPOSE procedure will generate the following message in the log indicating that it runs in CAS:
Conclusion
CAS is not only the new in-memory analytics engine for SAS, it also provides great in-memory and parallel data manipulation features.
There is also a new procedure and a new language extension, not covered in this blog, that allow users to perform data manipulation in CAS: the CAS procedure and CASL. CASL is the language specification that enables you to access the CAS server, and it is an integral part of the CAS procedure.
Basically, CASL and the CAS procedure enable users to run CAS actions. Instead of running a DATA step, a DS2 procedure, a FEDSQL procedure or a TRANSPOSE procedure, one can run a CAS procedure to call CAS actions that perform the same operation (dataStep.runCode for DATA step, ds2.runDS2 for DS2, fedSql.execDirect for FedSQL and transpose.transpose for transposing data).
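A minimal sketch of that action-based route, assuming the session1 session, the casnir caslib and the megacorp_500m table used in the earlier examples. The output table name (first_per_id) is hypothetical. Inside the action code, tables are referenced by caslib name rather than through a SAS libref.

```sas
proc cas ;
  session session1 ;              /* send the actions to this CAS session */
  loadactionset "fedSql" ;        /* make sure the FedSQL action set is available */

  /* DATA step code passed as a string to the dataStep.runCode action: */
  /* keep the first row of each id                                     */
  dataStep.runCode /
    code = "data casnir.first_per_id ; set casnir.megacorp_500m ; by id ; if first.id ; run ;" ;

  /* SQL passed as a string to the fedSql.execDirect action */
  fedSql.execDirect /
    query = "select count(*) as n from casnir.first_per_id" ;
quit ;
```

Because the work is expressed as CAS actions, the same two calls can be issued from the other supported clients (Python, Lua, Java) without any SAS language wrapper.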