SAS In-Database technology is a huge step forward in terms of large-scale analytics processing. After we deploy a piece of software known as the SAS Embedded Process (or EP for short) into a third-party vendor's database provider, then we can direct it to perform all kinds of useful functions on the data right where it lives, eliminating the need to move tons of data around. Depending on the features we support for a given third-party data provider, then we can perform in-database activities such as data filtering, sorting, ranking, scoring, select Base SAS and SAS/STAT procedures, and more.
The EP's ability to perform these kinds of tasks depends on the feature set provided by the third-party database - and some offer much more that we can leverage than others. Let's look at where you can find this information.
In order to enable the various in-database processing features, we require a specific mix of SAS software products along with supported data providers. The EP can perform some tasks in some environments, but not in others. And it is also important to keep track of software versions as well. SAS has been actively developing the in-database technology over time with new features introduced regularly.
The latest version of all SAS In-Database Technology documents are tracked on a the SAS Support web site at: http://support.sas.com/documentation/onlinedoc/indbtech/. Whenever you need to look up an in-database document, we highly recommend starting at that page to ensure you've got the latest version.
For a description of SAS In-Database Processing features, refer to the SAS® 9.4 In-Database Products: User's Guide currently in its sixth edition. In the Introduction section of the INDBUG document, there's an easy-to-reference table listing which in-database features are available with required SAS software on supported database providers.
Figure 1. A portion of the In-Database Features list with required products.
This [direct link] will take you straight there to the full listing of features.
SAS In-Database technology is very powerful and convenient - but it's limited in large part by the capabilities available in the third-party data provider. There are times when you want to employ more advanced analytics that only SAS can provide. So the SAS Embedded Process has another feature of particular value when also utilizing our SAS In-Memory offerings: it can parallel load data from each node of the database provider to each of the nodes of your target SAS in-memory analytics service.
SAS In-Memory offerings are supported on the backend by the SAS High-Performance Analytics Environment (HPAE) software. HPAE is one of the advanced technologies at SAS which is built using tkgrid technology. HPAE takes full advantage of massively parallel processing (MPP) architectures. It allows us to run analysis on huge volumes of data faster than ever by distributing data to RAM and the analytic processing across many machines. So when we use the Distributed SAS LASR Analytic Server or when we run a high-performance statistics PROC, we're instantiating an instance of HPAE on many servers to do the job.
The latest version of all SAS High-Performance Analytics Infrastructure documents are tracked on the SAS Support web site at: https://support.sas.com/documentation/solutions/hpainfrastructure/index.html. Whenever you need to look up an administration or configurations documents for SAS High-Performance Analytics software, we highly recommend starting at that page to ensure you've got the latest version.
For more information about configuration in support of parallel loading to LASR, refer to the SAS® High-Performance Analytics Infrastructure: Installation and Configuration Guide currently in as version 3.1. In chapter 6 of the HPAICG document, you will find the following illustration explaining parallel load to LASR from the EP on supported database providers.
Figure 2. An illustration of parallel load from the EP listing supported data providers.
The ability to use the EP to stream data from each of the nodes of the data provider where it resides directly over to nodes of your HPAE service (e.g. LASR) without any other software intermediaries means we can load data as quickly as our network and server infrastructure will allow. When you're moving many terabytes of data, this adds up to massive time savings.
The In-Database documentation and the In-Memory documentation each tell their own piece of the overall SAS solution story. But we need to add our own observations to this story as well.
If we look at the INDBUG document, we will find that SAS provides an EP for in-database processing to the following data providers:
But looking in the HPAICG document, we see that only the following data providers can be used for parallel loading of data with the EP:
Summarizing this into a neat table you won't find anywhere else, we have:
Figure 3. A summary table of DBMS for which SAS provides in-database functionality and their ability to load parallel data to SAS HPAE in-memory services (e.g. LASR)
When it comes to Hadoop, there's one more dimension to consider: which distributions/versions of Hadoop can be used with which SAS features? You can find that information in detail on the SAS Support site at: https://support.sas.com/resources/thirdpartysupport/v94/hadoop/hadoop-distributions.html.
The moral of the story? Just because you have an EP doesn't mean that it can parallel load data to LASR! We're looking at you, Aster, DB2, Netezza, and SPD Server! And for "alternative" Hadoop distros, like Apache's reference implementation of Hadoop, it might work, but you're on your own for technical support where there are differences between that and the nearest reference version employed by SAS Technical Support.
SAS Visual Analytics is our flagship product for realizing the power of in-memory processing. It relies exclusively on the SAS LASR Analytic Server for all statistics processing. VA also provides utilities with which we can load data into LASR. Of course, if the EP is available in a supported data provider, then we want VA to instruct the services to perform a parallel load of LASR.
Recently, a question arose from the field about support for Oracle DBMS with VA. When deploying the software, the consultant was surprised to find that Oracle was not a selection in the prompt for "SAS Visual Analytics High-Performance Data Provider":
Figure 4. The SDW prompt for data provider where "Oracle" is not a selectable option.
Based on what we know from the INDBUG and HPAICG documents, we were all a bit surprised by this omission. But GOOD NEWS! After following up with the product management and R&D teams in support of VA, we were able to confirm that VA does work with the SAS EP deployed to an Oracle DBMS and further that it will indeed direct parallel loading of data to LASR.
To enable parallel loading from supported data providers, follow the instructions provided in the HPAICG to configure remote EP information to the HPAE in-memory service.
Rob Collum is a Principal Technical Architect with the Global Architecture & Technology Enablement team in SAS Consulting. When he’s not embedding his processes in some DBMS, he enjoys sampling coffee and m&m’s from SAS campuses around the world.