With SAS Viya 2020.1 and later, we are introducing a major paradigm shift in how we package and deliver SAS software. In particular, SAS is delivering SAS Viya software in Docker container images for deployment into Kubernetes-managed clusters. While there's a steep learning curve to getting on board with this new approach, the benefits are manifold: the ability to rapidly set up, deploy, scale, monitor, and manage Viya deployments.
But let's focus on the containerization aspect here. Containers can be thought of as small virtual machines which have been stripped down to their core essence for running the apps they deliver - everything you need, nothing you don't. There's a base OS, some required extensions, and of course, the software you're including.
SAS Cloud Analytic Services in SAS Viya will be delivered as Docker container images for both SMP operation (single host) and MPP operation (multiple hosts, controllers and workers). That decision has implications for CAS' access to a long-time data source. Let's see what impact this has.
SAS Cloud Analytic Services is the third major iteration of our in-memory analytics technology. Prior to CAS we had SAS High Performance Analytics Server and then the SAS LASR Analytics Server. LASR in particular introduced the SASHDAT data format. SASHDAT is a proprietary data structure optimized for distributed analytic workloads. When SASHDAT was first introduced, it required a Distributed LASR Server (a.k.a. MPP operation) and symmetrically co-located Apache HDFS storage.
LASR employed a nifty trick working with HDFS known as short-circuit local reads. In essence, LASR went beyond the standard HDFS client APIs to get to SASHDAT on disk. You might recall we also installed some secondary software in Hadoop known as the SAS Plugins for Hadoop. LASR used those plugins to access the blocks of SASHDAT on the HDFS disks directly. Bypassing the network-based client APIs provided a marked improvement in I/O performance - and led to SASHDAT being touted as the fastest, most efficient way to load data into LASR.
In Viya 3, CAS inherited those LASR capabilities and took them further, including the ability to work with SASHDAT files on remote HDFS. This offered a lot more flexibility to our customers to leverage their existing Hadoop in support of SAS data. However, the SAS Plugins for Hadoop were still required in the target HDFS. That's because CAS would initiate SSH connections to the HDFS hosts to fire up those plugins - which used those HDFS short-circuit reads - to bypass the slower HDFS client API and access the SASHDAT data in HDFS as fast as possible.
In Viya 3, CAS also has the ability to work with SASHDAT on standard, POSIX-compliant file systems, and it can access that data either serially (SMP or MPP operation, using a PATH caslib) or in parallel (MPP operation, using a DNFS caslib). CAS can even work with SASHDAT files in cloud object storage using the S3 protocol.
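As a rough sketch, the three flavors look like this in the SAS CASLIB statement (all caslib names, paths, and the bucket are illustrative, and the S3 caslib also needs credentials appropriate to your environment):

```sas
/* All names and paths below are illustrative. */
caslib pathlib datasource=(srctype="path") path="/shared/sasdata";   /* serial access; SMP or MPP  */
caslib dnfslib datasource=(srctype="dnfs") path="/shared/sasdata";   /* parallel access; MPP only  */
caslib s3lib   datasource=(srctype="s3" bucket="my-bucket"
                           objectpath="/sashdat/");                  /* SASHDAT in S3 object storage */
```

The PATH and DNFS caslibs can point at the same directory; the difference is in how CAS workers divide up the I/O.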
And as time has moved on, Hadoop's prominence as a low-cost, robust storage solution is waning as competitive technologies emerge and mature.
The proven practice in building container images is to aggressively remove any elements not absolutely necessary for their operation. Containers are meant to be reasonably small and portable. Avoiding unneeded components keeps things light and tidy, and reduces complexity, which decreases the chance of unexpected problems. It also improves security by eliminating known (and unknown) communication channels.
The container images which deliver the SAS Viya software will not include any SSH software, neither client nor server. Eliminating the SSH server reduces the attack surface of the container - one less service listening at a port. With Docker and Kubernetes, it's still possible to gain command-line access to the containers through secure channels, so an SSH server isn't needed anyway. And since none of the SAS Viya containers will run SSH servers, we don't need SSH clients for them to communicate with each other either. Remember, communication between Viya services is RESTful HTTP, usually with TLS encryption.
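For example, Kubernetes can open an interactive shell inside a running container without any SSH involved (the namespace, pod, and container names here are illustrative of a typical Viya deployment):

```shell
# Open an interactive shell in a running CAS container -- no SSH server required.
# Namespace, pod, and container names are illustrative.
kubectl -n sas-viya exec -it sas-cas-server-default-controller -c cas -- /bin/bash
```

The connection is authenticated and encrypted by the Kubernetes API server, so nothing extra needs to run inside the container.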
And this is probably obvious, but SAS will not include any Apache Hadoop software in our Viya container images either.
In SAS Viya 2020.1 and later, CAS is effectively cut off from working directly with SASHDAT in HDFS.
For remote Hadoop, CAS no longer has the SSH capability to directly fire up the SAS Plugins for Hadoop to perform the HDFS short-circuit reads used to access SASHDAT in HDFS. And due to the design approach of containerization, the ability to deploy HDFS symmetrically alongside CAS on the same "machine" (a.k.a. co-located) isn't possible, so there is no local HDFS anymore either. CAS never relied on the standard HDFS client APIs for access to SASHDAT, which means that's also not a viable route.
This is a remarkable turn of events. SASHDAT was first introduced only on HDFS and many of our SAS 9 and SAS Viya 3.x customers will have large data volumes stored there. How do they get to them with CAS in SAS Viya?
In SAS Viya 2020.1 and later, CAS will need help to access SASHDAT in HDFS. A broker. A middleman. Some kind of intermediary.
Let's start with what created the SASHDAT files in the first place.
A distributed SAS 9.4 LASR Server (MPP) can access SASHDAT in co-located HDFS. CAS offers a built-in platform data connector to communicate with LASR directly, which is surfaced by the LASR-type of caslib. With the LASR caslib, CAS can read in-memory data out of LASR directly (but not write back), using either serial or parallel (the default) movement of the data for the best transfer. However, in SAS Viya 2020.1 there's a new requirement for this to work with CAS: the tables in LASR must have a signer so CAS can check permissions against the SAS LASR Authorization Service. In SAS Viya 3, this was optional because, for LASR tables without a signer, CAS could SSH to the LASR host and check permissions directly. That option is gone in SAS Viya 2020.1. Of course, SAS 9 could also come into play by exporting data from LASR and saving it in standard SAS7BDAT format (or elsewhere). And guess what - CAS has the ability to read SAS7BDAT from disk in parallel, too - so it can load into memory quite quickly as well.
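A minimal sketch of that path might look like this, assuming a hypothetical LASR host and table (the server, port, signer URL, and table names are all illustrative, and your LASR server must be running with those tables loaded):

```sas
/* Illustrative values throughout. Defines a LASR-type caslib pointing at a
   distributed SAS 9.4 LASR server, then loads an in-memory LASR table into CAS. */
caslib lasrlib datasource=(srctype="lasr"
                           server="lasrhead.example.com"
                           port=10010
                           signer="https://sas94.example.com/SASLASRAuthorization");

proc casutil;
   load casdata="SALES2019" incaslib="lasrlib"
        casout="sales2019" outcaslib="casuser";   /* pull the LASR table into CAS memory */
quit;
```

Once the table is in CAS, it can be saved to a format and location that CAS in SAS Viya 2020.1 can reach.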
CAS in SAS Viya 3.x can access SASHDAT in HDFS. Besides its own modern SASHDAT format, CAS in SAS Viya 3.x can also read (only) the legacy SASHDAT files in HDFS produced by LASR (assuming all column sizes are compatible for transcoding to UTF-8 and that the SAS Viya Plugins for Hadoop are deployed - they're backward-compatible with LASR). There are some exceptions, but this works for most LASR tables in HDFS. This means that CAS in SAS Viya 3.x can read the data out of SASHDAT files on HDFS and then save it somewhere else (in modern SASHDAT or some other format) so that CAS in SAS Viya 2020.1 is able to access it as well. See Gerry Nelson's post Getting your data from LASR to CAS for more information.
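Sketched out, that migration step might look like this on CAS in SAS Viya 3.x (all paths, directories, and table names are illustrative; the HDFS caslib assumes the SAS Plugins for Hadoop are in place on the Hadoop cluster):

```sas
/* Illustrative names and paths. Runs on CAS in SAS Viya 3.x, which can still
   read SASHDAT in HDFS via the SAS Plugins for Hadoop. */
caslib hdfslib datasource=(srctype="hdfs") path="/user/sas/hdat"
       hadoopconfigdir="/opt/hadoop/conf" hadoopjarpath="/opt/hadoop/jars";
caslib nfslib  datasource=(srctype="dnfs") path="/shared/sasdata";

proc casutil;
   load casdata="bigtable.sashdat" incaslib="hdfslib"     /* read legacy SASHDAT from HDFS */
        casout="bigtable" outcaslib="nfslib";
   save casdata="bigtable" incaslib="nfslib"
        outcaslib="nfslib" replace;                       /* write modern SASHDAT to the file system */
quit;
```

The DNFS target directory then holds modern SASHDAT that CAS in SAS Viya 2020.1 can read in parallel.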
Hasn't it been weird that CAS offers a direct-to-LASR caslib, but not a CAS-to-CAS caslib to transfer in-memory data from one CAS server to another? Well, good news! A CAS-type of caslib is coming soon.
We don't have a ship-date yet, but SAS is working on a new caslib to enable one CAS to read data from another CAS. And it will work for CAS in SAS Viya 2020.1 to access data from CAS in SAS Viya 3.x using either serial or parallel movement. We'll have more details when this feature is closer to shipping.
I for one was surprised to learn that CAS in SAS Viya 2020.1 and later won't be able to directly access SASHDAT files in HDFS - the very first place they were ever stored and the only(!) place that SAS LASR Analytics Server could use. A lot of software design comes down to trade-offs, especially as technology matures over time. Consider Hadoop's slide as cloud-based object storage increases in popularity. Indeed, SAS already provides CAS with the ability to access its SASHDAT files in ADLS and S3 (and soon GCS as well).
Keeping SSH and Hadoop out of the SAS Viya containers will have a long-term positive effect in keeping the containers optimized and improving their security. And the negative impact is minimal, as alternatives to the classic approach are readily available.
The workarounds shown here for CAS in SAS Viya to reach SASHDAT data in HDFS assume an upgrade situation. But of course, only SAS technology (specifically LASR and CAS) is capable of creating SASHDAT files in the first place. SAS offers several potential avenues by which previous versions of SAS Viya 3.x and SAS 9 can move SASHDAT data out of HDFS to a destination that is accessible (and performant) for CAS in SAS Viya 2020.1 to work with.
Special thanks to Jason Secosky, Steve Krueger, Melissa Corn, and Gordon Keener in SAS R&D for their time, efforts, and expertise sharing information on this topic.