4 Approaches for Parallel Data Loading to CAS

2 Likes

When working with large amounts of data, it becomes increasingly important to improve the speed and efficiency of loading data to the in-memory analytics server. With the release of the SAS Viya platform and the CAS Server, we have two parallel data loading techniques carried over from SAS 9.4 and LASR as well as some all new technologies which promise tremendous capability and future potential.

Parallel vs. Serial

As a reminder, if parallel loading mechanisms are not available, then serial loading will likely take place. When serial loading occurs, the CAS Controller directs the incoming data stream such that table rows are distributed evenly to each of the CAS Workers.

While serial loading cannot offer the raw performance of parallel loading, it does provide support for the widest range of potential data sources, including any data source which Base SAS can reach.

Parallel loading takes place when data is loaded directly into each of the CAS Workers.

The distribution of data to the CAS Workers is usually driven by some aspect of the data source - but can be overridden by the CAS Controller when directed.

1. Parallel: SASHDAT on HDFS

If you've worked with the SAS LASR Analytic Server and the SAS 9.4 platform, then you're familiar with the SASHDAT format. For LASR, SASHDAT requires that the LASR Root and Worker Nodes are placed symmetrically alongside the HDFS NameNode and DataNodes, respectively. Data is lifted within each HDFS host from disk to RAM, without any network communication between nodes - effectively making this the fastest and most efficient way to (re-)load data into LASR.

For CAS on the Viya platform, some of this still holds true. But there are some notable changes:

The legacy SASHDAT file format from LASR stored in HDFS can be read by CAS assuming all of the column sizes are compatible for transcoding to UTF-8. If they're not, then you'll need to SAVE the data to the new CAS SASHDAT file format which will increase the columns' variable length and transcode to UTF-8 so that the CAS actions can then work with the data.
CAS can work with SASHDAT in HDFS even when asymmetrically placed with (i.e. on a subset of) the colocated Hadoop nodes - so we no longer need to license the CAS Server for all HDFS nodes of a Hadoop cluster to work with SASHDAT there.
CAS can even read and write its SASHDAT files to a remotely deployed instance of HDFS.

The symmetry of CAS colocation with HDFS still has an impact, though. When symmetric, data is loaded the same as for LASR - staying within each host and not requiring any network traffic. Furthermore, CAS will default to replicating the in-memory data (following the replication pattern on disk in HDFS) to provide failover functionality to ensure continued service even if a CAS Worker goes down.

On the other hand, when the CAS deployment is asymmetrically or remotely to HDFS, then blocks of data for the table must be copied over the network using SSH from the HDFS nodes where no CAS Workers are running. And because CAS may be running on only a subset of nodes, the default behavior is not to replicate in-memory blocks - which means that there is no failover protection by default (see the SAS Viya Deployment Guide). Right now, using the using the COPIES argument for the LOAD statement in the CASUTIL procedure won't really help - it just creates memory-mapped blocks to both HDFS and the CAS_DISK_CACHE location which CAS failover doesn't handle (yet).

2. Parallel: Data Connect Accelerators

Again, comparing back to the SAS 9.4 platform, we could use the SAS Embedded Process in supported remote data providers to peform parallel loading of data directly to the LASR Workers.

For the Viya platform, we can do the same thing with the EP by also employing the SAS Data Connect Accelerators in CAS.

In this illustration, we're showing the EP deployed to a Hadoop cluster. This allows parallel data loading to occur from supported file types stored in HDFS. The distribution of data to the CAS Server in this scenario is managed by the EP. It works out the ratios so that it evenly distributes blocks (not rows) of data to each of the CAS Workers.

Teradata is another supported provider for which we have Data Connect Accelerators available - and more will be forthcoming in the future. Development of new Data Connect Accelerators as well as expanding the capabilities of existing ones is constantly underway.

3. Parallel (and Serial): DNFS

And now we're on to a new parallel loading technology which is introduced with the Viya platform: DNFS. The acronym references the concept of "distributed network file system" - but it's not limited just to NFS-style solutions. Regardless of how the physical disk is attached to each of the CAS Workers, the point is that all CAS Workers utilize the same identical directory path to get to data which is stored in a commonly-accessible location (NAS, SAN, JBOD, CFS, etc.).

Data stored using DNFS is also kept as SASHDAT, but embedded inside a special file container known as SASDNFS. Even though the container is SASDNFS, the file type you'll see saved to disk is still ".sashdat" - but to be clear, this cannot be straight-copied into HDFS for use as SASHDAT there via the HDFS caslib.

The really cool thing about the SASDNFS container format is that is also works with the PATH caslib type. When accessing SASDNFS files, the PATH caslib is for serial loading only (usually) - either for an SMP CAS Server or by the MPP CAS Controller (for subsequent distribution to each of the CAS Workers). And so this means that SASDNFS files can be copied using basic operating system commands - like cp, scp, rsync, etc. - to move/duplicate the data on disk between hosts of SMP and MPP CAS Servers.

4. Parallel (and Serial): Base SAS Engine Data Sets

Yes, that's right! CAS has the ability to load SAS7BDAT files in parallel! For this to work, you need to:

Place the SAS dataset on a shared filesystem mounted on all CAS Workers;
Ensure that all CAS Workers have direct access to the SAS7BDAT file at the same physical path;
Specify a caslib datasource with srctype="path" and dataTransferMode="parallel" or "auto".

If only the CAS Controller has direct access to the SAS dataset -and- if "auto" is the value for the dataTransferMode parameter, then it can fall back to performing a serial transfer if the CAS Workers don't have the necessary direct access. Otherwise, you can employ the more conventional route of using SAS to read the dataset and send it over the network to the CAS Controller.

The future is here

As we continue to see, CAS builds on the hard-won lessons gained from LASR and HPA Server. DNFS is a particularly exciting development as it provides the ability for us to implement parallel loading techniques without relying on coordinated integration with external data providers. The speed of DNFS will be proportional to the investment your site makes in the shared disk technology - don't expect miracles from a single spindle surfaced over basic NFS. The point is that SAS customers can employ high-performance disk solutions which they're already familiar with to enable high-speed parallel data loading with the DNFS caslib.

A word of thanks

I just wanted to give a quick hat tip of appreciation to Brian Bowman, Steve Krueger, Mark Gass, and Zack Helms for providing many of the details which make this blog post useful.

You guys rock!

--

Rob Collum is a Principal Technical Architect with the Global Architecture & Technology Enablement team in SAS Consulting. In between parallel data loading runs, he also enjoys sampling coffee and m&m’s from SAS campuses around the world.

robstockray · ‎02-22-2018

This is very useful and important information for planning a CAS deployment. The installation documentation should include this kind of concise information to help understand the storage options and trade offs.