
Seriously Serial or Perfectly Parallel Data Transfer with CAS


SAS Cloud Analytic Services is the engine of our in-memory analytics offerings. It promises massively scalable performance on huge volumes of data. But getting all of that data loaded into CAS (or saved back out later) is another scalability challenge to tackle. CAS addresses that challenge by supporting a wide range of data sources, starting with native SAS formats and extending out to direct loading from many third-party data providers. Along the way, we also need to carefully plan and evaluate the architecture and infrastructure to ensure everything moves along as fast and efficiently as intended.

 

As of Viya 3.3, the language we use to direct CAS in its data transfer efforts has evolved considerably. We may explicitly configure CAS to use a serial data transfer method, yet the resulting transfer is effectively parallel. Alternatively, we may set up CAS and the other related components for a fully parallel transfer, only for CAS to deem it necessary to downgrade to a serial approach.

 

So let's crack open this case and look at determining exactly what kind of data transfer we're asking CAS to perform, verifying that's what it did, and understanding why it worked.


 

CAS DEPLOYMENT

The CAS Server can be deployed in two modes: SMP and MPP. In SMP mode, the entire CAS Server runs on a single host machine. In MPP mode, the CAS Controller(s) and Workers are each deployed to separate hosts, where they work together to provide a single logical CAS Server.

 

Typically, MPP CAS is our default topology when planning for parallel data transfer. The idea is that the CAS Workers each have their own I/O connection to the data, which ideally multiplies the rate at which data is loaded into CAS.

 

But keep in mind that other environmental factors will certainly have a bearing on the data transfer I/O performance you get. Just because we have multiple CAS Workers querying a source for data doesn't necessarily mean we'll get that multiplied I/O rate. If the infrastructure hosting the data is effectively a single point of contact, then we may see only the throughput rates expected when working with SMP CAS.

 

And don't knock SMP CAS. The use of SMP CAS doesn't relegate us to slow data transfer. With the appropriate hardware, loading data into SMP CAS can be very fast as well. Ultimately, this all comes down to the infrastructure available, the software that's licensed, and the timing requirements of the project.
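
If you're unsure which topology a given deployment uses, one quick check is to list the CAS hosts. Here's a minimal sketch using the builtins.listNodes action (the session name mysess1 is an assumption, matching the examples later in this article):

proc cas;
   session mysess1;     /* assumes an existing CAS session named mysess1 */
   builtins.listNodes;  /* one row per CAS host: SMP shows only the Controller, MPP adds each Worker */
quit;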

 

NATIVE CASLIB TYPES

There are several different types of caslibs we can use for bringing data into CAS. The choice of which to use depends on the data source, storage format, and connection topology.

 

➤ Path caslib - not always serial:

 

The path type of caslib is typically used to reference data on disk local to the CAS Controller. The Controller performs a serial data transfer, reading the data from the location specified and then distributing that data to each of the CAS Workers.

 

caslib mypath sessref=mysess1
              datasource=(srctype='path',
              path="/path/to/data");

 

Path caslibs can be used to read several file types and transfer them serially to CAS. But for some of those file types, we can specify optional parameters that enable parallel transfer of data to CAS, where the CAS Controller coordinates and directs the CAS Workers to grab their specific allotment of rows from disk.

 


Path caslib

| Filetype  | Mode               | Parallel Path caslib options |
|-----------|--------------------|------------------------------|
| SASHDAT   | Serial only        | none                         |
| SAS7BDAT  | Serial or Parallel | dataTransferMode='parallel'  |
| CSV       | Serial only        | none                         |
| XLS, XLSX | Serial only        | none                         |

 

In order for CAS to perform a parallel operation on SAS data sets (or CSV files) using the path type of caslib, all CAS hosts (not just the CAS Controller) must have the directory path specified in the caslib mounted to reference a common shared filesystem.
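
Putting that together, here's a sketch of a path caslib prepared for parallel transfer of SAS data sets - assuming "/shared/data" is a hypothetical mount point for a filesystem shared identically across all CAS hosts:

caslib mypath2 sessref=mysess1
              datasource=(srctype='path',
              dataTransferMode='parallel',
              path="/shared/data");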

 

It's interesting to note that the SASHDAT file type - specifically designed for parallel, distributed, high-performance data access - cannot participate in parallel transfer using the path type of caslib. To get that, we need to look at DNFS.

 

➤ DNFS caslib - usually parallel:

 

The DNFS type of caslib is a new parallel data transfer technology introduced with CAS for accessing data using conventional disk storage solutions. The idea with DNFS is that a shared filesystem is mounted to identical directory paths on the CAS Controller and all of the CAS Workers. The CAS Controller can then coordinate and direct the CAS Workers to grab their specific allotment of rows from disk.

 

caslib mydnfs sessref=mysess1
              datasource=(srctype='dnfs',
              path="/path/to/data");

 

Notice that the directory "/path/to/data" is assumed to exist identically on all CAS hosts and is expected to represent a single shared filesystem with data for CAS.
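
With the DNFS caslib defined, the load itself looks like any other - it's the caslib type that makes the transfer parallel. A sketch, assuming a hypothetical mytable.sashdat file in the shared path:

proc casutil sessref=mysess1;
   /* parallel transfer: each Worker reads its own allotment of blocks from the shared filesystem */
   load casdata="mytable.sashdat" incaslib="mydnfs"
        casout="mytable" outcaslib="mydnfs";
quit;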

 


DNFS caslib

| Filetype | Mode          | Serial DNFS caslib options |
|----------|---------------|----------------------------|
| SASHDAT  | Parallel only | none                       |
| CSV      | Parallel only | none                       |

 

There's no way to specify a serial-only data transfer for the DNFS type of caslib. The dataTransferMode caslib option isn't recognized.

 

But just because CAS will attempt a parallel transfer, that doesn't mean it'll actually happen. For example, if the filesystem backing DNFS is just plain old NFS sharing a single hard disk drive that hosts our SASHDAT file, then we're effectively stuck with a serial process at the physical level. So bear in mind that there's only so much SAS software can be responsible for in terms of I/O performance - at some point, where the rubber meets the road, we need to ensure that our physical environment provides the I/O throughput our solution requires for adequate performance.

 

➤ HDFS caslib - perfectly parallel:

 

Like LASR before it, CAS can access data directly from Hadoop's Distributed File System (HDFS). With the HDFS caslib, the CAS Controller coordinates and directs the CAS Workers to grab their specific allotment of blocks from HDFS.

 

caslib myhdfs sessref=mysess1
              datasource=(srctype='hdfs',
              path="/hdfs-path/to/data");

 


HDFS caslib

| Filetype | Mode          | Serial HDFS caslib options |
|----------|---------------|----------------------------|
| SASHDAT  | Parallel only | none                       |
| CSV      | Parallel only | none                       |

 

Like DNFS, there's no programming option to specify a serial-only transfer of data from HDFS. And due to the nature of HDFS, it's unlikely we'll hit a serial-only aspect of the environment. Our biggest concern might be the potential for network or process bottlenecks in some remote/asymmetric deployments, but as a rule, we should see multiple, parallel I/O channels employed when CAS works with HDFS caslibs.

 

EXTERNAL DATA SOURCES

CAS can also work with non-native data from external data providers. The data we want for CAS is stored in a non-SAS format, usually in a remote system. To get to that data, we need specific software components.

 

➤ SAS Data Connector - serial, except when it's parallel:

 

CAS can communicate with external data providers by implementing the appropriate SAS Data Connector product. We offer Data Connectors for Hadoop, Impala, Oracle, PostgreSQL, ODBC, PC Files, and more. The default mode of operation employed by CAS for SAS Data Connectors is serial transfer such that the CAS Controller connects to the data source, queries the data, and distributes the results to the CAS Workers.

 

caslib myhive sessref=mysess1
              datasource=(srctype='hadoop',
              dataTransferMode='serial',
              server="hive.site.com",
              username="myuserid",
              schema="myschema" ... );

 

Note that for caslib source types referencing external data providers, dataTransferMode='serial' is the default, and it directs CAS to use the SAS Data Connector associated with the data source type.

 

But "serial" isn't always serial. When using SAS Data Connectors, we can direct CAS to perform a multi-node transfer. The CAS Controller can then coordinate and direct the CAS Workers to grab their specific allotment of rows from the data provider.

 


| External Data Source      | Mode                 | Multi-node PROC CASUTIL options                  |
|---------------------------|----------------------|--------------------------------------------------|
| ODBC                      | Serial only          | none                                             |
| PC Files                  | Serial only          | none                                             |
| Amazon Redshift           | Serial or Multi-node | NumReadNodes=0 (or >1), NumWriteNodes=0 (or >1)  |
| DB2 for UNIX              | Serial or Multi-node | NumReadNodes=0 (or >1), NumWriteNodes=0 (or >1)  |
| Hadoop                    | Serial or Multi-node | NumReadNodes=0 (or >1), NumWriteNodes=0 (or >1)  |
| Impala                    | Serial or Multi-node | NumReadNodes=0 (or >1), NumWriteNodes=0 (or >1)  |
| Microsoft SQL Server      | Serial or Multi-node | NumReadNodes=0 (or >1), NumWriteNodes=0 (or >1)  |
| Oracle                    | Serial or Multi-node | NumReadNodes=0 (or >1), NumWriteNodes=0 (or >1)  |
| PostgreSQL                | Serial or Multi-node | NumReadNodes=0 (or >1), NumWriteNodes=0 (or >1)  |
| SAP HANA                  | Serial or Multi-node | NumReadNodes=0 (or >1), NumWriteNodes=0 (or >1)  |
| SAS Base Engine Data Sets | Serial or Multi-node | NumReadNodes=0 (or >1), NumWriteNodes=0 (or >1)  |
| SAS SPDE Files            | Serial or Multi-node | NumReadNodes=0 (or >1), NumWriteNodes=0 (or >1)  |
| Teradata                  | Serial or Multi-node | NumReadNodes=0 (or >1), NumWriteNodes=0 (or >1)  |

 

For numReadNodes and numWriteNodes, a value of zero (0) directs CAS to perform a multi-node data transfer where all CAS Workers query the data provider for their allotment of data. A value greater than one (>1) directs CAS to have that many CAS Workers participate in the multi-node action. And the default value of one (1) requests the standard serial transfer.
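
For example, here's the serial Hive caslib from earlier, extended for multi-node transfer (numReadNodes and numWriteNodes specified as caslib data source options; the server and schema values are placeholders as before):

caslib myhive sessref=mysess1
              datasource=(srctype='hadoop',
              dataTransferMode='serial',
              numReadNodes=0,   /* 0 = all CAS Workers read */
              numWriteNodes=0,  /* 0 = all CAS Workers write */
              server="hive.site.com",
              username="myuserid",
              schema="myschema" ... );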

 

Even with parameters directing CAS to attempt a multi-node transfer, it's possible the resulting action will be performed serially. For example, if the target table doesn't contain a suitable numeric variable, as required for multi-node transfer, then CAS will automatically revert to serial transfer. CAS is resilient to many other multi-node considerations as well, gracefully downgrading to the next functional level of service as appropriate.

 

➤ SAS Data Connect Accelerator - perfectly parallel:

 

SAS currently offers Data Connect Accelerator (DCA) products to work with Hadoop and Teradata. The DCA works in conjunction with our SAS In-Database technology in the form of the SAS Embedded Process (EP). The EP is a specialized software component deployed inside the remote data provider. Its primary purpose is to "take the analytics to the data" to reduce data movement and duplication. SAS offers data quality, scoring, and code acceleration functionality with our In-Database products.

 

To access the full power of products like SAS Visual Analytics, SAS Visual Statistics, and SAS Visual Data Mining and Machine Learning, data must be brought over into CAS. The EP can perform that function as well. Each instance of the EP in a data provider can be directed to send its allotment of rows to all of the CAS Workers, ensuring even distribution of the data. By default, this is a parallel transfer of data with all CAS Workers participating.

 

caslib myhive sessref=mysess1
              datasource=(srctype='hadoop',
              dataTransferMode='parallel',
              server="hive.site.com",
              username="myuserid",
              schema="myschema" ... );

 

Specifying the caslib option for dataTransferMode='parallel' directs CAS to use the SAS Data Connect Accelerator.

 


| External DBMS         | Mode          | Serial DBMS caslib options |
|-----------------------|---------------|----------------------------|
| Hadoop                | Parallel only | none                       |
| SPDE (only on Hadoop) | Parallel only | none                       |
| Teradata              | Parallel only | none                       |

 

PLAYING NICE TOGETHER

The dataTransferMode option for caslibs takes three values - and these may cause confusion when determining whether data is actually transferred to CAS serially or in parallel. Those values are translated by CAS to drive its use of specific software products. Whether those products can deliver on the plain-English meaning of the specified value is determined later.

 


| dataTransferMode= | CAS will use                                                   | Transfer Mode                                                                                       |
|-------------------|----------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|
| 'parallel'        | SAS Data Connect Accelerator                                   | Only parallel                                                                                       |
| 'serial'          | SAS Data Connector                                             | Serial, or Multi-node if indicated. Automatically detects and gracefully falls back: Multi-node > Serial |
| 'auto'            | SAS Data Connect Accelerator; if that fails, SAS Data Connector | Parallel if possible. Automatically detects and gracefully falls back: Parallel > Multi-node > Serial |

 

When 'parallel' is specified as the dataTransferMode, CAS is explicitly locked to working only with the SAS Data Connect Accelerator and the SAS Embedded Process. If the EP cannot perform multi-channel I/O to the CAS Workers, then the transfer attempt stops with an error in the SAS log.

 

As you can see, in SAS' technical terminology, multi-node data transfer falls under the 'serial' label. However, the ideal multi-node configuration will indeed allow multiple, parallel I/O channels for transferring data from the data provider directly to each of the CAS Workers. Real-life implementations, however, are not always ideal. CAS is programmed to verify multi-node capabilities and, if needed, it will reduce the number of participating Workers to proceed with the transfer.

 

Use 'auto' as the dataTransferMode for the most flexibility. CAS will attempt to "do the right thing" by gracefully degrading from parallel to multi-node all the way down to serial if necessary.
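
In other words, the same Hive caslib sketched with 'auto' - CAS will try the SAS Data Connect Accelerator first and fall back to the SAS Data Connector if it must:

caslib myhive sessref=mysess1
              datasource=(srctype='hadoop',
              dataTransferMode='auto',
              server="hive.site.com",
              username="myuserid",
              schema="myschema" ... );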

 

RECOMMENDED READING

The following SAS documentation is very helpful in programming your data movement for CAS.

 

In particular, see the SAS® Cloud Analytic Services 3.3: User's Guide.
