Contemplating shared file systems for SAS

SAS software utilizes disk in a myriad of ways. And providing the appropriate kind of disk storage for SAS is not a one-size-fits-all approach. We've already looked at disk I/O throughput, but there are other factors, too. So, in this post let's talk about how shared file systems can work with SAS.

A shared file system, at its simplest, provides access to a single set of files (and directories) to multiple computer hosts. To guide customers properly, we need to understand the trade-offs in how SAS uses shared file system technology.

The Diversity of Shared File System Technologies

File systems are typically provided in the software layer on top of underlying hardware. This means there is a really wide range of technologies which I cannot adequately describe in this space. There are simple and cheap approaches to implementing a shared file system as well as expensive and complex ones. Whenever performance, robustness, resilience, and availability are crucial factors, then more expensive shared file system technology is often the best choice.

When you're ready for more details about shared file system technologies, I recommend reading Margaret Crevar's Shared File Systems: Determining the Best Choice for your Distributed SAS® Foundation Applications paper.

In the meantime, I will oversimplify these concepts and look at several approaches to hosting files for SAS.

NFS = Network File System

NFS is a distributed file system communication protocol which allows computers to access files over the network. A simple scenario might be where you have files on one server that you'd like to access from other servers. Instead of copying the files to each machine directly, set up NFS so that the client computers can directly access the files on the server computer.

The main thing to understand about NFS is that data is moved across standard network connections (like Ethernet on the corporate LAN). If a substantial amount of data is transferred frequently, then that traffic can compete with other activity and even saturate the network altogether.
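To get a feel for how quickly bulk transfers can tie up a link, here is a rough back-of-the-envelope sketch. The `efficiency` factor (0.8) is an assumption standing in for protocol overhead, not a measured value, and the 100 GB data size is purely illustrative.

```python
def transfer_seconds(data_gb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Estimate the seconds needed to move data_gb gigabytes over a network link.

    link_gbps is the nominal link speed in gigabits per second.
    efficiency is an assumed fraction of that speed actually usable for data.
    """
    if data_gb < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("data_gb must be >= 0, link_gbps > 0, 0 < efficiency <= 1")
    data_gigabits = data_gb * 8  # 8 bits per byte
    return data_gigabits / (link_gbps * efficiency)


if __name__ == "__main__":
    # Compare a 1 GbE and a 10 GbE link for the same illustrative data volume.
    for link in (1, 10):
        minutes = transfer_seconds(100, link) / 60
        print(f"100 GB over {link} GbE: about {minutes:.1f} minutes")
```

Under these assumptions, 100 GB occupies a 1 GbE link for roughly 17 minutes, and that is before any competing traffic is taken into account.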

NFS is simple, easy to set up, and supported by most major operating systems (and Hadoop distros, too). But it's not very robust under heavy concurrent activity from multiple clients.

For SAS:

As a rule, avoid using a single computer as an NFS server for production data. Use at least a NAS when one is available.
A single computer sharing disk via NFS may be suitable for:
- simple testing, and possibly dev/test environments where functionality testing (rather than performance testing) is the goal
- storing metadata backups for a cluster of SAS 9.4 Metadata Servers
- providing access to a central SAS Software Depot

NAS = Network-Attached Storage

A NAS is typically an external appliance which acts as a central file server for the environment. Files are accessed from client computers using the NFS protocol (as well as others). The benefit to this approach is that the NAS appliance is dedicated to the job of sharing files and can be optimized for your needs. However, because it relies on the NFS protocol, data is still transferred using standard network connections.

Many hardware vendors offer NAS appliances. One we see often with our SAS customers is EMC Isilon. Isilon is very popular, but NFS-based storage solutions can still be challenging for SAS to work with when dealing with large amounts of data (see the Advisory Regarding SAS® Grid Manager with Isilon document from EMC).

For SAS:

A NAS is far better suited to production enterprise use since it’s a dedicated appliance that can be managed
Read-only access to production data for SAS should perform reasonably well on properly configured NAS solutions
Monitor activity to ensure adequate performance and throughput on the network
Avoid using NAS for heavy-duty read and write SAS operations - like SASWORK (and UTILLOC)
Ensure the underlying physical disk storage is dedicated to SAS - not striped to support other production software/databases concurrently
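One way to check where a directory such as SASWORK really lives is to look up its mount point in a Linux mount table. The sketch below parses text in the `/proc/mounts` layout (device, mount point, file system type, ...). The host name, mount points, and the set of "network" file system types are illustrative assumptions, and mount points containing spaces are not handled.

```python
import posixpath

# Assumed set of file system types to treat as network-based storage.
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "smb3"}

# Hypothetical mount table in /proc/mounts layout, for illustration only.
SAMPLE_MOUNTS = """\
/dev/sda1 / xfs rw,relatime 0 0
/dev/sdb1 /saswork xfs rw,noatime 0 0
nas01:/export/sasdata /sasdata nfs4 rw,vers=4.1 0 0
"""


def fs_type_for(path, mounts_text):
    """Return the file system type of the most specific mount containing path."""
    path = posixpath.normpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point = fields[1].rstrip("/") or "/"
        fs_type = fields[2]
        inside = (
            mount_point == "/"
            or path == mount_point
            or path.startswith(mount_point + "/")
        )
        # Prefer the longest matching mount point; on ties, the later entry wins.
        if inside and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def is_network_path(path, mounts_text):
    """True if path sits on one of the assumed network file system types."""
    return fs_type_for(path, mounts_text) in NETWORK_FS_TYPES


if __name__ == "__main__":
    for candidate in ("/saswork/tmp", "/sasdata/sales"):
        print(candidate, fs_type_for(candidate, SAMPLE_MOUNTS),
              is_network_path(candidate, SAMPLE_MOUNTS))
```

On a real Linux host, the contents of `/proc/mounts` could be passed in place of `SAMPLE_MOUNTS`.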

SAN = Storage Area Network

A SAN is another kind of external appliance for storing files and data for multiple computers in the environment. Instead of clients relying on standard network connections to access their files, computers have dedicated connections (HBA ports, Fibre Channel cards, SAN controllers, etc.) to the SAN. This provides a dedicated pathway to move data more efficiently so it doesn't compete with standard network traffic.

There are many SAN appliance vendors, too. IBM, EMC, Hitachi and many others offer a wide range of solutions. In general, SANs will outperform NAS solutions, but they usually come at a higher cost. Your customer will have an opinion about costs/benefits.

For SAS:

A SAN can be excellent for production enterprise use since it's a dedicated file-server appliance which transfers data using dedicated connections
Expect really good performance for SAS to read and write data, including SASWORK (and UTILLOC)
Ensure the underlying physical disk storage is dedicated to SAS - not striped to support other production software/databases concurrently

CFS = Clustered File System

A CFS is often lumped in with the term "shared file system". Here we mean something more specific: a CFS is usually an additional software purchase used to extend the capabilities of the storage appliance. Clustered file system technology offers many benefits, including the ability to manage multiple concurrent read+write access to files and directories in the storage appliance. This is very useful in situations where SAS computing services reside on multiple hosts in service of many users at the same time. With a CFS, access to the data is provided to all hosts in a performant manner.

As with other solutions, many companies offer an implementation of CFS. One which we know works particularly well for SAS solutions is IBM Spectrum Scale (often referred to by its original name, GPFS).

For SAS:

SAS requires a CFS when there are multiple instances of SAS server processes running across many hosts trying to access the same data simultaneously. One example is a SAS Grid Manager solution where users run many SAS programs simultaneously, all accessing a central collection of SAS data sets kept in the storage appliance.
While a CFS could be used for hosting all aspects of a SAS deployment, the layered costs of utilizing it should direct us to limit its use to the areas where it's particularly necessary.

Local disk

Most compute servers come equipped with local disk which is storage space hosted inside the same enclosure, typically for dedicated use by that compute server. Local disk will be home to the operating system, temporary scratch space, software files, even data storage.

Historically, local disk was serviced by old-fashioned spinning magnetic hard drives. This made it difficult to configure and provide SAS with the I/O throughput it needs for efficient performance. But modern technologies available today can provide local disk storage which is very, very fast.

For SAS:

Local disk is typically where the SAS 9.4 software and configuration files are kept
SASWORK (and UTILLOC) can be directed to use local disk as well, which reduces load on an external appliance
Remember that for the SAS Checkpoint and Restart functionality provided by the SAS Grid Manager solution to work, SASWORK (and UTILLOC) must be in shared storage, not on local disk.
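Whichever location is chosen for SASWORK, it helps to have at least a rough idea of the write throughput it delivers. The following is a minimal sketch of a sequential write timing, not a substitute for a proper I/O test: the 64 MB default size is an arbitrary assumption, far smaller than real SAS workloads, and results will be influenced by caching in the storage layers.

```python
import os
import tempfile
import time


def sequential_write_mb_per_s(directory, total_mb=64, block_kb=1024):
    """Write total_mb of random data into directory and return MB/s achieved.

    The temporary file is removed afterwards. fsync is called so that the
    timing includes flushing the data out of the operating system's buffers.
    """
    block = os.urandom(block_kb * 1024)  # random data avoids trivially compressible zeros
    blocks = max(1, (total_mb * 1024) // block_kb)
    fd, path = tempfile.mkstemp(dir=directory, prefix="iotest_")
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(blocks):
                handle.write(block)
            handle.flush()
            os.fsync(handle.fileno())
        elapsed = max(time.perf_counter() - start, 1e-9)
    finally:
        os.remove(path)
    written_mb = blocks * block_kb / 1024
    return written_mb / elapsed


if __name__ == "__main__":
    # Uses the system temp directory as a stand-in for a SASWORK location.
    target = tempfile.gettempdir()
    print(f"{target}: {sequential_write_mb_per_s(target):.0f} MB/s (rough estimate)")
```

Running it against a local disk path and a shared path on the same host gives a quick, if crude, comparison of the two.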

Other Data Sharing Scenarios

SAS Viya offers some new architecture possibilities with implications for shared file systems we haven't seen with SAS 9.4. In particular, SAS Cloud Analytic Services (CAS) has the ability to directly access data files from external sources or from disk. There are two areas in particular where CAS can capitalize on the availability of a shared file system which is mounted to all CAS hosts so that it can perform fast and efficient parallel loading of data:

DNFS = Distributed Network File System
When CAS is directed to use a caslib to access data with a srctype of "dnfs", then the CAS workers will each access their assigned portion of a SASHDAT file which is placed in the shared file system. This means a single copy of the data is accessible in parallel from all CAS workers
SAS data sets
CAS can also be directed to use a caslib of srctype "path" to access SAS data sets (or text-delimited files). If those data sets are placed in a shared file system mounted at the same location on all CAS hosts -and- the dataTransferMode parameter is set to "automatic" or "parallel", then CAS can perform a parallel load of that data.
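The two scenarios above can be sketched as the parameter sets one might hand to CAS actions from a client. This is only an illustration: the functions below build plain dictionaries and do not contact a server, the parameter names (`dataSource`, `srcType`, `importOptions`, `fileType`, `casOut`) are my assumption about the `table.addCaslib` and `table.loadTable` actions and should be checked against the CAS documentation for your release, and the caslib names, paths, and file names are hypothetical.

```python
def caslib_params(name, shared_path, src_type):
    """Build assumed parameters for adding a caslib over a shared directory.

    src_type is "dnfs" for SASHDAT files read in parallel by the CAS workers,
    or "path" for SAS data sets and delimited files.
    """
    if src_type not in ("dnfs", "path"):
        raise ValueError("src_type must be 'dnfs' or 'path'")
    return {
        "name": name,
        "path": shared_path,  # must be mounted at this same location on all CAS hosts
        "dataSource": {"srcType": src_type},
    }


def parallel_load_params(caslib, file_name, table_name):
    """Build assumed parameters for loading a SAS data set with a parallel transfer."""
    return {
        "caslib": caslib,
        "path": file_name,
        "casOut": {"name": table_name},
        "importOptions": {
            "fileType": "basesas",             # assumed file type name for SAS data sets
            "dataTransferMode": "parallel",    # setting named in the text above
        },
    }


if __name__ == "__main__":
    # Hypothetical names and paths, for illustration only.
    print(caslib_params("shared_hdat", "/shared/cas/hdat", "dnfs"))
    print(caslib_params("shared_sas", "/shared/cas/sasdata", "path"))
    print(parallel_load_params("shared_sas", "sales.sas7bdat", "sales"))
```

Keeping the shared path in one place like this makes it easier to confirm that every CAS host mounts it at the same location.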

In general, any shared file system technology can be used to accomplish these tasks. However, when performance and resilience come into play for production enterprise systems, then cheap is often the enemy of good.

Sharing SAS Software and Configuration Files Across Hosts

Sometimes it's not just data files which need to be shared with multiple computers. It's also possible to share SAS software and configuration files in the same way. One area where this is very helpful is the SAS Grid Manager solution.

With SAS Grid Manager, we typically assume a multi-tier deployment for backend services: meta, compute, and middle. The compute tier is of interest here. For a grid with 100 hosts providing the compute tier, it would be time-consuming to install and configure SAS solution software on each of those 100 hosts. Instead, if we create a shared file system for the SAS software and configuration, then we can deploy the compute tier components just once on the first host. The other 99 hosts will access the single set of SAS software and configuration files via the shared file system. After initial deployment, this also simplifies ongoing administration, management, hot fixes, and more.

SAS LASR Analytic Server also has an install-time option where you can direct the installer to place the LASR software files on one host with the expectation that they're placed in a shared file system available to other LASR hosts which are participating together to act as a single cluster.

When Not to Use Shared File Systems

Shared file systems are very useful. But they aren't a silver bullet for all problems. In fact, there are some places where you should take care to avoid using them.

Not for CAS Software Files:

Unlike LASR, CAS does not offer an option to access its software and configuration from a shared file system
Every CAS host expects to have its own local copy of its software and configuration files
SAS provides Ansible playbooks which handle this automatically as part of the deployment of SAS Viya software.

Not for CAS_DISK_CACHE:

CAS relies on disk available to each CAS host as a backing store for in-memory data. This provides the ability for CAS to behave resiliently when working with a failed node and provides additional data management flexibility
CAS creates files in the location for CAS_DISK_CACHE in a manner which conflicts with how shared file system technologies operate
Be careful to plan and deploy local disk for CAS_DISK_CACHE in all situations

Not for Sharing SAS 9.4 Deployment Files Across Meta and Middle Tiers:

The section above mentioned using a shared file system to host SAS 9.4 deployment files (software and configuration) for the SAS 9.4 compute tier software
However, the SAS 9.4 meta and middle tiers follow a different availability paradigm. We prefer that each host for the meta and middle tiers has its own local SAS software and configuration files
And even more importantly, never direct the SAS 9.4 Deployment Wizard to install one tier's software and configuration into the same physical directories as a different tier. In that way lies madness. 😉

In Conclusion

Shared file systems are an important and necessary technology which enables SAS solutions to scale effectively and operate efficiently. Finding a one-size-fits-all approach is unlikely when weighing considerations around capabilities and costs. Planning and collaboration with your IT organization are necessary to ensure that SAS can perform as designed.

mottycruz · ‎10-07-2019

no mention of SMBs here. Do you have best practices on how to mount SMBs in Linux SAS application server?