SAS software uses disk in myriad ways, and providing the appropriate kind of disk storage for SAS is not a one-size-fits-all exercise. We've already looked at disk I/O throughput, but there are other factors, too. So, in this post let's talk about how shared file systems can work with SAS.
A shared file system, at its simplest, provides access to a single set of files (and directories) to multiple computer hosts. There's a balance we need to understand to guide customers properly about how SAS uses shared file system technology.
File systems are typically implemented in a software layer on top of the underlying hardware, so there is a far wider range of technologies than I can adequately describe in this space. There are simple, inexpensive approaches to implementing a shared file system as well as expensive and complex ones. When performance, robustness, resilience, and availability are crucial factors, the more expensive shared file system technology is often the best choice.
When you're ready for more details about shared file system technologies, I recommend reading Margaret Crevar's Shared File Systems: Determining the Best Choice for your Distributed SAS® Foundation Applications paper.
In the meantime, I will oversimplify these concepts and look at several approaches to hosting files for SAS.
NFS is a distributed file system protocol that allows computers to access files over the network. A simple scenario might be where you have files on one server that you'd like to access from other servers. Instead of copying the files to each machine, you can set up NFS so that the client computers access the files on the server directly.
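As a rough sketch of how that looks on Linux (the host name, directory path, and export options below are hypothetical and will vary by site), the server publishes a directory in /etc/exports and each client mounts it:

```shell
# On the server (e.g., sasdata01): export a directory to the client subnet.
# In /etc/exports -- rw allows read/write; sync favors safety over speed:
#   /sasdata    192.168.10.0/24(rw,sync,no_subtree_check)

# Apply the export table and confirm what is being shared.
sudo exportfs -ra
showmount -e sasdata01

# On each client: mount the share at the same path SAS programs will use.
sudo mkdir -p /sasdata
sudo mount -t nfs -o rw,hard,noatime sasdata01:/sasdata /sasdata
```

The `hard` and `noatime` options are common starting points for data shares, but the right mount options for SAS workloads depend on your NFS version, network, and workload profile.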
The main thing to understand about NFS is that data is moved across standard network connections (like Ethernet on the corporate LAN). If a substantial amount of data is transferred frequently, then that traffic can compete with other activity and even saturate the network altogether.
NFS is simple, easy to set up, and supported by most major operating systems (and Hadoop distros, too). But it's not very robust in the face of heavy, concurrent activity from multiple clients.
For SAS:
A NAS is typically an external appliance which acts as a central file server for the environment. Files are accessed from client computers using the NFS protocol (as well as others). The benefit of this approach is that the NAS appliance is dedicated to the job of sharing files and can be optimized for your needs. Because it relies on the NFS protocol, however, data is still transferred over standard network connections.
Many hardware vendors offer NAS appliances. One we see often with our SAS customers is EMC Isilon. Isilon is very popular, but NFS-based storage can still be challenging for SAS to work with when dealing with large amounts of data (see EMC's Advisory Regarding SAS® Grid Manager with Isilon document).
For SAS:
A SAN is another kind of external appliance for storing files and data for multiple computers in the environment. Instead of relying on standard network connections to access their files, client computers have dedicated connections (HBA ports, Fibre Channel cards, SAN controllers, etc.) to the SAN. This dedicated pathway moves data more efficiently, so it doesn't compete with standard network traffic.
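From the host's point of view, a SAN LUN simply appears as a block device. A minimal sketch of putting one to work on Linux (the multipath device name and mount point here are hypothetical; a plain local file system like this gives each host its own storage, not shared concurrent access):

```shell
# Format the LUN presented over Fibre Channel (appears as a multipath device).
sudo mkfs.xfs /dev/mapper/mpatha
sudo mkdir -p /saswork

# Mount at boot; noatime avoids extra metadata writes, which helps
# scratch-heavy I/O patterns like SAS WORK.
echo '/dev/mapper/mpatha /saswork xfs defaults,noatime 0 0' | sudo tee -a /etc/fstab
sudo mount /saswork
```

Sharing that same LUN for concurrent read/write across multiple hosts is exactly where a clustered file system (discussed below) comes in.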
There are many SAN appliance vendors, too. IBM, EMC, Hitachi and many others offer a wide range of solutions. In general, SANs will outperform NAS solutions, but they usually come at a higher cost. Your customer will have an opinion about costs/benefits.
For SAS:
Although often lumped under the generic term "shared file system," a CFS is usually an additional software purchase used to extend the capabilities of the storage appliance. Clustered file system technology offers many benefits, including the ability to manage multiple concurrent read and write accesses to files and directories in the storage appliance. This is very useful in situations where SAS computing services may reside on multiple hosts in service of many users at the same time. With a CFS, access to the data is provided to all hosts in a performant manner.
As with the other solutions, many companies offer an implementation of CFS. One which we know works particularly well for SAS solutions is IBM Spectrum Scale (often referred to by its original name, GPFS).
For SAS:
Most compute servers come equipped with local disk, which is storage space hosted inside the same enclosure, typically for dedicated use by that compute server. Local disk is home to the operating system, temporary scratch space, software files, and even data storage.
Historically, local disk was serviced by old-fashioned spinning magnetic hard drives. This made it difficult to configure and provide SAS with the I/O throughput it needs for efficient performance. But modern technologies, such as SSD and NVMe devices, can provide local disk storage which is very fast indeed.
For SAS:
SAS Viya offers some new architecture possibilities, with implications for shared file systems that we haven't seen with SAS 9.4. In particular, SAS Cloud Analytic Services (CAS) has the ability to directly access data files from external sources or from disk. There are two areas in particular where CAS can capitalize on a shared file system mounted to all CAS hosts so that it can perform fast and efficient parallel loading of data:
In general, any shared file system technology can be used to accomplish these tasks. However, when performance and resilience come into play for production enterprise systems, then cheap is often the enemy of good.
Sometimes it's not just data files which need to be shared with multiple computers. It's also possible to share SAS software and configuration files this way. One area where this is very helpful is the SAS Grid Manager solution.
With SAS Grid Manager, we typically assume a multi-tier deployment for backend services: metadata, compute, and middle tiers. The compute tier is of interest here. For a grid with 100 hosts providing the compute tier, it would be time-consuming to install and configure SAS solution software on each of those 100 hosts. Instead, if we create a shared file system for the SAS software and configuration, then we can deploy the compute tier components just once, on the first host. The other 99 hosts access the single set of SAS software and configuration files via the shared file system. After initial deployment, this also simplifies ongoing administration, management, hot fixes, and more.
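For illustration, the other 99 hosts would each mount the shared software and configuration directories at the same paths used on the first host (the NFS server name and directory paths below are hypothetical; your deployment's install and config locations will differ):

```shell
# On each additional compute host: mount the shared SAS install and
# configuration directories at the same paths as on the first host.
echo 'sasnfs:/opt/sas/sashome  /opt/sas/sashome  nfs  defaults,hard  0 0' | sudo tee -a /etc/fstab
echo 'sasnfs:/opt/sas/config   /opt/sas/config   nfs  defaults,hard  0 0' | sudo tee -a /etc/fstab
sudo mkdir -p /opt/sas/sashome /opt/sas/config
sudo mount -a
```

Keeping the paths identical on every host is the key point: the grid nodes all reference one copy of the software, so a hot fix applied once is immediately visible everywhere.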
SAS LASR Analytic Server also has an install-time option where you can direct the installer to place the LASR software files on one host, with the expectation that they reside on a shared file system available to the other LASR hosts participating together as a single cluster.
Shared file systems are very useful. But they aren't a silver bullet for all problems. In fact, there are some places where you should take care to avoid using them.
Not for CAS Software Files:
Not for CAS_DISK_CACHE:
Not for Sharing SAS 9.4 Deployment Files Across Meta and Middle Tiers:
Shared file systems are an important and necessary technology which enable SAS solutions to scale effectively and operate efficiently. Finding a one-size-fits-all approach is unlikely when weighing considerations around capabilities and costs. Planning and collaboration with your IT organization is necessary to ensure that SAS can perform as designed.