
Contemplating I/O for SAS


The SAS platform - encompassing all aspects of our analytics and data processing, including Base SAS, Scalable Performance Data Server, LASR Analytic Server, as well as Viya and CAS - works directly with data stored on disk or in a third-party provider's RDBMS. For example, Base SAS reads data sets from disk and then, depending on the task at hand, writes out and reads back scratch files in SASWORK and UTILLOC during interim steps of processing. And even though CAS is an in-memory analytics engine - meaning it does all of its analytic processing in RAM and doesn't rely on disk for interim work - it still places data on disk for long-term storage, failover, and resiliency. In all cases, what we're talking about here is moving data from storage on disk to the CPU (and back!) where SAS can work with it. Planning the rate at which that data can move is an important step in a successful SAS deployment.
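
As a quick way to see where a Base SAS session is doing that scratch I/O, PROC OPTIONS can report the current WORK and UTILLOC locations. A minimal sketch (the paths it reports will of course be site-specific):

/* Report where this SAS session writes its scratch files */
proc options option=(work utilloc);
run;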

 


 

When a SAS deployment seems slow at analytics processing even though the system has been sized appropriately for CPU and RAM, we usually find that the data's I/O throughput is a major factor. If SAS is waiting for data read from a conventional spinning hard disk drive with magnetic media (HDD), then the CPU will appear to be mostly idle. That's because a single HDD in this scenario can usually deliver data at a sustained I/O throughput of only 15-20 MB per second. In other words, the data is merely trickling in and the CPU spends more time in a waiting state than it does actually processing the data. One symptom to watch for is when the SAS log reports that the "real time" (a.k.a. wall clock time) is far greater than the "cpu time" (i.e., the time SAS spends actually crunching numbers or otherwise doing real work):

 

real time          47.23 seconds
cpu time            0.89 seconds
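
For reference, statistics like these come straight from the SAS log. The sketch below (the input table name is hypothetical) turns on the FULLSTIMER option so the log reports extended real time, CPU time, and memory statistics for a deliberately I/O-bound step:

options fullstimer;          /* extended timing and memory statistics in the log */

/* An I/O-bound step: read every row, do no real work.                */
/* WORK.BIGDATA is a hypothetical large table; substitute your own.   */
data _null_;
   set work.bigdata;
run;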

 

To help alleviate this problem, the typical recommendation is to deploy a storage solution capable of transferring data to SAS at a rate of 100-150 MB per second per CPU core. This ensures that data can be delivered fast enough to keep the CPU more fully utilized as it processes the incoming data. That is, we'd like to see the machine constrained by the amount of work the CPU can do, not by the rate of disk I/O. So for a single host machine with 32 CPU cores:

 

32 CPU cores × 150 MB/sec per core = 4,800 MB/sec
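
If it helps to keep that arithmetic handy, the same calculation can be expressed as a tiny DATA step. The core count and per-core rate are simply the values from the example above, so swap in your own:

/* Back-of-the-envelope target throughput for a single host */
data _null_;
   cores       = 32;                    /* CPU cores on the host        */
   mb_per_core = 150;                   /* recommended MB/sec per core  */
   target_mbps = cores * mb_per_core;   /* desired sustained I/O rate   */
   put 'Target I/O throughput: ' target_mbps 'MB/sec';
run;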

 

To achieve that level of I/O throughput with conventional spinning disk drives, we might place over 200 physical HDD in a storage cabinet, stripe them to work together as a single volume, and dedicate the whole shebang to SAS' exclusive use - a very expensive proposition for many sites! Alternatively, we could go with a distributed storage technology designed for that kind of performance (also not cheap). Technology marches ever onward, so we might also consider using solid-state drives (SSD) instead. A single SSD can provide I/O throughput many times what an HDD can deliver; in some cases, vendors advertise read/write speeds over 500 MB/sec. While costs have come down a lot in recent years, SSD are still more expensive than HDD of equivalent volume. But because we can use fewer SSD to achieve the same I/O throughput as many HDD, we may find them to be more cost effective anyway when performance (not raw disk volume) is the goal.
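
To make the HDD-versus-SSD comparison concrete, a rough device-count estimate can be sketched the same way. The per-device rates below are the ballpark figures cited above, not measurements from any particular product:

/* Roughly how many devices are needed to sustain the target rate? */
data _null_;
   target_mbps = 4800;                          /* from the 32-core example   */
   hdd_mbps    = 20;                            /* sustained MB/sec per HDD   */
   ssd_mbps    = 500;                           /* advertised MB/sec per SSD  */
   hdd_count   = ceil(target_mbps / hdd_mbps);
   ssd_count   = ceil(target_mbps / ssd_mbps);
   put 'HDDs needed: ' hdd_count / 'SSDs needed: ' ssd_count;
run;

Real storage designs also have to account for RAID or striping overhead, so treat counts like these as a floor, not a specification.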

 

There is no one-size-fits-all approach to provisioning disk storage for SAS technologies. We must take care to proactively identify the areas where data transfer will play a key role in the perceived performance of our SAS solutions. For example:

 

  • Base SAS makes extensive read+write use of the SASWORK location as scratch space - and, for multi-threaded PROCs, UTILLOC (which can optionally be defined as a different location from SASWORK; see the configuration sketch after this list).
  • CAS doesn't need scratch disk that way, but it does rely on CAS_DISK_CACHE as a backing store for in-memory data. If the disks providing CAS_DISK_CACHE are slow, then that can affect CAS' perceived speed of data management as well.
  • LASR and CAS both read and write SASHDAT files with the storage systems they respectively support - notably, the Hadoop Distributed File System (when symmetrically co-located) is a storage technology they both can use. Provisioning disk for the DataNodes of HDFS follows its own rules.
  • SAS Scalable Performance Data Server is built to distribute data across multiple disks to achieve high performance I/O throughput.
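
For the first bullet above, the WORK and UTILLOC locations are set at SAS invocation, typically in the sasv9.cfg configuration file or on the command line. A minimal sketch, with purely hypothetical paths standing in for whatever fast, dedicated storage the site provides:

/* sasv9.cfg excerpt - point scratch space at fast, dedicated storage   */
/* (the /fastdisk paths here are hypothetical placeholders)             */
-WORK    /fastdisk/saswork
-UTILLOC /fastdisk/sasutil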

 

While it would simplify things a lot if we could demand that our environment always provide storage capable of feeding every CPU core at a rate of 150 MB/sec, there are often times when that's not realistic. The majority of SAS deployments in the field are unable to achieve anywhere near that level of disk I/O performance. Often, the ideal is a solution that performs at the level required and within budget for time and costs. That ideal, however, is far more difficult to quantify - it varies by site, by job, even by individual user. It's far easier for SAS to recommend the maximum throughput possible, as that helps provide a level playing field for building performant solutions. This ensures we are all making the same considerations and establishes a baseline for nominal performance. That said, some sites will simply do whatever they want. It's up to us as SAS experts to ensure that everyone is informed so that we can proceed together toward the I/O performance desired.
