
How Analytic Workloads Work


SAS Enterprise Session Monitor Usage Series Part 2

This is the second post in a series on how to use SAS Enterprise Session Monitor. I’ve frequently referred people who were interested in practical information about Enterprise Session Monitor usage to two blog posts on the old ESM website. But as all good things on the internet do over time, the old host disappeared. So, I’ve reached out to the Enterprise Session Monitor team and received permission to bring the information back, in the hopes that you too will find it useful. -Erik Pearsall

 

Links

SAS Enterprise Session Monitor Usage Series Part 1

Unlocking SAS Performance with Enterprise Session Monitor Heatmaps

Obsessing over Observability

SAS Enterprise Session Monitor - Obsessing over Observability - SAS Users

 

ESM Deployment Using Ansible

SAS Enterprise Session Monitor – Deployment Using Ansible

 

Custom Process Monitoring with filterSpec in Enterprise Session Monitor

Taking Control: Custom Process Monitoring with filterSpec in Enterprise Session Monitor

 

Tag! You're it! Mastering ESM Tags

Tag, You’re It!: Mastering ESM Tags

Analytic Workloads through the Lens of SAS Enterprise Session Monitor

A version of this article originally appeared on boemskats.com in July 2020 as “How Analytic Workloads Work” by Nikola Markovic. It has been modified and reposted with permission from the Enterprise Session Monitor team.

 

This post is about fundamentals, kept as "real world" as possible. A big part of that involves explaining how these fundamentals fit into the modern cloud landscape and affect performance and cost, including for SAS Viya. But to get there, and to truly appreciate it, it's worth starting from the basics.

The Basics: How Processors Process Data

Consider a program that looks like this.

data work.inventory;
    infile '/some/data/inventory.csv';
    input someColumns : $someFormats.
          /* more columns, probably */ ;
run;

This simple snippet of SAS code uses a SAS DATA step - one of the oldest and most widely used data load mechanisms in SAS - to read a text-based CSV file (/some/data/inventory.csv) and store it as a SAS dataset in the temporary WORK library (work.inventory). The WORK library is a bit like a SAS scratch disk, used to store intermediate datasets between processing steps.
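
If it helps to make those placeholders concrete, here is a minimal, hypothetical version of the same step. The column names, informats, and INFILE options are invented purely for illustration and are not from the original example.

data work.inventory;
    /* DSD handles comma-delimited fields; FIRSTOBS=2 skips a header row */
    infile '/some/data/inventory.csv' dsd firstobs=2;
    input item_id      : $12.          /* character item identifier */
          warehouse    : $8.           /* character warehouse code  */
          qty          : 8.            /* numeric quantity on hand  */
          last_updated : anydtdtm20.;  /* date/time of last update  */
    format last_updated datetime20.;
run;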

 

So, what actually happens as this program runs? Consider:

01_EP_c4_esm_cpu-anim2-1536x1004.png

 


 

For this code to run, and for the machine to do the work we're asking it to do, the following things must happen:

  • The input data, that CSV file, needs to be read from storage - a disk somewhere.
  • That data then needs to be transported from that storage device to the CPU on which the code is running.
  • The CPU will receive the data in chunks and must process it. It needs to work out which segments of those chunks are the lines of the CSV file, which segments of each line are the actual columns of data, and how each of those column segments should be interpreted - whether the text between those two commas is a phone number, a timestamp, or someone's address.
  • Once it has worked all that out, its task is to write the resulting data out. In the case of this code, it will be writing the data out in a format it will find easier to understand next time round - the SAS binary data format (sas7bdat). That resulting data must then be transported from that CPU to a storage device somewhere where it can be written, to be read back again later.

These fundamentals are important. If you've ever run out of storage space on your laptop, you'll know that the amount of data a disk can hold is limited by its capacity. If you then tried to free up some space by moving some of the larger files to an external disk, you'll have discovered that the rate at which disks are able to read and write data is also very limited - by either the speed of the disk, or the speed of its connection to your laptop.

 

It's a common assumption that the speed and age of a machine's processor dictate its performance. While it's true that a faster, newer CPU can execute more instructions per second than an older one, if those instructions amount to 'process this data', that CPU can't do anything useful if the data it's meant to be processing hasn't reached it yet. It will spend much of its time waiting for the rest of the system to catch up and supply it with more data; the faster the processor, the more time it will likely spend waiting.

 

The question when processing data is: how much data is enough to keep my processor busy? Time for an example from the real world.

An example from the real world

For this example, a large CSV file was loaded using code very similar to the snippet above, and the process was monitored with SAS Enterprise Session Monitor. This captured information about the data load step's performance, along with other information like the rate at which the data was reaching the processor and how busy the processor was as a result.

 

The following graph is a straight export from ESM.

02_d1_esm2_erik.png

From this graph, we can see that this code started executing at 17:00:35 and ran for just under two minutes, until 17:02:30. Here are a few other things the graph shows.

CPU Capacity

Throughout the time it ran, the code used close to 100% CPU. The 'percent CPU' process metric commonly used to describe CPU utilization in this context can be quite confusing. Technically, a program running at 100% CPU is using CPU cycles equivalent to the capacity of a single processor core - that is, all of the processing time available to a single thread of execution.
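
As an aside, plain SAS can surface a version of this metric itself. A minimal sketch, using the FULLSTIMER system option:

/* FULLSTIMER makes SAS log detailed resource usage for every step:
   the log then reports "real time", "user cpu time", "system cpu time",
   memory used, and more. For a single-threaded step, user cpu time
   close to real time means it ran at roughly 100% CPU. */
options fullstimer;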

 

A word on threads. Some programming languages, and many SAS Foundation procedures, can split their work into multiple independent streams of instructions, enabling them to distribute that work to multiple workers - multiple threads. This is how they're able to 'go above 100% CPU': by utilizing more than one CPU core at a time. CAS, the core compute component of SAS Viya, is a good example of a compute engine that does almost all of its processing this way. CAS not only distributes its work across multiple CPU cores but can also distribute it across multiple nodes in a cluster, achieving 'massively parallel processing' and huge gains in performance and processing speed.
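
In SAS Foundation, threading is governed by a pair of system options - sketched here for illustration:

/* THREADS allows threaded procedures (such as PROC SORT) to use
   multiple cores; CPUCOUNT tells SAS how many cores to consider
   available. A threaded procedure can then exceed '100% CPU';
   a single-threaded DATA step still cannot. */
options threads cpucount=actual;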

 

However, older language elements like the classic SAS DATA step, or the standard Python runtime (with its Global Interpreter Lock), are strictly single-threaded. They are only capable of processing one linear stream of instructions at a time, and a single piece of such code can never use more than 100% CPU - regardless of how many CPU cores it has at its disposal.

 

You can think about it like this: when a step flatlines at 100% like the one in this chart, it is a single-threaded step. As such, because it's flatlining at its peak possible utilization, it is running at optimal performance, as efficiently as it possibly can. In other words, here the processor *is* being supplied with enough data to keep it as busy as it could be, and the limiting factor in the completion time of this code is indeed the processor.

Data Throughput

But at what rate does the data need to flow to and from the processor for it to maintain this optimal 100% CPU utilization? How much of that CSV data is it capable of converting into that easier-to-read-back-in-later SAS binary format in each time period, and how much of the resulting binary data does it produce as output in that time?

 

03_d15_esm2_erik.png

 

Hovering over the chart above shows that, for this code, this rate was around 100 MB per second for the incoming CSV file being read from disk, and a little over 200 MB per second for the binary dataset being produced and written to disk.
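
As a rough sanity check, those rates are consistent with the step's runtime: just under two minutes (about 115 seconds) at roughly 100 MB/s of input implies a CSV of around 11-12 GB, and at roughly 200 MB/s of output a binary dataset of around 22-24 GB - the roughly 2:1 expansion explained below.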

 

The SAS binary dataset this step produces will be 'easier to read next time round'. One reason future steps will find it easier is precisely that they won't have to figure out where each piece of data starts and stops - the work the DATA step does when it first reads the CSV file from disk. In the binary dataset, the data is stored in fixed-width variables, where each record and each column are aligned and occupy the same amount of space on disk. The downside is that, because all those trailing blank characters are also written to disk, the resulting sas7bdat dataset takes up more disk space. This also explains why, in this graph, the output data is written at twice the rate that the source data is read in.

 

With that in mind, if we were in a situation where we didn't have enough disk space to store that padded-out file, or our target disk device wasn't fast enough to cope with a 200 MB/s write speed, we could tell SAS to compress the output before writing it to disk. This would change the way those blanks are stored, drop the size of the output file by around 60%, and therefore also reduce the rate at which it needed to be written. However, from the analysis earlier we know that the CPU already has no time to spare (it was running at 100%), and asking it to also apply compression to the output data would give it even more work to do. Doing that in this instance would increase the overall completion time of our load step by around 30%, and as we are not currently sharing this instance with anyone else, that's a cost we don't need to pay.
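
For reference, enabling that compression is a one-option change in SAS - shown here as a sketch against the placeholder code from earlier:

/* COMPRESS=CHAR applies run-length encoding, which suits the long
   runs of trailing blanks described above; COMPRESS=BINARY (RDC)
   can suit wide, largely numeric records better. */
data work.inventory(compress=char);
    infile '/some/data/inventory.csv';
    input someColumns : $someFormats.
          /* more columns, probably */ ;
run;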

 

You could say that here, we are choosing to keep the computational complexity of our program lower, because it lets our CPU process more data faster. We are doing that because, as our metrics suggest, our instance can cope with the higher disk throughput and extra storage space required to execute the data load in the shortest possible time.

Computational Complexity

Loosely speaking, computational complexity is the idea that when the processor performs a complex operation on a given volume of data, it can take a long time to finish, while a different, simpler operation on the same volume of data can finish much, much faster. It all depends on how complex - how computationally expensive - the thing it's been asked to do with that data is.

 

Here's another example program execution. I started with the same code as above, but this time added an extra step to describe the dataset (a PROC MEANS), and another to sort it (a PROC SORT) - roughly as sketched below.
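
The added steps looked something like this; the sort key is hypothetical:

/* PROC MEANS with no VAR statement summarizes all numeric variables */
proc means data=work.inventory;
run;

/* sort the loaded dataset by a (hypothetical) key column */
proc sort data=work.inventory out=work.inventory_sorted;
    by someColumn;
run;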

 

04_d2_esm2_erik-1024x266.png

 

Here, the load of the CSV data again took around two minutes. Describing the data then took another two minutes, and sorting it took a little over four minutes.

 

Let's focus on the first two steps for now - the DATA step and the PROC MEANS.

05_d3_esm2_erik-1024x268.png

 

We know what the first step does - it reads data from a CSV file and writes it to a SAS dataset. The new, second step then reads the dataset created by the first step and calculates some summary statistics for it. For this to happen, the same data written by the first step - the DATA step - is read back by the second - the PROC MEANS step. This is why, if you squint a little, you can make out that the volume of teal bars covering the second step is the same as the volume of soft red bars covering the first.

 

However, while the two steps do work on the exact same volume of data, there's an obvious difference between them in terms of the amount of CPU resource they require.

 

You'd be forgiven for assuming that the second step, the one that performs actual calculations on the data, would require more processing power than the first, which just loads it. But because in this case the data contains few numeric variables, and the input dataset is already in that SAS binary format, the MEANS procedure doesn't have to work very hard here. The entire second step barely needs any CPU time to do its thing.

 

So, if that's the case, if it requires barely any work to process that data, why does the second step take almost as long to complete as the first?

 

The answer may seem obvious by now. It's because all of that data still needs to be read back from disk for the MEANS procedure to process it. Unlike the first step, this instance of the MEANS procedure is a typical example of a step where the storage system isn't able to supply the processor with data at a sufficient rate to make use of all the available CPU time. To approach 100% CPU, it would need to read the data in at about 1.1 GB/sec.

 

A chart in SAS Enterprise Session Monitor showing a job using only a fraction of a CPU thread, while plenty of CPU resource is still available on that node, almost always indicates that the job is either waiting for more data to be read in or waiting for the data it's producing to be written out before more can be processed. The hold-up can be a network connection, a database query, or, as in this case, the artificial data throughput limit our cloud provider applies: the instance type we selected for this particular node in our cluster caps the read speed from our virtual disk device at 200 MB/second.

 

06_d4_esm2_erik-1024x263.png

 

Finally, the chart above includes a third step, which sorts the dataset produced by the first. It is included here as an example of what other, 'less simple' procedures can look like in SAS Enterprise Session Monitor. PROC SORT is an example of a step that uses memory and depends on memory limits, and only uses the disk if it finds that it can't do what it needs to within those limits. It's also an example of a multi-threaded operation.
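
In SAS, that memory ceiling is governed by the SORTSIZE system option - the value below is purely illustrative:

/* cap the memory PROC SORT may use at 2 GB; if the data doesn't fit
   within this limit, the sort spills to temporary utility files on disk */
options sortsize=2G;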

Why is any of this important for me?

The huge range in the computational complexity of operations that data scientists commonly throw at their data is a key characteristic of analytic workloads. Understanding the underlying fundamentals of how your code uses system resources will help you build better, more efficient data jobs, particularly when building data engineering and data management pipelines.

 

Why should you care if your code is better or more efficient?

 

If you are using SAS professionally, there is a very good chance that you share your server with multiple other users, maybe even multiple teams of users. The resources on that server are limited, and writing code that makes the best use of what you have available will result in faster completion times and an increase in system responsiveness and overall capacity.

 

The ability to look at a job profile and recognize what resource-intensive code looks like can be extremely powerful in quickly identifying the root causes of poor system performance. The global observability that SAS Enterprise Session Monitor offers lets experienced developers keep an eye on the workloads their teams are running and perform on-the-fly root cause analysis, identifying bottlenecks and even mentoring colleagues who might not know that the code they are generating could run faster or be more efficient. We've seen this 'collaborative analysis' approach result in collective increases in performance that more than double the effective capacity of a platform.

The Cloud

If you are considering moving your analytic workload to the Cloud, a precise understanding of your resource requirements can play a huge role in both your projected cloud infrastructure costs and your confidence in those projections.

 

All the major Cloud providers offer a range of instance types skewed towards CPU-, disk-, or memory-intensive workloads. All of them let you scale each of those instances up, with the same resource skew, almost indefinitely. Deciding which instance type to choose, or which batches are best suited to which instance type and size, is difficult to do without properly profiling your workloads. Your IT department may be able to derive some of this data from the physical hardware you're running your workloads on now, but often this kind of sizing results in cloud infrastructure cost projections that are astronomical.

 

Exact knowledge of the size and skew of your workloads' resource requirements, and the consistency with which they use those resources - broken down by individual flow, job, or group of users - lets you identify and select quick-win candidates that can easily be 'lifted and shifted' to the Cloud in their current state. It can highlight candidates for optimization that can be augmented and migrated in a second wave, and drive the decision around which parts of your pipeline need to be re-engineered before they are redeployed. It enables a precise, data-first approach to cloud architecture that can eliminate guesswork and make the move much simpler. Digital transformation with confidence.

Conclusion

In this post we looked at how the CPU is “fed” with data, learned about the resource requirements of different steps and operations, and talked about how those requirements can impact overall system performance.

 

Find more articles from SAS Global Enablement and Learning here.
