
Unlocking SAS Performance with Enterprise Session Monitor Heatmaps


SAS Enterprise Session Monitor Usage Series Part 1

I’ve frequently referred people who were interested in learning practical information about Enterprise Session Monitor usage to two blog posts on the old Boemska website. But as all good things on the internet do over time, the old host disappeared. So, I’ve reached out to the Enterprise Session Monitor team and received permission to bring the information back, in the hopes that you too will find it useful. - Erik Pearsall

Links

Obsessing over Observability

SAS Enterprise Session Monitor - Obsessing over Observability - SAS Users

 

ESM Deployment Using Ansible

SAS Enterprise Session Monitor – Deployment Using Ansible

 

Custom Process Monitoring with filterSpec in Enterprise Session Monitor

Taking Control: Custom Process Monitoring with filterSpec in Enterprise Session Monitor


Tag! You're it! Mastering ESM Tags.

Tag, You’re It!: Mastering ESM Tags

SAS Enterprise Session Monitor Usage Series Part 2

How Analytic Workloads Work

Using Enterprise Session Monitor Heatmaps

A version of this article originally appeared on boemskats.com in July of 2020 as “Unlocking SAS Performance Secrets” by Chris Blake. This has been modified and reposted with permission from the Enterprise Session Monitor team.

 

One of the most useful features of ESM is the "Top 50 Heatmap". It provides a neat, very visual way of quickly seeing what's happening (live or historically) on a server at any point in time. It's a bit like a spectrogram, but for workloads. It can instantly show you what your users and jobs are doing and even highlight if they are affecting each other. In cases like the one in this post, it can also show fundamental issues with the underlying system configuration (spoiler alert!).

 

[Figure: ESM Top 50 Heatmap showing effective CPU utilization for the 50 most active user sessions (c1_erik_esmp1.jpg)]

What we're looking at above is a picture of the top 50 most active user sessions on a SAS 9.4 server, showing their effective CPU utilization over the selected 5-minute timespan. Looking at the heatmap, you can quite clearly see a pattern where most of those 50 user sessions - I count around 30 - are trying to do sustained work but are really struggling. Every 30 seconds or so there is a burst of activity, where CPU utilization increases in perfect sync - but only for a few seconds at a time. Not long after, those sessions go back to barely using any CPU at all. The scale is not shown in the image, but a hover tooltip tells us that the darkest of those red tiles represents around 90% of a single CPU, which is good performance for typical workloads.

 

On a day-to-day basis, our customers typically use the Heatmap feature to identify when a process is using a disproportionate amount of a resource and causing problems for everyone else. In that scenario, you would see a pattern like this one, but there would be some lines that get darker just as the others get lighter, thus highlighting the greedy process(es) causing the problem. These situations also tend to be a lot more obvious when using ESM's disk I/O heatmaps, which highlight when disk-hungry or inefficient code is affecting the performance and developer experience for everyone else.

 

However, in this case, you can see that there aren't any processes that stand out in that way. This suggests that there may be a real environment-wide resource availability issue at play. Looking at overall CPU usage on that node showed that there was plenty of CPU time available during those periods of starvation (there's a node-level graph further down this post), so the problem is more likely one of getting enough data to the CPUs. In other words, the environment either has insufficient or misconfigured storage infrastructure, or it's simply not getting the disk bandwidth it was provisioned with.

 

Time to pick up the phone and speak to the storage team.

Talking to IT and Infrastructure teams

In most organizations - but particularly large enterprises - this is where the fun and games start. We're confident that the issue is caused by insufficient disk bandwidth. However, especially as this server is virtualized, the storage team will quite likely say something like: "This is not a storage problem. You need to speak to the VMware team. Your machine is probably starved of CPU because you are running on an over-provisioned VM host. We know those guys; they're always doing that."

 

At this point, you may be in for a lengthy game of ping-pong.

 

However, because the SAS Administrator has SAS Enterprise Session Monitor, they can confidently say that over-provisioning and CPU starvation are not the issue here. Here's why:

 

Let's look at that heatmap again. You will notice that among the sea of simultaneously suffering processes, there are some that don't seem to be affected by this problem at all: the 5th one from the top, the one 12 sessions below that, then two below that one, then the second one from the bottom. Here is the same image again, this time highlighting the sessions I'm talking about:

[Figure: the same Top 50 Heatmap with the unaffected sessions highlighted (c2_erik_esmp1.jpg)]

These outliers prove that this is not a CPU starvation problem. If it were, they too would display the exact same CPU starvation phasing pattern, which they do not. These outliers appear to be getting almost optimal performance on an otherwise extremely slow host. Whatever they're doing, it probably doesn't depend on those slow disks.

 

Drilling into the performance graphs for each of these processes showed us that, in each case, the disk throughput they were using was very low and the size of their individual SAS WORK and UTIL directories was close to zero, meaning the data they were getting was probably coming from another disk.

 

In other words, it was clear that these exceptions - these optimally running processes - were either doing in-memory work or running computational tasks that didn't seem to be interacting with the SAS WORK disk.
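
If you want to sanity-check this for a session of your own, a minimal sketch along these lines can report where a SAS session's WORK library lives and roughly how much space it is using. This is not an ESM feature; it assumes a Linux host with XCMD enabled so that SAS can call the standard du command.

/* Minimal sketch (not an ESM feature): find this session's WORK path */
%let workpath = %sysfunc(pathname(work));
%put NOTE: WORK library path: &workpath;

/* Assumes a Linux host with XCMD enabled; du is an OS command, not SAS */
filename dusize pipe "du -sh &workpath";

data _null_;
   infile dusize;
   input;
   put 'NOTE: WORK directory size: ' _infile_;
run;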

 

Now is probably a good time to remind ourselves of the rule of thumb for architecting SAS "mixed analytic workloads". The rule says to provision a SASWORK storage volume capable of at least 100 MB/second of sustained write throughput and 100 MB/second of sustained read throughput for each CPU core (see Important Performance Considerations When Moving SAS to a Public Cloud).
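
Applied to the 8 vCPU host discussed below, that rule of thumb works out to roughly 800 MB/second of sustained read and 800 MB/second of sustained write throughput. The trivial DATA step here is purely an illustration of the arithmetic, using those assumed numbers:

/* Illustrative arithmetic only: the 100 MB/s-per-core rule of thumb */
data _null_;
   cores         = 8;      /* vCPUs on the compute node                 */
   mbps_per_core = 100;    /* sustained MB/s per core (read and write)  */
   required      = cores * mbps_per_core;
   put 'Required sustained read throughput : ' required 'MB/s';
   put 'Required sustained write throughput: ' required 'MB/s';
run;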

 

With that in mind, here is the overall node-level chart I mentioned earlier. It shows CPU utilization in blue, with light green and soft red bars representing device-level reads and writes to/from the SASWORK disk:

[Figure: node-level chart of CPU utilization and SASWORK device reads/writes (c3_erik_esm_analysis_node2-1024x466.png)]

Without the support of our session-level heatmap, anyone looking at these host-level metrics could assume that the server is underutilized, apart from those occasional bursts of activity. But, because of our heatmap, we can prove that this is not the case.

 

What we can observe from this graph is that the sustained disk bandwidth the customer is getting from their SASWORK storage device is clearly not enough for an 8 vCPU machine (note the scale on the right-hand y-axis). However, when provided with sufficient bandwidth during those ~800MB/sec bursts, the SAS workload successfully utilizes ~80% of available CPU, which is exactly the optimal performance we would expect to see. The peaks in this graph line up perfectly with the dark phases in our heatmap, and the bursting pattern seen here is typical of storage that depends on a small fast cache capable of providing the required bandwidth.

 

We're only getting our required throughput while the cache is available, for a few seconds at a time - enough to get past a simple post-install infrastructure validation test, for example. At all other times, while that cache is getting flushed to disk, we get the underlying disk's true sustained rate of throughput. That real, sustained rate is clearly not enough to feed the CPU the data it needs. That is why we see the phasing patterns in our heatmaps, and that is why the end users are suffering and complaining about unusually poor performance.
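
One way to see the difference between burst and sustained rates for yourself is to run a write test that is deliberately larger than any cache in the storage path. The DATA step below is a minimal sketch of that idea, not a formal SAS I/O validation tool; the row count and record length are illustrative and should be sized to comfortably exceed the cache you suspect is in play.

options fullstimer;            /* report real and CPU time in the log */

/* Write roughly 40 GB to WORK so the elapsed time reflects the true  */
/* sustained rate of the SASWORK volume rather than its burst rate.   */
data work.iotest;
   length pad $ 800;
   pad = repeat('X', 799);     /* ~800 bytes of filler per row        */
   do i = 1 to 50e6;           /* ~50 million rows; adjust to taste   */
      output;
   end;
run;

Dividing the size of WORK.IOTEST by the real time reported in the log gives an estimate of the sustained MB/second actually delivered; a run that starts fast and then collapses is exactly the caching behaviour described above.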

Conclusion

It's often tough talking to your IT department and trying to explain why you believe there is a problem with the underlying infrastructure they've provisioned for you. They'll look at the graphs they have, see CPU headroom, assume there's nothing wrong, and tell you the problem probably lies within your application. And to be fair to them, for most other applications they manage, and for most business users they interact with, that would be the correct answer. But SAS works a little differently, and SAS users aren't typical business users. Our clients very often find that the granular level of observability SAS Enterprise Session Monitor provides helps them either solve problems themselves or work together with IT to almost instantly resolve issues that they've been chasing for months - sometimes, years.

 

Find more articles from SAS Global Enablement and Learning here.
