Jason1
Fluorite | Level 6

Hi all,

 

Does anyone know whether having SAS flows in a 'Suspend' status could affect overall grid performance?

 

We are having a lot of performance issues, so we're trying to find anything that could help.

 

Thanks in advance,

 

jason

1 ACCEPTED SOLUTION

boemskats
Lapis Lazuli | Level 10

Hi Jason,

 

This is a very good question. Yes, I do believe that suspended flows can have a considerable effect on performance. This is down to the combination of the 'disk-intensive' nature of some SAS processing and the way that operating systems deal with actually writing that data to disk. It is also a subject that @MargaretC has approached a few times, such as in her post from a couple of years ago titled 'When can too much memory hurt SAS'.

 

I'm not sure which OS you're using, but I'll assume it's Linux for the rest of this post. I'm not sure if LSF can suspend jobs on Windows. In any case, if you're using Windows, I'm sorry.

 

So, how does the kernel write data to disk? Quoting this article titled Linux Page Cache Basics:

If data is written [to disk], it is first written to the Page Cache [which is itself unused RAM] and managed as one of its dirty pages. Dirty means that the data is stored in the Page Cache, but needs to be written to the underlying storage device first. The content of these dirty pages is periodically transferred (as well as with the system calls sync or fsync) to the underlying storage device. The system may, in this last instance, be a RAID controller or the hard disk directly.

What this means is that if you have a job writing data to a SASWORK disk device, and the node it is running on has a large amount of otherwise unused memory, that job can flood the Page Cache with huge amounts of data which, as far as it is concerned, it is writing to an awesomely fast SASWORK disk. If that job is subsequently suspended by LSF, it will be stopped from using any more CPU resource, but the kernel's pdflush daemon will continue to sync the contents of the Page Cache to the disk device until all of that data is written (seeing as the kernel effectively told the job it had already been written to disk). This means that while your flow is 'suspended', the 'disk load' it generated while it was running can continue to have a latent effect on the performance of other jobs trying to use the same disk device. The severity of this effect will depend on a few things: the amount of free memory eligible to be used as page cache, the bandwidth available on your storage device, the point at which the program was suspended, and your kernel cache tuning configuration.
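If you want to watch this latent flushing happen on your own nodes, the kernel reports the amount of dirty and in-flight page cache data in /proc/meminfo. Here's a minimal sketch of reading it straight from a SAS session on the grid node (the Dirty and Writeback fields are standard on Linux; the dataset name is just for illustration):

/* how much page cache data is still waiting to be written out to disk */
data work.dirty_check;
  infile '/proc/meminfo' truncover;
  length field $16;
  input field $ kb;                  /* /proc/meminfo values are reported in kB */
  if field in ('Dirty:', 'Writeback:');
run;

proc print data=work.dirty_check noobs;
run;

Run that while a suspended job's cache is still draining and you'll see the Dirty figure falling even though the job itself is doing nothing.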

 

To help illustrate this, here's an example: we have a node with 128GB of RAM (nothing special by modern standards), 20 CPU cores (irrelevant here, as we will only use one), and a SASWORK disk which, although it has disproportionately little bandwidth (which suits the purposes of illustrating this point), is still faster than what I often see at some customer sites (120MB/sec).

 

We run the following code on the server, and halfway through its execution we 'suspend' it:

%let howbig=12e7;                            /* number of rows to generate (120 million) */

%esmtag(Create dataset);                     /* Boemska ESM macro - tags this step in the graphs below */
data sascoms;
  array me {1} $200;                         /* 200-byte character variable, purely to pad the record length */
  do id=1 to &howbig.;
    randid = round(ranuni(0) * &howbig.);    /* random integer between 0 and &howbig */
    output;
  end;
run;

This data step creates around 20GB of data in SASWORK (which is on our 140MB/sec disk device). Here is what that looks like on a default configuration of RHEL 7.4:

 

[Screenshot: withcache.png — job and node performance with default page cache settings]

 

 

First, some help interpreting these graphs:

  • the top graph shows the performance of the SAS job; the bottom graph shows the performance of the node over the same time period
  • the red area on both graphs is CPU: 100% on the top graph signifies one _thread_, while 100% on the bottom graph is the total CPU capacity available on the node
  • the green bars on both show write speed: on the top graph, the _rate at which the process is writing data to the kernel_ (i.e. writing to the cache); on the bottom graph, _the rate at which the kernel is actually writing that data to the device_ (i.e. flushing the page cache to disk)
  • finally, the grey area 'descending' from the top of the bottom (node) graph is the measured size of the page (buffer) cache, which includes both pages that have already been flushed to disk and pages that have yet to be flushed (dirty)

We can therefore observe the following: the job starts executing the code above at 15:17:42, writing to the kernel (page cache) at ~750MB/sec, which is the SASWORK throughput this piece of code needs in order to sustain near 100% CPU utilisation (i.e. to fully utilise a single thread). When the job is suspended around 25 seconds in, its CPU and IO load drop away to 0, and the cache stops growing (the grey area on the bottom graph). However, by this point, as far as the job is concerned it has written a 20GB dataset to SASWORK, and on the bottom graph you can see that the kernel continues to flush the page cache to the disk device even though the job is in a suspended state, continuing to max out the write throughput of that device. In total, it takes an extra 1m20s after the job is suspended to finish syncing the data it managed to 'write' to the cache in the 20 seconds it was active. In other words, in this example the 'latent dirty cache effect' lasts almost 4x longer than the actual runtime before suspension, and would almost certainly continue to impact the performance of any flows resumed following the first flow's suspension.
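A rough back-of-envelope check makes the scale of this clear (approximate figures only): ~750MB/sec for the ~20 seconds the job was active puts around 15GB into the page cache, while the disk can only drain it at ~120-140MB/sec. Even allowing for the ~2-3GB the kernel flushes while the job is still running, the remaining ~12GB needs somewhere around another 85-105 seconds to reach the disk, which is in the same ballpark as the extra 1m20s observed here.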

 

Luckily, like many other things on Linux, the size of this dirty page cache is tunable. Here is how the same program behaves when the vm.dirty_ratio tuning parameter is reduced from 40 to 1, telling the kernel that instead of the default (on RHEL 7) 40%, only 1% of total free memory should be used for the dirty page cache:
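For reference, a change like that is just a sysctl setting. A sketch only: check your own kernel's current values first, and test carefully before rolling anything out grid-wide, since this affects every process on the node.

# check the current values
sysctl vm.dirty_ratio vm.dirty_background_ratio

# lower the dirty page cache limit to 1% of memory for the running kernel
sysctl -w vm.dirty_ratio=1

# to make it persistent across reboots, add the same setting to /etc/sysctl.conf:
#   vm.dirty_ratio = 1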

 

[Screenshot: no cache.png — the same job with vm.dirty_ratio reduced to 1]

 

This time round the job starts executing at 15:23:50. The write throughput between the job and the cache initially spikes to 372MB/sec, but is almost immediately throttled down to a much saner 120MB/sec as soon as the (now much smaller) dirty page cache fills up, and the size of the cache (bottom graph) grows gradually, unlike before. As a result, when the job is suspended at 15:24:45, the kernel only takes another 10 seconds or so to finish flushing the dirty page cache to disk. Much better: this suspended job wouldn't affect the performance of other newly started flows anywhere near as much as the first one. And when the job is resumed at 15:25:30, it simply picks up where it left off.

 

So, there's our answer, right? In order to stop suspended flows adversely affecting the performance of active ones, we should simply make the dirty page cache tiny?

 

Not quite. Here's what happens when, instead of suspending that job mid-execution, we let it do its thing and carry on to completion:

 

[Screenshot: finishedjob.png — the same job allowed to run to completion]

When the job completes, it cleans up the SASWORK files that, as far as it is concerned, are already sitting in its work directory on disk; deleting them clears them from the dirty cache and stops them from being written out at all. Voila. Not only that, but the job completes in 30 seconds, rather than the 3+ minutes it would take if it had to rely on the SASWORK disk device throughput alone.

 

Of course, this is all for illustration purposes, and the severity of this effect will depend on the actual performance profile of your code. Even so, with our ESM customers we do see a surprising number of jobs in the wild that seem to create some sizable temporary files immediately before termination. I guess that if nothing else, this is also a good way of illustrating the importance of proactively deleting SASWORK datasets in your jobs as soon as you know they're no longer needed to save them being flushed to disk for no reason. The bigger your cache, the more difference doing this will make.
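As a concrete sketch of what that clean-up looks like (the dataset name here is hypothetical), dropping a large intermediate WORK dataset the moment you're done with it means any of its pages still sitting dirty in the cache never need to reach the disk at all:

/* drop a large intermediate dataset as soon as it's no longer needed */
/* (big_intermediate is a hypothetical name - substitute your own)    */
proc datasets lib=work nolist;
  delete big_intermediate;
quit;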

 

Now this may seem like an extreme example, as no job would write a temporary SASWORK file only to immediately delete it. But this is exactly how SAS uses UTILLOC, and it is why I consider UTILLOC to be the most significant element of this 'suspended job cache hangover effect'. SAS procedures which use UTILLOC for temporary storage normally only need those utility files for the duration of the step's execution, and they are deleted from the UTILLOC disk as soon as the step finishes. What this means is that suspending a job while it's halfway through a PROC SORT will produce exactly the detrimental performance effect shown in the first screenshot above, while letting it finish what it's doing and clean up would likely result in a much healthier performance profile, like the one shown in the third screenshot. It is in these scenarios that 'suspending' a flow hurts performance the most, and I do think the effect is very significant and, with the right tooling, very measurable.
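To make that concrete, here's a minimal sketch of the kind of step I mean. While this sort runs, its utility file lives in UTILLOC and much of it may only exist as dirty pages in the cache; suspend the job mid-sort and the kernel still has to flush all of that to disk, whereas if the sort is allowed to finish, the utility file is deleted and much of it may never need to be written out at all:

/* sorting the large dataset created earlier; PROC SORT writes its temporary */
/* utility file to UTILLOC for the duration of the step only                 */
proc sort data=work.sascoms out=work.sascoms_sorted;
  by randid;
run;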

 

So, TL;DR: yes, suspended flows can affect overall grid performance. My advice would therefore be to avoid suspending flows where possible, concentrating instead on optimising both your schedule and the efficiency of any jobs on the critical dependency path within it. If you do have to suspend flows, spend some time tuning your cache, and ensure that your config aligns with the tuning guidelines provided by Margaret's team in collaboration with Barry Marston and the folks at Red Hat.

 

Lastly, seeing as you're having issues with performance, I would highly recommend that you try using Boemska ESM, the performance tuning product for SAS GRID that you can see in these screenshots. We have clients that spent a lot of time and effort trying to improve the performance of their GRID environments using traditional tuning methods before trying ESM, and still managed to make 20-25% gains in performance and batch capacity within weeks of installing our product. I know @JuanS_OCS, for one, is a big fan :). If you're interested feel free to get in touch with me directly.


In any case, I hope this answers your question.

 

 

Nik

 


2 REPLIES

Jason1
Fluorite | Level 6

Hi, 

 

Thank you for that excellent response. I will raise the possibility of using the performance tool in the next weekly meeting.

 

Much appreciated.

 

jason
