Contemplating Uptime SLA for SAS

SAS software offers a myriad of solutions and products for delivering value to our customers. To achieve the reach and flexibility demanded by many enterprise environments, our software can be deployed in a variety of different ways ranging from simple SAS runtime installation on a user's desktop machine to multi-tiered deployments across many hosts in different data centers (and/or cloud providers) which service hundreds or thousands of users.

When it comes to planning for large-scale software deployments, many customer IT organizations want to establish service-level agreements (SLAs) for the system's availability to perform work (a.k.a. uptime). This is an area to visit with caution. The implications of contractually promising a system's uptime hinge on many factors, and those factors are not all in the software vendor's demesne. This applies equally to SAS. It is important to understand and document the expectations and responsibilities for ensuring a system's uptime - and also exactly what the definition of uptime really means for the site.

Let's take a look at some of the uptime considerations for a SAS deployment.

Typical Uptime Values

Customers want the software they pay for to be available for work whenever they need it. That is balanced against known challenges where the software system requires occasional maintenance, underlying hardware needs repair or replacement, unexpected user activity (malicious or not) consumes more resources than normal, etc.

The following chart from Wikipedia shows the implications of adding "another 9" to uptime targets:

Availability Measure	Downtime Per Year	Downtime Per Week
90% (one nine)	36.5 days	16.8 hours
99% (two nines)	3.65 days	1.68 hours
99.9% (three nines)	8.76 hours	10.1 minutes
99.99% (four nines)	52.6 minutes	1.01 minutes
99.999% (five nines)	5.25 minutes	6.05 seconds

Five 9's is a common target for equipment and services in the telephony industry - and notice that allows for less than 6 minutes of downtime in a year.
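The arithmetic behind the chart is simple enough to reproduce. The following is a minimal sketch, assuming a 365-day year and a 7-day week (the function and constant names are illustrative only), so results may differ from the chart in the last digit due to rounding:

```python
# Downtime permitted by an availability target.
# Assumption: 365-day year and 7-day week.
HOURS_PER_YEAR = 365 * 24
HOURS_PER_WEEK = 7 * 24


def allowed_downtime_hours(availability_pct, period_hours):
    """Return the hours of downtime permitted within a period."""
    return (1 - availability_pct / 100) * period_hours


for pct in (90, 99, 99.9, 99.99, 99.999):
    per_year = allowed_downtime_hours(pct, HOURS_PER_YEAR)
    per_week = allowed_downtime_hours(pct, HOURS_PER_WEEK)
    print(f"{pct}%: {per_year:.2f} hours/year, {per_week * 60:.2f} minutes/week")
```

Running it for a candidate target is a quick way to show a customer what "one more 9" means in hours or minutes.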

While SAS software is usually an important business tool, it's not commonly subject to that high level of uptime SLA. 90 - 99% is a more reasonable high-end goal we expect to see for SAS, but even that requires clear definition and planning with your customer's IT to achieve. Further, achieving that high level of availability comes with higher costs and requirements which increase non-linearly as more nines are added behind the decimal.

Infrastructure

SAS software is often an end-user's primary touchpoint to their enterprise analytics and reporting solution. When it is not available, the call goes out, "SAS is down!" Is that the moment when the uptime clock should stop ticking? Maybe, but there could be more to it once the problem identification process gets underway.

SAS software is dependent on the infrastructure it runs on. If there's a failure on the server, or elsewhere in the data center, or even in the cloud hosting provider, then SAS software might not be reachable. This is an important distinction when documenting and calculating the uptime SLA of the SAS software.

For example, Amazon Web Services explains they "will use commercially reasonable efforts to make the Included Services each available for each AWS region with a Monthly Uptime Percentage of at least 99.99%." Microsoft Azure breaks down its SLA per service, and for its Cloud Services and Virtual Machines offers at least 99.9% for any single instance. Google Cloud Services provides a similar breakdown and offers minimum 99.5% uptime for single instances of its Compute Engine. These same factors should be weighed for a customer's on-premises data center as well (if those statistics and SLAs are known).

Without accounting for these uptime SLAs from the infrastructure providers, it would be irresponsible to agree to any higher SLA for SAS software in those environments. In other words, don't contractually promise that a SAS solution will be available 99.999% of the time on a system which only guarantees 99.5% uptime unless you break out exactly which elements of that uptime can be reasonably accounted for.
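To make that ceiling concrete, here is a rough sketch. It assumes the SAS software depends entirely on the infrastructure and that failures in the two layers are independent, so the combined availability is the product of the two. The function name is hypothetical and the figures are examples only:

```python
def required_software_availability(target, infrastructure):
    """Availability the software layer alone must reach so that
    software * infrastructure >= target.

    Assumption: the software depends fully on the infrastructure and the
    two layers fail independently. Returns None when the target is higher
    than what the infrastructure itself guarantees.
    """
    if target > infrastructure:
        return None
    return target / infrastructure


# A 99% end-to-end target on infrastructure guaranteed at 99.5%
print(required_software_availability(0.99, 0.995))

# A 99.999% target on the same infrastructure cannot be met
print(required_software_availability(0.99999, 0.995))  # None
```

Even when the target is reachable, the software layer has to do better than the headline number to leave room for the infrastructure's own allowance.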

Maintenance & Administration

SAS software requires regular maintenance and administration. Updates are released on a planned schedule and hotfixes more frequently. Applying those updates and hotfixes might require taking a portion or even the entire SAS solution offline. Further, other processes, such as regular backups, may require an interruption to service availability to ensure data integrity at all levels.

And while an infrastructure provider may guarantee the availability of a host machine's hardware, the ongoing operation of a site will necessitate updates to related components such as the operating system, hypervisor software, storage solution, etc.

These are examples of planned outages. The time required to implement planned outages should not be counted against a service's uptime SLA.

SAS High Availability

SAS solutions offer high availability options to improve service availability in the face of unplanned outages. And there are some circumstances where the HA ability of SAS can reduce the amount of planned downtime for a SAS service due to system maintenance… but not universally in all circumstances.

Some quick examples:

Improved availability against unplanned outages:
In SAS Viya, multiple instances of individual microservices can be deployed to different hosts. So if one host fails, the sibling microservice(s) on other host(s) can provide full functionality without interruption.
Reduce planned downtime for system maintenance:
The SAS 9.4 Grid Manager solution offers features to isolate worker hosts and quiesce jobs on demand. The grid remains available for work at a diminished capacity, but some hosts are effectively offline allowing the opportunity to perform some kinds of maintenance and testing (typically OS-level updates) without affecting the overall availability of the SAS service to end-users.
Resilient, but requiring planned outage to fully recover:
In SAS Viya, an MPP deployment of the SAS Cloud Analytic Services can include an optional Secondary CAS Controller. If the Primary CAS Controller goes down, then the Secondary CAS Controller will ensure continuity of operations. However, this mode is only temporary. Once the root cause has been addressed and the Primary CAS Controller brought back online, a planned service outage of the entire CAS service is required to resume normal operation with the Primary CAS Controller managing the cluster.

The last two bullets illustrate some special considerations for calculating the uptime of SAS in support of an SLA: partial availability with diminished capacity as well as planned outage to accomplish complete recovery even without an interruption of service during the critical event.

Recovery

Assuming an unplanned outage has occurred, and the SAS solution must be restarted from scratch, when should the uptime clock start ticking for SLA? Is it when the end-users can resume normal work? Maybe. It's certainly a point to examine.

But also consider that for some large-scale implementations of SAS Viya, a significant volume of data might need to be reloaded into the SAS Cloud Analytic Services. Depending on the transfer rate, this might occur in seconds, minutes, or even hours. So when the SAS solution is actively running properly on its way to restore normal operations for end-users, then that should arguably be considered when defining the SLA for uptime as well.
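A back-of-the-envelope estimate helps when deciding how to treat that reload window in the SLA. This sketch simply divides data volume by a sustained transfer rate. The volumes and rates shown are hypothetical placeholders, not measurements from any SAS deployment:

```python
def reload_time_seconds(data_gb, transfer_rate_gb_per_s):
    """Estimate the time to reload data at a sustained transfer rate."""
    if transfer_rate_gb_per_s <= 0:
        raise ValueError("transfer rate must be positive")
    return data_gb / transfer_rate_gb_per_s


# Hypothetical figures for illustration only
small = reload_time_seconds(data_gb=50, transfer_rate_gb_per_s=1.0)
large = reload_time_seconds(data_gb=5000, transfer_rate_gb_per_s=0.5)

print(f"50 GB at 1 GB/s: {small:.0f} seconds")
print(f"5 TB at 0.5 GB/s: {large / 3600:.1f} hours")
```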

Unplanned Outages

There are more reasons than can be counted which might cause unplanned outages. Normally when trying to determine the uptime of SAS for SLA purposes, we'd be on the lookout for software bugs or crashes, errors in the patching/update process, and similar problems that interrupt the availability of the service. And if those occur, then it should be counted as downtime.

Other reasons, however, might need root-cause analysis to determine whether the incident is something that the SAS software should manage gracefully or an external problem that affected SAS. So a process should be in place which determines whether an incident should count against SAS' uptime SLA. Consider the following examples for the risk they might have on the uptime of the system:

Neglecting the system, avoiding updates and patches
Not testing the application process of updates and patches on a non-production environment
Avoiding regular system maintenance, such as backups, and having no good point to restore to, meaning lost work must be recreated
Overloading the system with too much work, too many users, overcommitting resources like RAM, CPU, and disk, and so on. This could either be malicious (like DDoS attacks) or risky behavior (allowing untrained users unrestricted access).

Risky behavior is the hardest to quantify and plan for. SAS software offers powerful data manipulation and analysis capabilities to get work done. Overly protecting those capabilities might put an undue burden on users to accomplish those jobs. On the other hand, inadequately trained users or undersized hardware might inadvertently overwhelm the system.

The end result of these kinds of problems is likely an interruption of services for SAS users. But the determination of root cause will affect the accounting of time when calculating SAS' ability to meet its SLA.

RPO & RTO

Outages, either planned or unplanned, are bound to happen. Many organizations define metrics to capture and measure the impact those outages have on the business.

The RTO (recovery time objective) describes how much time an application can be down without significantly impacting the business. Many implementations of SAS software are not critical to the core business of a company, and so this time could be measured in hours or days or longer.

The RPO (recovery point objective) is another standard metric which refers to the amount of data that can be lost without significant harm to the company. Again, SAS often is used to analyze data which was collected by other processes, and so this measurement might also typically range as hours, days, or longer for SAS as well.

Of course, the definition of these metrics could change based on one's perspective. If instead of looking at the company as a whole, we defined these for a process of which SAS is an intrinsic component, then the target values might be much lower. For example, consider that SAS Event Stream Processing software might be deployed to edge nodes for data collection, as well as to the intermediate components which bring that data into the enterprise.

The point is, when defining metrics such as RPO and RTO (and MTTF, MTBF, MTTR, etc.) that contribute to SLA, take the time to clearly explain what they mean for your customer, the business process, and for SAS.
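As one illustration of how such metrics feed into an availability figure, a commonly used steady-state relationship is availability = MTBF / (MTBF + MTTR), with MTBF taken as the mean operating time between failures and MTTR as the mean time to repair. The sketch below uses hypothetical figures, and the function name is illustrative:

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR).

    Assumption: MTBF is the mean operating time between failures and
    MTTR is the mean time to repair, both in hours.
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical: one failure every 30 days (720 h), 4 hours to recover
availability = steady_state_availability(mtbf_hours=720, mttr_hours=4)
print(f"{availability:.4%}")
```

Whether MTTR stops at service restart or includes data reload is exactly the kind of definition worth agreeing on up front.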

Calculating Uptime

This is where the challenge lies. Customers often come to the table simply requesting something like 99.5% uptime for their software solutions - or even higher for mission critical systems - but without any real qualifications or definitions as to what that means and how to calculate it.

It's in everyone's best interest to acknowledge that the criteria for an uptime SLA vary by site, by the customer IT team's practices, by software capabilities, and by user interaction. As discussed above, the perception of end-users as to the availability of the system is only part of the equation. There could be a lot under the surface that users are not aware of that should be accounted for when calculating a system's uptime.

So here are some guidelines to discuss with your customer for planning to calculate the availability SLA of SAS software.

Activities that should count towards Uptime (numerator):

Start with the time the system is up and operating as normal, fully available to end users (obviously)
Include time when running at diminished capacity (due to the loss of a cluster node) where the service is still available on the remaining hosts
Include time after successful startup for recovery, when services are running, but the system is performing necessary work to load data and other operations required for end-users to resume their tasks
Subtract time spent in unplanned outages which the system failed to handle as intended.

Activities included in counting the Total Time (denominator):

Start with the total amount of time since the production system went live
Subtract time for planned outages for updates and administration (such as weekend activities including full system backups)
Subtract time for unplanned outages where the responsibility is beyond SAS software's ability to control
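The guidelines above can be sketched as a small calculation. This is one possible reading, not a definitive method: planned outages and externally caused outages are removed from the denominator, and only outages attributed to SAS are removed from the numerator. Time at diminished capacity and time spent reloading data after a successful restart count as uptime, so they are not subtracted. The class, field names, and figures are all hypothetical:

```python
from dataclasses import dataclass


@dataclass
class UptimeLedger:
    total_hours: float            # time since the production system went live
    planned_outage_hours: float   # updates, backups: excluded from total time
    external_outage_hours: float  # unplanned, beyond SAS control: excluded
    sas_outage_hours: float       # unplanned, attributed to SAS: downtime

    def denominator(self):
        """Total time that SAS is accountable for."""
        return (self.total_hours
                - self.planned_outage_hours
                - self.external_outage_hours)

    def numerator(self):
        """Accountable time during which the service counted as up."""
        return self.denominator() - self.sas_outage_hours

    def uptime_pct(self):
        return 100 * self.numerator() / self.denominator()


# Hypothetical year: 2 hours of planned maintenance per week,
# 6 hours lost to a data center issue, 4 hours lost to a failed patch
ledger = UptimeLedger(total_hours=8760, planned_outage_hours=104,
                      external_outage_hours=6, sas_outage_hours=4)
print(f"Uptime for SLA purposes: {ledger.uptime_pct():.3f}%")
```

Keeping each category as a separate entry makes it easier to revisit the figure if root-cause analysis later reclassifies an incident.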

When writing the documentation which defines a SAS software project's scope, requirements, design, implementation, and ongoing administration, it is important to identify and refine the definition of these factors and their impact to the calculation of the SLA for uptime.

More Information

If you'd like to understand more about the high availability options offered by SAS software to help support uptime SLA, check out the following posts by Edoardo Riva:

JuanS_OCS · ‎08-24-2020

Definitely a must for every Architecture document and Service Agreement document. Thank you @RobCollum for this post.

Love it, very thorough. It seems to be an exceptional addition for the Checklist of SAS Administrator Tasks (9.4 and Viya) maintained by @DavidStern

If anyone is as interested as I am in taking it further from there:

The individual machine/service availability (say, in the Azure cloud, where the contract promises 99.99% availability) won't be enough for systems composed of more than one server or service.

This is especially relevant for complex systems with different tiers, such as SAS Grid Manager or SAS Viya, or highly available ones, where both multiple SAS nodes and a distributed file system are involved, and each can have considerable complexity and its own availability.

Serial and parallel dependencies matter for both the sub-calculations and the final calculation.

[Figure: reliability block diagram]

In the picture above, you could imagine a relatively simple SAS system with a SAS Metadata server, a few SAS Compute nodes, a SAS Web Application server and a company reverse proxy, without considering items such as databases or storage.

Parallel Availability

To calculate (two identical components in parallel): A = 1 - (1 - A_x)²

Serial Availability

To calculate (two components in series): A = A_x × A_y
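Both formulas generalize to any number of components, which makes them easy to script. The sketch below assumes component failures are independent. The layout (reverse proxy, web application server, metadata server, and three compute nodes in parallel) and the 99.9% figure used for every component are hypothetical:

```python
import math


def serial(*availabilities):
    """All components are required: multiply the availabilities."""
    return math.prod(availabilities)


def parallel(*availabilities):
    """Any one component is sufficient: 1 minus the product of the
    individual unavailabilities. For two identical components this
    reduces to 1 - (1 - A_x)**2."""
    return 1 - math.prod(1 - a for a in availabilities)


# Hypothetical system: every component assumed to be 99.9% available
A = 0.999
compute_tier = parallel(A, A, A)        # three compute nodes
system = serial(A, A, A, compute_tier)  # proxy, web, metadata, compute tier

print(f"Compute tier: {compute_tier:.7%}")
print(f"Whole system: {system:.4%}")
```

Note how the serial chain pulls the overall figure below that of any single component, while the redundant compute tier contributes almost nothing to the loss.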

There are tools and programs out there to help with this process. Some examples:

https://www.bmc.com/blogs/system-reliability-availability-calculations/

https://www.delaat.net/rp/2013-2014/p17/report.pdf

https://www.eventhelix.com/RealtimeMantra/FaultHandling/system_reliability_availability.htm

https://www.weibull.com/hotwire/issue79/relbasics79.htm

I must admit, this is not an easy topic if you go into enough detail. It can be a bit overwhelming in the beginning. But there is plenty of help, and once you practice a few times, it becomes easier.

@RobCollum, I am wondering, is there a tool or tip that you use or would recommend to help with uptime calculations for complex SAS systems?