SAS software offers a myriad of solutions and products for delivering value to our customers. To achieve the reach and flexibility demanded by many enterprise environments, our software can be deployed in a variety of different ways ranging from simple SAS runtime installation on a user's desktop machine to multi-tiered deployments across many hosts in different data centers (and/or cloud providers) which service hundreds or thousands of users.
When it comes to planning large-scale software deployments, many customer IT organizations want to establish service-level agreements (SLAs) for the system's availability to perform work (a.k.a. uptime). This is an area to approach with caution. The implications of contractually promising a system's uptime hinge on many factors, and those factors are not all within the software vendor's domain. This applies equally to SAS. It is important to understand and document the expectations and responsibilities for ensuring a system's uptime - and also exactly what the definition of uptime really means for the site.
Let's take a look at some of the uptime considerations for a SAS deployment.
Customers want the software they pay for to be available for work whenever they need it. That is balanced against known challenges where the software system requires occasional maintenance, underlying hardware needs repair or replacement, unexpected user activity (malicious or not) consumes more resources than normal, etc.
The following chart from Wikipedia shows the implications of adding "another 9" to uptime targets:
Availability Measure | Downtime Per Year | Downtime Per Week |
90% (one nine) | 36.5 days | 16.8 hours |
99% (two nines) | 3.65 days | 1.68 hours |
99.9% (three nines) | 8.76 hours | 10.1 minutes |
99.99% (four nines) | 52.6 minutes | 1.01 minutes |
99.999% (five nines) | 5.26 minutes | 6.05 seconds |
Five 9's is a common target for equipment and services in the telephony industry - and notice that allows for less than 6 minutes of downtime in a year.
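The figures in the chart follow directly from the availability percentage. As a minimal sketch (the function name `allowed_downtime` and the 365-day year are assumptions made for this illustration), the allowed downtime for any target and period can be computed like this:

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Return the downtime permitted within `period` at the given availability (%)."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return period * (1.0 - availability_pct / 100.0)


YEAR = timedelta(days=365)  # assumption: a non-leap year
WEEK = timedelta(days=7)

for pct in (90.0, 99.0, 99.9, 99.99, 99.999):
    per_year = allowed_downtime(pct, YEAR)
    per_week = allowed_downtime(pct, WEEK)
    print(
        f"{pct}%: {per_year.total_seconds() / 3600:.2f} hours/year, "
        f"{per_week.total_seconds() / 60:.2f} minutes/week"
    )
```

Running the same function against a month or a quarter is a quick way to see what a proposed target means for the reporting period your customer actually uses.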
While SAS software is usually an important business tool, it's not commonly subject to that high level of uptime SLA. 90 - 99% is a more reasonable high-end goal we expect to see for SAS, but even that requires clear definition and planning with your customer's IT to achieve. Further, achieving that high level of availability comes with higher costs and requirements which increase non-linearly as more nines are added behind the decimal.
SAS software is often an end-user's primary touchpoint to their enterprise analytics and reporting solution. When it's not available, the call goes out, "SAS is down!" Is that the moment when the uptime clock should stop ticking? Maybe, but there could be more to it once the problem identification process gets underway.
SAS software is dependent on the infrastructure it runs on. If there's a failure on the server, or elsewhere in the data center, or even in the cloud hosting provider, then SAS software might not be reachable. This is an important distinction when documenting and calculating the uptime SLA of the SAS software.
For example, Amazon Web Services explains they "will use commercially reasonable efforts to make the Included Services each available for each AWS region with a Monthly Uptime Percentage of at least 99.99%." Microsoft Azure breaks down its SLA per service, and for its Cloud Services and Virtual Machines offers at least 99.9% for any single instance. Google Cloud Services provides a similar breakdown and offers minimum 99.5% uptime for single instances of its Compute Engine. These same factors should be weighed for a customer's on-premises data center as well (if those statistics and SLAs are known).
Without accounting for these uptime SLAs from the infrastructure providers, it would be irresponsible to agree to any higher SLA for SAS software in those environments. In other words, don't contractually promise that a SAS solution will be available 99.999% of the time on a system which only guarantees 99.5% uptime unless you break out exactly which elements of that uptime can be reasonably accounted for.
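As a rough worked example, assume the infrastructure and the SAS software layer fail independently and that both must be up for users to work. The 99.5% figure is the infrastructure guarantee mentioned above; the 99.9% figure for the SAS layer is a hypothetical value chosen only for illustration.

```latex
A_{\text{combined}} = A_{\text{infra}} \times A_{\text{SAS}}
                    = 0.995 \times 0.999
                    = 0.994005 \approx 99.4\%
```

Under these assumptions the combined availability is always lower than the weakest layer, so the infrastructure's 99.5% acts as a ceiling on anything promised for the solution as a whole.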
SAS software requires regular maintenance and administration. Updates are released on a planned schedule and hotfixes more frequently. Applying those updates and hotfixes might require taking a portion or even the entire SAS solution offline. Further, other processes, such as regular backups, may require an interruption to service availability to ensure data integrity at all levels.
And while an infrastructure provider may guarantee the availability of a host machine's hardware, the ongoing operation of a site will necessitate updates to related components such as the operating system, hypervisor software, storage solution, etc.
These are examples of planned outages. The time required to implement planned outages should not be counted against a service's uptime SLA.
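One way to express that rule, offered as a sketch rather than a standard formula, is to remove planned outage windows from the measured period before computing the percentage. The function name `sla_availability` and the monthly figures below are hypothetical.

```python
def sla_availability(
    total_hours: float,
    planned_outage_hours: float,
    unplanned_outage_hours: float,
) -> float:
    """Availability (%) with planned maintenance excluded from the measured window.

    Assumption: planned outages shrink the denominator instead of counting as downtime.
    """
    scheduled_hours = total_hours - planned_outage_hours
    if scheduled_hours <= 0:
        raise ValueError("planned outages cover the entire period")
    uptime_hours = scheduled_hours - unplanned_outage_hours
    return 100.0 * uptime_hours / scheduled_hours


# Hypothetical 30-day month: 8 hours of planned patching, 2 hours of unplanned outage.
excluding_planned = sla_availability(720, 8, 2)
counting_planned = sla_availability(720, 0, 8 + 2)  # same month, nothing excluded
print(f"planned outages excluded: {excluding_planned:.2f}%")  # 99.72%
print(f"planned outages counted:  {counting_planned:.2f}%")   # 98.61%
```

The gap between the two results shows why the treatment of planned outages needs to be written into the SLA before anyone starts measuring.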
SAS solutions offer high availability options to improve service availability in the face of unplanned outages. And there are some circumstances where the HA ability of SAS can reduce the amount of planned downtime for a SAS service due to system maintenance… but not universally in all circumstances.
Some quick examples:
The last two bullets illustrate some special considerations for calculating the uptime of SAS in support of an SLA: partial availability with diminished capacity as well as planned outage to accomplish complete recovery even without an interruption of service during the critical event.
Assuming an unplanned outage has occurred, and the SAS solution must be restarted from scratch, when should the uptime clock start ticking for SLA? Is it when the end-users can resume normal work? Maybe. It's certainly a point to examine.
But also consider that for some large-scale implementations of SAS Viya, a significant volume of data might need to be reloaded into the SAS Cloud Analytic Services. Depending on the transfer rate, this might occur in seconds, minutes, or even hours. So when the SAS solution is actively running properly on its way to restore normal operations for end-users, then that should arguably be considered when defining the SLA for uptime as well.
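A back-of-the-envelope estimate shows how widely that reload time can vary. This is a sketch only: the data volumes and the 500 MB/s sustained transfer rate are hypothetical figures, not measurements of any SAS deployment, and the helper `reload_time_seconds` is introduced just for this example.

```python
def reload_time_seconds(data_gb: float, rate_mb_per_s: float) -> float:
    """Estimate the seconds needed to reload `data_gb` at a sustained transfer rate.

    Uses decimal units (1 GB = 1000 MB) and ignores any processing overhead.
    """
    if rate_mb_per_s <= 0:
        raise ValueError("transfer rate must be positive")
    return data_gb * 1000.0 / rate_mb_per_s


# Hypothetical data volumes, all at an assumed 500 MB/s.
for label, size_gb in (("10 GB", 10), ("500 GB", 500), ("5 TB", 5000)):
    seconds = reload_time_seconds(size_gb, 500)
    print(f"{label}: {seconds:.0f} s ({seconds / 3600:.2f} h)")
```

With these assumed numbers the reload ranges from 20 seconds to nearly three hours, which is why the SLA definition should state whether reload time counts as up or down.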
There are more reasons than can be counted which might cause unplanned outages. Normally when trying to determine the uptime of SAS for SLA purposes, we'd be on the lookout for software bugs or crashes, errors in the patching/update process, and similar problems that interrupt the availability of the service. And if those occur, then it should be counted as downtime.
Other incidents, however, might need root-cause analysis to determine whether the cause is something that SAS software should manage gracefully or an external problem that affected SAS. So a process should be in place which determines whether an incident should count against SAS' uptime SLA. Consider the following examples for the risk they might pose to the uptime of the system:
Risky behavior is the hardest to quantify and plan for. SAS software offers powerful data manipulation and analysis capabilities to get work done. Overly protecting those capabilities might put an undue burden on users to accomplish those jobs. On the other hand, inadequately trained users or undersized hardware might inadvertently overwhelm the system.
The end result of these kinds of problems is likely an interruption of services for SAS users. But the determination of root cause will affect the accounting of time when calculating SAS' ability to meet its SLA.
Outages, either planned or unplanned, are bound to happen. Many organizations define metrics to capture and measure the impact those outages have on the business.
The RTO (recovery time objective) describes how much time an application can be down without significantly impacting the business. Many implementations of SAS software are not critical to the core business of a company, and so this time could be measured in hours or days or longer.
The RPO (recovery point objective) is another standard metric which refers to the amount of data that can be lost without significant harm to the company. Again, SAS is often used to analyze data which was collected by other processes, and so this measurement might also typically range from hours to days or longer for SAS.
Of course, the definition of these metrics could change based on one's perspective. If instead of looking at the company as a whole, we defined these for a process of which SAS is an intrinsic component, then the target values might be much lower. For example, consider that SAS Event Stream Processing software might be deployed to edge nodes for data collection, as well as to the intermediate components which bring that data into the enterprise.
The point is, when defining metrics such as RPO and RTO (and MTTF, MTBF, MTTR, etc.) that contribute to SLA, take the time to clearly explain what they mean for your customer, the business process, and for SAS.
This is where the challenge lies. Customers often come to the table simply requesting something like 99.5% uptime for their software solutions - or even higher for mission critical systems - but without any real qualifications or definitions as to what that means and how to calculate it.
It's in everyone's best interest to acknowledge that determining the criteria for an uptime SLA varies by site, by the customer IT team's practices, by software capabilities, and by user interaction. As discussed above, the perception of end-users as to the availability of the system is only part of the equation. There could be a lot under the surface that users are not aware of that should be accounted for when calculating a system's uptime.
So here are some guidelines to discuss with your customer for planning to calculate the availability SLA of SAS software.
Activities that should count towards Uptime (numerator):
Activities included in counting the Total Time (denominator):
When writing the documentation which defines a SAS software project's scope, requirements, design, implementation, and ongoing administration, it is important to identify and refine the definition of these factors and their impact to the calculation of the SLA for uptime.
If you'd like to understand more about the high availability options offered by SAS software to help support uptime SLA, check out the following posts by Edoardo Riva:
Definitely a must for every Architecture document and Service Agreement document. Thank you @RobCollum for this post.
Love it, very thorough. It seems to be an exceptional addition to the Check list of SAS Administrator Tasks (9.4 and Viya) maintained by @DavidStern.
If anyone is interested as I am, to take it from there:
The individual machine/service availability (say, in the Azure cloud, where the contract promises 99.99% availability) won't be enough for systems composed of more than one server or service.
This is especially relevant for complex systems with different tiers, such as SAS Grid Manager or SAS Viya, or for highly available ones, where both multiple SAS nodes and a distributed file system are involved, each with its own complexity and availability.
Serial and parallel dependencies both matter in the sub-calculations and in the final calculation.
In this picture from above you could imagine a relatively simple SAS system, where we can have a SAS Metadata server, a few SAS Compute nodes, a SAS Web Application server and a company reverse proxy, without considering items such as databases or storage.
Parallel Availability
to calculate: $A = 1 - (1 - A_x)^2$
Serial Availability
to calculate: $A = A_x A_y$
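The two formulas above can be sketched in a few lines of Python. This is only an illustration: it assumes component failures are independent, it generalizes the parallel case from two identical components to any number of redundant ones, and it treats a parallel group as available if any single member is up (ignoring diminished capacity). Every availability figure below is hypothetical.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """Availability of components that must ALL be up (product of availabilities)."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Availability of redundant components where ANY one being up is enough."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical per-component availabilities for the system described above.
metadata_server = 0.999
compute_nodes = parallel(0.999, 0.999, 0.999)  # three redundant compute nodes
web_app_server = 0.999
reverse_proxy = 0.999

system = serial(metadata_server, compute_nodes, web_app_server, reverse_proxy)
print(f"compute tier: {compute_nodes:.9f}")
print(f"whole system: {system:.6f}")
```

Note how the single-instance tiers in series dominate the result: the redundant compute tier is close to 1, yet the whole system ends up below any individual component's 99.9%.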
There are tools and programs out there to help with this process. Some examples:
https://www.bmc.com/blogs/system-reliability-availability-calculations/
https://www.delaat.net/rp/2013-2014/p17/report.pdf
https://www.eventhelix.com/RealtimeMantra/FaultHandling/system_reliability_availability.htm
https://www.weibull.com/hotwire/issue79/relbasics79.htm
I must admit, this is not an easy topic if you go into enough detail. It can be a bit overwhelming in the beginning. But there is plenty of help, and once you practice a few times, it becomes easier.
@RobCollum, I am wondering, is there a tool or tip that you use or would recommend to help with uptime calculations for complex SAS systems?