Contemplating disaster recovery for SAS

Companies make significant investments in using SAS software: not just licensing software from SAS, but also provisioning hardware, budgeting for support services, training personnel, and ensuring upstream processes can supply the system with necessary data. The enterprise's customers rely on the insights from SAS to make the business decisions that run the company, both immediate actions and long-term strategies. If something significantly interrupts the ability of SAS software to operate, then many sites will want an established process to get back into operational mode as soon as is reasonably possible.

Understanding how the SAS software operates, as well as the business processes which SAS relies on, is critical to devising a plan to recover from a disaster scenario in which the continuity of normal day-to-day operations has been blocked.

Disaster recovery is not a feature of SAS software

There is no out-of-the-box feature of SAS software which provides the automatic ability to recover from a disaster scenario. However, SAS does offer plenty of features which can be used as part of a methodical plan to return business solutions into operation such as backups, migration, services failover, and more.

The point is that disaster recovery isn't a simple feature which is checked off a list. Instead, preparing for disaster recovery requires planning in partnership with the customer and their IT team to devise a process which can be executed when needed. As it turns out, the first step of planning is ensuring all participants agree on what constitutes a disaster and what constitutes recovery.

What is a disaster?

A disaster from a software services perspective can run a wide gamut and is based on numerous factors. In basic terms, a disaster occurs when software operations cannot resume in their normal mode of operation. So, for example, an extended power outage knocking out a customer's primary data center hosting the SAS solution would be a disaster.

A disaster is usually classified as something bigger than a single service failure. For a single service failure, normal availability operations - such as rebooting a host machine - would be sufficient action. Work with your customer to ensure that the definition of a disaster makes sense.

What is recovery from disaster?

Again, this definition will depend on many factors. Basically, it's the process which brings SAS services online after a disaster event. This process might be completely manual, fully automatic, or some combination. Furthermore, executing all steps of a disaster recovery process could span hours, days, or even weeks. The key criterion is that the process is well documented and understood. Ideally, it's tested and validated on a regular schedule.

Try to avoid conflating high availability and backup processes with disaster recovery. While those concepts (and others) play an important role, they are not sufficient to recover from disaster on their own.

Multiple physical sites

When considering improving the availability of a service, one of the primary goals is to eliminate any and all single points of failure. If all of your mission-critical data exists on one disk and nowhere else, that's bad. And we can extend this concept to other solution components like hardware (power supplies, network interfaces, racks) and software (clustering, failover, load balancing).

The entire data center where your customer's SAS solution resides is also a single point of failure - even if it's big and complex and filled with millions of discrete components. If a natural event like a flood or hurricane shuts down the data center, then the SAS solution is unavailable. So many organizations elect to operate multiple data centers in different geographic regions to guard against this problem. For example, Amazon Web Services offers Regions and Availability Zones so that virtual resources can be hosted in different parts of the world.

What to avoid for SAS in disaster recovery scenarios

Remember that a software product's high availability features do not constitute a viable disaster recovery process. SAS solutions are composed of a variety of individual software products. Many of those products provide some form of clustering technology to improve availability (see 12 SAS Cluster Technologies).

We cannot use this kind of clustering to implement an automatic disaster recovery process. The clustering technologies employed by SAS solutions are designed around the expectation that the members of the cluster reside in near proximity to each other. Attempting to place some members far away in a different regional data center breaks this assumption and can lead to unexpected results. Don't do it.

Planning for SAS in disaster recovery scenarios

Instead of splitting a cluster of SAS services across sites, the preferred approach is to deploy separate, independent instances of the SAS solution at each site. The challenges then to address are:

keeping the SAS sites in synchronization
determining when the DR site should become operational

Neither of these is trivial. Synchronization will require a process which copies updates from the primary site over to the DR site on some regular basis. And often customers prefer to have their hardware and software investments actively working - so they'll ask about having the DR site running as an active participant in production workload, not just sitting by passively until an unlikely disaster event occurs. Determining if that's possible, cost-effective, and desirable is yet another discussion.

Furthermore, discuss which pieces of the SAS solution are mission critical to business operations. Instead of fully deploying a completely equivalent DR environment, perhaps all that is required will be some select components which can run at a reduced capacity. This scaled down approach would likely reduce implementation and testing complexity, ultimately saving the customer a significant amount of time, resources, and money.

It's all about the data

Whatever approach is implemented to recover SAS services after a disaster, the data is what the users need. Without it, the services are useless. There are many different kinds of data to consider:

Data mart: the data provided to gain insights, make decisions, and run the business
User data: the SAS users will have their own data - either inputs to the process or output from it
User content: reports, models, visualizations, etc.
Metadata: the data used to operate the environment, including descriptive information, authorization rules, and other ancillary items

Assuming a DR site with its own independent SAS deployment, separate from the main SAS operations in the primary data center, each of those data types will need a consistent and integrated approach to keep things in sync. It does little good to synchronize user content alone without relevant metadata and data mart files that are equally up to date.
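As a rough illustration of what "equally up to date" could mean in practice, here is a minimal Python sketch that checks whether the copies of the four kinds of data listed above were taken close enough together to be restored as a set. The category names, the 30-minute tolerance, and the function itself are hypothetical; nothing here is a SAS feature or API.

```python
from datetime import datetime, timedelta

# Hypothetical category names mirroring the four kinds of data listed above.
REQUIRED_CATEGORIES = ("data_mart", "user_data", "user_content", "metadata")


def find_sync_gaps(snapshot_times, max_skew=timedelta(minutes=30)):
    """Return a list of problems that make a set of DR copies inconsistent.

    snapshot_times maps a category name to the datetime its copy was taken.
    max_skew is an assumed tolerance; pick one that suits the site.
    """
    problems = []

    # Every category must be present, or the restored site is incomplete.
    for category in REQUIRED_CATEGORIES:
        if category not in snapshot_times:
            problems.append(f"missing snapshot for {category}")

    # Copies that lag far behind the newest one are out of step with it.
    present = {c: t for c, t in snapshot_times.items() if c in REQUIRED_CATEGORIES}
    if present:
        newest = max(present.values())
        for category, taken in sorted(present.items()):
            if newest - taken > max_skew:
                problems.append(f"{category} lags newest snapshot by {newest - taken}")

    return problems
```

An empty list means the copies can reasonably be treated as one coherent set; anything else is worth investigating before that set is relied on for recovery.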

The simplest concept is to perform a full backup of the primary SAS site and then copy that backup to the DR site where it can be migrated/promoted into the other SAS deployment. In order to perform a coherent and complete backup, the primary SAS services must be offline since many services keep data active in-memory during normal operation. In other words, to prepare for effective recovery from unexpected long-term outages caused by disaster, we must implement regular, planned, short-term outages of the primary SAS site.

It's extremely difficult to achieve up-to-the-minute synchronization. It's far more likely that synchronization will only take place daily or weekly or even monthly. This frequency will depend on the site's tolerance for interim data loss weighed against service availability.
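The trade-off between interim data loss and service availability can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic, not a SAS tool: it assumes backups start on a fixed interval, that the primary is offline for the duration of each backup, and that a copy is only usable at the DR site once it has been fully transferred. The function and parameter names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(sync_interval, backup_duration, transfer_duration):
    """Upper bound on how stale the DR copy can be when disaster strikes.

    Just before the next copy lands at the DR site, the newest usable copy
    still reflects the primary as of the start of the previous backup.
    """
    return sync_interval + backup_duration + transfer_duration


def planned_downtime_fraction(sync_interval, backup_duration):
    """Share of time the primary is offline for backups (assumed offline)."""
    return backup_duration / sync_interval


if __name__ == "__main__":
    daily = timedelta(days=1)
    print(worst_case_data_loss(daily, timedelta(hours=2), timedelta(hours=1)))
    print(f"{planned_downtime_fraction(daily, timedelta(hours=2)):.1%}")
```

Under these assumptions, syncing more often shrinks the possible data loss but increases the share of time the primary site spends in planned outages, which is the balance each site has to strike.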

And be sure to plan for the migration of DR-hosted data back to the primary site after the disaster event is over.

End users in the recovery process

When a disaster recovery process is executed, we can expect that SAS solution services and data will effectively reside in a new location. The end users will need this new location conveyed to them in some manner. Consider client access to the end points provided by software components such as:

SAS 9.4 Metadata Server
SAS 9.4 Web Server
SAS 9.4 Object Spawner
SAS 9.4 Grid Manager
SAS Viya web-based applications
SAS Viya Cloud Analytic Services

The preferred approach is to implement DNS load balancing (or other equivalent) from the very first day of SAS software deployment. Done correctly with proper SAS configuration, end users can be automatically routed to the DR site hosting SAS services with no changes to their own workflow. Alternatively, it's not uncommon to simply notify users of the relevant hostname changes so that they're responsible for directing their SAS clients to the new location.
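To make the routing idea concrete, here is a minimal Python sketch of the decision a DNS load balancer or health-check job might make when choosing which site a shared service name should point to. The hostnames, the port, and both functions are hypothetical, and a successful TCP connection is only a crude stand-in for a real service health check. Whether failover should happen automatically at all is part of agreeing on what constitutes a disaster.

```python
import socket

# Hypothetical site hostnames in priority order: primary first, then DR.
SITES = ["sas-primary.example.com", "sas-dr.example.com"]


def tcp_probe(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    This only shows the port is reachable, not that the services behind it
    are fully operational.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def select_target(sites, is_healthy=tcp_probe):
    """Return the first healthy hostname in priority order, or None."""
    for host in sites:
        if is_healthy(host):
            return host
    return None
```

If users and SAS clients are configured with a single service name from day one, only the record behind that name has to change when the chosen target changes.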

In conclusion

SAS software doesn't offer disaster recovery as an out-of-the-box option. But with careful planning and implementation, SAS can integrate into a disaster recovery process, bringing the data and services online so that the customer can resume business operations with minimized interruption.

SAS publishes our Disaster Recovery Policy online for easy reference.