
Disaster Recovery for SAS Viya on Kubernetes

Started ‎06-11-2021 by
Modified ‎06-11-2021 by

Hosting SAS Viya in the cloud will soon become more common than hosting it on bare OS, if it is not already. Some cloud providers and their partners offer tools and services to help you implement disaster recovery (DR) capabilities.

 

In addition, SAS Viya 2021.1 and later runs on Kubernetes, which introduces new considerations for DR beyond those we had for SAS Viya 3.x on bare OS. SAS Viya on Kubernetes is not significantly more or less vulnerable to disaster than any previous release of SAS, but these two developments (cloud hosting and Kubernetes) merit an update on considerations for DR in SAS Viya.

 

The slide below outlines considerations that are specific to cloud hosting and to Kubernetes:

 

DR-1_Chapter-Slide-11-1-1024x576.png

 

 

SAS Viya on Kubernetes has specific technical requirements for disaster recovery. Your DR site's Kubernetes environment must have the same SAS Viya products, at the same version of SAS Viya, as your production site. The documentation says you must also use the same namespace name in the DR cluster, but we think this is more of a preference than a requirement, for two reasons. First, it is quite straightforward to restore a Viya 4 backup to a namespace with a different name. Second, namespace names and DNS aliases are easily decoupled: while it may be common practice to include your Viya deployment's namespace name (production, test, gelcorp etc.) in its DNS alias, it is certainly not a requirement.

 

The DR deployment must also run on the same version of Kubernetes, in the same cloud infrastructure, as production. Failing over from one Azure AKS cluster to another Azure AKS cluster is okay, but we do not support failing over from an Azure AKS cluster to an Amazon EKS cluster as part of a DR process.
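
The requirements above lend themselves to an automated pre-flight check. The following is only a minimal sketch in Python, not a SAS-provided tool: the `Deployment` class, its field names and the example values are all hypothetical, and in practice you would have to populate them from your own cluster and deployment metadata. It treats a namespace difference as a warning rather than an error, in line with the view expressed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical summary of one SAS Viya deployment (illustrative fields only)."""
    cloud_provider: str      # e.g. "azure-aks" or "aws-eks"
    kubernetes_version: str  # version string reported by the cluster
    viya_version: str        # SAS Viya release, e.g. "2021.1"
    products: frozenset      # names of the deployed SAS Viya products
    namespace: str           # Kubernetes namespace of the deployment


def dr_mismatches(prod, dr):
    """Compare production and DR; return (errors, warnings) as lists of strings."""
    errors = []
    if prod.cloud_provider != dr.cloud_provider:
        errors.append(
            f"cloud provider differs: {prod.cloud_provider} vs {dr.cloud_provider}"
        )
    if prod.kubernetes_version != dr.kubernetes_version:
        errors.append(
            f"Kubernetes version differs: {prod.kubernetes_version} vs {dr.kubernetes_version}"
        )
    if prod.viya_version != dr.viya_version:
        errors.append(
            f"SAS Viya version differs: {prod.viya_version} vs {dr.viya_version}"
        )
    if prod.products != dr.products:
        errors.append("SAS Viya product sets differ")

    # The documentation asks for the same namespace name; this sketch only warns.
    warnings = []
    if prod.namespace != dr.namespace:
        warnings.append(f"namespace differs: {prod.namespace} vs {dr.namespace}")
    return errors, warnings


# Example with made-up values: an AKS production site and an EKS DR site.
prod = Deployment("azure-aks", "1.19", "2021.1",
                  frozenset({"SAS Visual Analytics"}), "production")
dr = Deployment("aws-eks", "1.19", "2021.1",
                frozenset({"SAS Visual Analytics"}), "dr")
errors, warnings = dr_mismatches(prod, dr)
```

Running a check like this as part of every DR test would catch version drift between the two sites before a real failover depends on it.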

 

Disaster Recovery for SAS Viya on Azure Kubernetes Services (AKS)

 

DR-2_Chapter-Slide-12-1024x576.png

 

 

 

Microsoft Azure Site Recovery's best practices for DR in AKS recommend using their DNS-based load balancer Azure Traffic Manager to route traffic to either your primary or secondary AKS cluster, hosted in different Azure regions. It can interconnect two clusters to enable communication between them for data replication. It also has features for replicating container images between the Azure Container Registry in each Azure region where you have an AKS cluster, and the best practice guide above discusses several aspects of replicating storage. But it remains the technical architect's responsibility to figure out how to replicate SAS Viya state data between clusters in different regions in a way that will satisfy the customer's desired RPO and RTO. More on this below.
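
To make the routing idea concrete, here is a small Python sketch of priority-based endpoint selection, which is the general behaviour a DNS-based load balancer provides when configured for failover: send traffic to the most-preferred endpoint that is currently healthy. This is a simplified model for illustration only, not Azure Traffic Manager's API. The `Endpoint` class, its fields and the endpoint names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical model of one routable cluster endpoint."""
    name: str
    priority: int  # lower number means more preferred
    healthy: bool  # result of the most recent health probe


def select_endpoint(endpoints) -> Optional[Endpoint]:
    """Return the healthy endpoint with the lowest priority number, or None."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.priority)


# Example with made-up names: the primary region has failed its health probe.
endpoints = [
    Endpoint("viya-primary-region", priority=1, healthy=False),
    Endpoint("viya-secondary-region", priority=2, healthy=True),
]
target = select_endpoint(endpoints)
```

Note that this only covers where traffic goes. It says nothing about whether the secondary cluster has the data it needs, which is the harder part discussed in the rest of this post.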

 

Start-up Arpio (see below) is also developing a capability for Azure, not yet released.  

 

Disaster Recovery for SAS Viya on Amazon Elastic Kubernetes Services (EKS)

 

Arpio



 

DR-3_Chapter-Slide-13-1-1024x576.png

 

 

Arpio is a start-up based in Durham, NC (near Cary) that offers security-conscious replication of a production AWS environment in one region to a second 'recovery' region. SAS Cloud is moving towards using Arpio for DR capability, but at this point it's fairly new to me. See arpio.io/how-it-works/ for their marketing material.

 

AWS CloudEndure

 

DR-4_Chapter-Slide-13-1024x576.png

 

 

CloudEndure is the AWS DR offering relevant to EKS-hosted SAS Viya deployments. Documentation on using CloudEndure with EKS clusters seems to be light, but the general principles are the same as in Azure: what needs to be replicated, and how traffic is directed to whichever region is currently hosting your services.

 

The unique selling point for CloudEndure seems to be cost reduction: it uses low-cost staging machines on the DR site while the production site is healthy. In the event of a disaster in production, it scales up the machines on the DR site to full size, ready for business use.

 

However, CloudEndure may not meet the security requirements of some of our more security-conscious customers.

 

Google Cloud Platform and OpenShift

We have not so far identified similar services specifically developed to support Disaster Recovery for Google Cloud Platform or OpenShift. If you know of any, please tell me, as we'd like to cover them in our content.  

 

Further reading on SAS Disaster Recovery

In addition to the Disaster Recovery practices outlined in the SAS Viya documentation, there is a SAS Disaster Recovery Policy for SAS Viya 3.4. There is not yet a SAS Disaster Recovery Policy specifically for SAS Viya on Kubernetes. Rob Collum's SAS Communities post, Contemplating disaster recovery for SAS, from 19 July 2018 is also well worth reading.

 

Further considerations for DR

 

General

The resources listed above aim to explain:

  • That disaster recovery is not a feature of SAS software; it is a capability you must design, create and maintain
  • How to define a disaster in practical, measurable, objective terms which could potentially be instrumented for automated decision making
  • That resilience to disasters is based on eliminating single points of failure. Hosting on a single site creates such a single point of failure, so you need two sites. However, network latency means that deployments which span sites will not work, so Prod/DR must be active/passive and synchronised
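
As an illustration of defining a disaster in measurable terms, the Python sketch below declares a disaster when health probes have failed continuously for at least an agreed threshold. It is one possible rule among many, not a recommendation from the resources above. The `Probe` class, the function names and the 30-minute threshold are all assumptions made for the example; the real criteria must be agreed with your business stakeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Probe:
    """Hypothetical result of one health check against the production site."""
    at: datetime
    ok: bool


def outage_duration(probes, now):
    """Length of the current unbroken run of failed probes, measured up to now.

    probes must be sorted oldest first. Returns zero if the latest probe passed
    or if there are no probes at all.
    """
    start = None
    for probe in reversed(probes):
        if probe.ok:
            break
        start = probe.at  # keep walking back to the first failure of the run
    return now - start if start is not None else timedelta(0)


def is_disaster(probes, now, threshold):
    """True when production has been failing for at least the agreed threshold."""
    return outage_duration(probes, now) >= threshold


# Example with made-up timestamps and an assumed 30-minute threshold.
t0 = datetime(2021, 6, 11, 9, 0)
probes = [
    Probe(t0, ok=True),
    Probe(t0 + timedelta(minutes=5), ok=False),
    Probe(t0 + timedelta(minutes=10), ok=False),
]
declared = is_disaster(probes, now=t0 + timedelta(minutes=40),
                       threshold=timedelta(minutes=30))
```

A rule this simple can misfire, for example when the monitoring system itself is down, so many teams would use it to prompt a human decision rather than to trigger failover automatically.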

They also discuss how important it is that you agree with your business stakeholders:

  • What the risks and potential costs of a disaster impacting SAS services are to the business
  • If and how you will design your services to mitigate those risks so far as is practically possible e.g. through eliminating single points of failure for both storage and compute, load balancing and horizontal scaling for high availability as well as for performance
  • What you would do in the event of a disaster, in order to provide business continuity
  • What you would do following the disaster, to achieve full recovery to normal business operations
  • How you will keep a production site and a DR site in synchronization
  • How long a period of work and data loss can be tolerated
  • How quickly you must recover from a disaster to ensure business continuity, and what implication this has for automating the failover process to reduce delays
  • How often you will test your DR processes, and how much prior warning you and the business will get of a test. Should you consider something as extreme as Netflix's Chaos Monkey?
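
Two of the questions above, how much data loss can be tolerated (the recovery point objective, RPO) and how quickly you must recover (the recovery time objective, RTO), can be sanity-checked with simple arithmetic. The Python sketch below assumes a periodic backup that is then copied to the DR site, so the worst-case data loss is roughly one backup interval plus the time the copy takes. The function names, failover step names and all durations are illustrative assumptions, not measurements from any real deployment.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval, copy_time):
    """Worst case: disaster strikes just before the next backup reaches DR."""
    return backup_interval + copy_time


def estimated_recovery_time(steps):
    """Sum the durations of the failover steps, assuming they run one after another."""
    return sum(steps.values(), timedelta(0))


# Illustrative figures only.
rpo = timedelta(hours=4)
rto = timedelta(hours=2)

loss = worst_case_data_loss(backup_interval=timedelta(hours=24),
                            copy_time=timedelta(hours=1))
recovery = estimated_recovery_time({
    "detect and declare disaster": timedelta(minutes=30),
    "restore latest backup at DR": timedelta(minutes=60),
    "switch traffic to DR": timedelta(minutes=10),
    "validate services": timedelta(minutes=15),
})

meets_rpo = loss <= rpo      # a daily backup cannot satisfy a 4-hour RPO
meets_rto = recovery <= rto  # 115 minutes of sequential steps fits in 2 hours
```

Even a rough model like this makes the trade-offs visible: with these numbers, no amount of failover automation fixes the RPO gap, which can only be closed by backing up or replicating more often.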

Hosting SAS Viya on Kubernetes is significantly more complex than hosting SAS Viya on bare OS. Backing up and replicating to another environment is correspondingly more complex too.

Data

Survey the data in YOUR SAS Viya environment which needs to be synchronised or replicated (e.g. by being periodically backed up and copied). These items are present in all SAS Viya deployments:

  • SAS Viya deployment directory (containing kustomization.yaml and site.yaml) and overlays
  • SAS Infrastructure Data Server (pg_dump)
  • SAS Configuration Server content
  • CAS permstore (it should not be necessary to back up CAS configuration files separately in Viya 2020.1 and later, since they are re-created at each CAS server startup)
  • Optional but recommended: log and metrics data to assist investigations into the circumstances of a disaster

Here are some examples of data which may be present in some deployments:

  • Kubernetes Persistent Volumes (PVs) used for mounting filesystem-based user data (e.g. NFS shares, user home directories), PVs for block storage, source data etc. If SAS ingests data from upstream data sources which cannot be obtained again, consider whether you should synchronise or replicate that to DR, or whether you need to keep upstream data available for a time in case the SAS Viya system has an outage and loses some data it pulled. If the upstream systems can re-supply data, it may be unnecessary.
  • Administration and operation scripts for the cluster
  • Mirror Docker Image repos and other deployment assets for all deployed software to allow (re-)deployment of the same release of software in both PROD and DR
  • Git repos used for e.g. CI/CD/DevOps tasks
  • Customizations to third-party logging, monitoring and alerting tools such as saved searches, dashboards, custom metrics, custom alerts, if not already stored in e.g. a git repo that is backed up and/or replicated
  • Any other customer or project data/scripts/assets stored with or in the production environment that aren't covered by the categories above

Each separate data type will usually require its own element of your overall approach to data synchronisation from production to DR, and you likely need to sync them all. Plan for the capability to synchronize data in both directions: following a disaster and failover to your DR site, that site temporarily hosts real business activity. When your main production site is back up and available again, you must be able to synchronize all these types of data from your DR site back to your main production site, and perform what is sometimes called a 'fail back' but better described as a switch of services back from DR to production.
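
One way to keep track of this is a simple inventory that records, for each data type, how it is synchronised and in which direction. The Python sketch below is a hypothetical illustration: the `SyncItem` class, the item names and the method descriptions are placeholders for whatever your own survey produces. It flags data types with no agreed method and derives the reversed plan needed for the switch back from DR to production.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SyncItem:
    """Hypothetical inventory entry for one type of data to keep in sync."""
    name: str
    method: Optional[str]  # None means no synchronisation method is agreed yet
    source: str = "prod"
    target: str = "dr"


def uncovered(items):
    """Names of data types that still have no synchronisation method."""
    return [item.name for item in items if item.method is None]


def reversed_plan(items):
    """The same plan with source and target swapped, for switching back to production."""
    return [replace(item, source=item.target, target=item.source) for item in items]


# Example inventory with placeholder methods.
plan = [
    SyncItem("deployment directory and overlays", "git repository replicated to DR"),
    SyncItem("SAS Infrastructure Data Server", "scheduled backup copied to DR"),
    SyncItem("CAS permstore", "scheduled backup copied to DR"),
    SyncItem("user home directories (PVs)", None),
]
gaps = uncovered(plan)
fail_back = reversed_plan(plan)
```

Swapping source and target is the easy part on paper. Each method still has to be proven to work in the reverse direction, which is a good reason to include the switch back to production in your DR tests.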

 

 

Please share your experience of DR for SAS Viya with the community

I have found fewer resources, case studies and examples of partially or wholly successful implementations of disaster recovery capabilities than I hoped to find, given how important we know DR is for many of our customers.

 

If you have designed, implemented or operated a DR capability for a SAS Viya deployment I would very much like to hear from you. There is no substitute for real-world experience, and we want to hear about yours, good or bad. Please leave a comment below, or email me at David.Stern@sas.com to share your stories, designs or case studies on this topic. Thank you in anticipation.  

 

My thanks to Peter Muirhead, Gerry Nelson, Scott McCauley and Rob Collum for their review of earlier drafts of this post, and for their helpful feedback. Any errors or shortcomings are mine.

