
Scalavailability of CAS


Yeah, I made that word up. As you've probably already surmised, it's a portmanteau of "scalability" and "availability". Both are frequent topics of interest when it comes to the architecture and deployment of SAS Cloud Analytic Services. And like most things SAS, we have a wide variety of options for how CAS is deployed and operated.

 

CAS has its roots in earlier highly scalable in-memory analytic engines. It's optimized to run fast in a wide range of scenarios and to scale up and out to tackle the largest analytic problems at incredible speed. Beyond scalability, it also offers a number of features that improve the availability of data and service over its predecessors.

 

Where do scalability and availability of CAS overlap and reinforce each other? Let's talk about it.

Foundation Concepts

Here's a quick rundown of the CAS features relevant to this post:

 

Scalability:

  • Two modes of operation: SMP (single machine) and MPP (multiple machines providing a single service)
  • The minimal MPP configuration is 1 Controller and 1 Worker - but that's inefficient and should be avoided (use SMP instead). The real-world minimum is 1 Controller and 2 Workers.
  • Add more Workers at any time without interrupting the system. Converting from SMP to MPP requires an outage.

Availability:

  • Blocks of data can be copied amongst the Workers to protect against the loss of a node (more copies = tolerance for the loss of more nodes)
  • A second CAS Controller can be enabled to improve availability - but it's not a full peer, and recovery to full service status requires an outage.

Now, there are caveats and provisos to many of these aspects of CAS in certain situations… but in broad strokes, they convey the attributes we want to look at.

Costs

We should never lose sight of costs. When it comes to scalability, the costs are often easier to justify. If a particular analytics problem takes too much time, needs more resources, or is otherwise too large, there's a good chance it can be corrected by throwing more money (that is, more CPU and RAM) at it for CAS to work with. Getting necessary work done is often a sufficient justification for increased cost.

 

Availability is sometimes harder to justify cost-wise. Effectively, it's an insurance policy: you hope you won't need it and you don't want to pay for it, but at some point you'll be glad you did (or already have been).

 

Remember above where we said that we can keep extra copies of data in CAS to help survive the loss of a Worker? Where does that data actually go? On local disk of each CAS Worker. Which means each CAS Worker must have local disk to use for that purpose (not always a given in today's cloud-provisioned infrastructure). More copies = more protection for more Workers = more disk space = more cost.

 

And then there's the Secondary CAS Controller. When it's operating in its secondary role, it doesn't do much beyond acknowledging the activities of the primary Controller, ready to take over when needed. From that perspective, it's an ongoing cost with a possible future benefit that might never be directly realized. Cloud costs for services add up by the minute (or even the second), so when does it make sense to stand up a Secondary Controller for CAS? Usually the decision hinges on the cost of lost work, which is often correlated with the size of the CAS cluster (i.e., bigger clusters do more work for more people).

Scalavailability

So I've been kicking around this idea of scalavailability for CAS for a while. I want a simple set of rules I can follow as a happy path to achieve the right balance of scalability and availability where they're most likely to complement each other.

 

We can deploy CAS almost any way we can think of - including in configurations we shouldn't use because they're inefficient, unwieldy, contradictory to its design, and so on. Of course we want to avoid those, but what about others that, while technically correct, don't quite satisfy the idea of scalavailability?

 

My premise here is that as the number of nodes in the CAS cluster increases, the loss of its service has a greater impact - and we should have a standard approach to account for that.

Scalavailability of the CAS Controller

When should you employ a Secondary CAS Controller? It's optional, after all. You could just ignore it and never include one. Or you might decide to include it in all of your CAS deployments.

 

For CAS, my scalavailability rule is this:

 

(number of CAS Controllers) < (number of CAS Workers) / 2

 

Let's apply this rule to CAS deployed in scenarios for 1 - n hosts:

 

Hosts | Controller(s) | Worker(s) | Mode | Recommend? | Reason
------+---------------+-----------+------+------------+-----------------------------
  1   |       -       |     -     | SMP  |    YES     | For smaller implementations
  2   |       1       |     1     | MPP  |    NO      | Inefficient - use SMP instead
  3   |       1       |     2     | MPP  |    YES     |
  4   |       1       |     3     | MPP  |    YES     |
  5   |       1       |     4     | MPP  |    YES     |
  6   |       1       |     5     | MPP  |    YES     | Borderline - but 2 Controllers with 4 Workers would be "equal to", not "less than", per the proposed scalavailability rule
  7   |       2       |     5     | MPP  |    YES     |
  n   |       2       |   n - 2   | MPP  |    YES     |
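
To see the rule as code instead of a table, here's a minimal sketch in Python (purely illustrative, not any official SAS sizing tool) that applies the Controller rule to a given number of hosts:

def recommend_controllers(hosts: int) -> dict:
    """Apply the proposed rule: (number of Controllers) < (number of Workers) / 2.
    Illustrative only - adapt the thresholds to your own requirements."""
    if hosts <= 1:
        # A single host runs SMP CAS (no separate Workers).
        return {"mode": "SMP", "controllers": 1, "workers": 0}
    if hosts == 2:
        # 1 Controller + 1 Worker MPP is inefficient; prefer SMP on one host instead.
        return {"mode": "SMP (instead of 1+1 MPP)", "controllers": 1, "workers": 0}
    # Default to a single Controller; add a Secondary only if the rule still holds.
    controllers, workers = 1, hosts - 1
    if 2 < (hosts - 2) / 2:  # would 2 Controllers and (hosts - 2) Workers satisfy the rule?
        controllers, workers = 2, hosts - 2
    return {"mode": "MPP", "controllers": controllers, "workers": workers}

for n in range(1, 9):
    print(n, "hosts ->", recommend_controllers(n))

Run it for 1 through 8 hosts and you get the same recommendations as the table above.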

 

So, based on this rule, when you approach the point of running 5 Workers, that's when you should consider adding a Secondary CAS Controller - assuming availability concerns warrant it and the budget allows.

 

Of course you can come up with your own calculation. Think of this one as a line in the sand. Now that you know where it is, you can decide which side you want to be on.

 

And of course, there will be circumstances where the need for a Secondary CAS Controller (or the lack of any need) is obvious regardless of the actual number of Workers. Accommodate that in your approach as well.

Scalavailability of Data in CAS

Did you know that the more Workers you add to MPP CAS, the greater the likelihood that one of them will fail? Each Worker has some small chance of failing due to hardware, software, or other issues, and each additional Worker multiplies that chance. This isn't just a CAS thing… it's a known consideration for RAID arrays and why a large number of disks in a RAID-0 striped array isn't really a good idea if the data kept there is important, even for a limited lifespan. *looks sideways at SASWORK*

 

Data in CAS is distributed across the Workers in blocks. If there's just one instance of the data, and one of your CAS Workers goes offline (taking its allotment of data with it), then you no longer have a complete data set.

 

The COPIES= parameter directs CAS to make copies of blocks of data. With COPIES=1, there's the original in-memory (hopefully) instance of the data plus one more copy stashed away in the CAS disk cache. The copy is distributed such that a complete instance of the data will be available if one Worker goes offline. Increase the number of copies and CAS can tolerate the loss of more Workers.
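
To put some rough numbers on that, here's a small back-of-the-envelope sketch in Python. The table size, Worker count, and COPIES= value are hypothetical, and the arithmetic assumes that data blocks and their copies spread evenly across the Workers, which is an idealization:

def copies_overhead(table_gb: float, workers: int, copies: int) -> dict:
    """Rough, idealized estimate of what COPIES= costs and buys you.
    Assumes the table and its redundant block copies distribute evenly across Workers."""
    return {
        "in_memory_per_worker_gb": round(table_gb / workers, 2),
        "disk_cache_per_worker_gb": round(copies * table_gb / workers, 2),
        "total_extra_disk_gb": copies * table_gb,
        "worker_losses_tolerated": copies,
    }

# Hypothetical example: a 500 GB table on 10 Workers with COPIES=1
print(copies_overhead(table_gb=500, workers=10, copies=1))
# -> each Worker holds ~50 GB in memory plus ~50 GB of copies in its disk cache,
#    500 GB of extra disk across the cluster, and the loss of 1 Worker is tolerated

The exact distribution of blocks is up to CAS, but this is the general shape of the trade-off: each additional copy costs roughly one more table's worth of disk space spread across the Workers.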

 

But really, how likely are we to deal with the loss of multiple Workers? Well, remember that we're not just talking about CAS Worker containers in their pods in a Kubernetes cluster, but also about those pods running in virtual machine instances that reside on physical computer systems racked together with power supplies, storage systems, network hubs, etc. in a data center. A failure at any of those points could take out one or more CAS Workers. But don't freak out yet.

 

Generally, the default is COPIES=1 when CAS is working with non-native data from remote sources (or COPIES=0 for memory-mappable, unencrypted SASHDAT). And in the real world, for the most part, COPIES=1 is very sustainable, adding relatively low overhead and providing a huge benefit. It's a good default regardless of how many CAS Workers you're running.

 

Even so, for data in CAS, I do have a scalavailability rule I'm considering:

 

copies = ROUNDUP [ ( number of CAS Workers ) / 10 ]

 

Simply put, for every 10 Workers, increase the number of copies by 1. This should keep the overhead of storing additional copies of data acceptably low while still providing improved availability for larger CAS deployments.

 

Workers | Mode | COPIES=
--------+------+--------
   1    | SMP  |    0
  2-10  | MPP  |    1
 11-20  | MPP  |    2
 21-30  | MPP  |    3
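
In code, the data rule is just a ceiling division - a minimal Python sketch (again, illustrative only):

import math

def recommended_copies(workers: int) -> int:
    """Suggested COPIES= value: one copy per (up to) 10 MPP CAS Workers.
    SMP (a single node) gets 0. Adjust to your own requirements."""
    if workers <= 1:
        return 0
    return math.ceil(workers / 10)

for w in (1, 2, 10, 11, 20, 21, 30):
    print(f"{w:>2} Workers -> COPIES={recommended_copies(w)}")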

 

Like before, your mileage may vary here. It's certainly possible to have scenarios where you want zero copies regardless of how many CAS Workers there are, or where you need more copies. The point is to start somewhere: consider the possibility along with your customer's requirements, then act accordingly.

Coda

When do we reach the tipping point that drives us to make a decision in one direction or another? When it comes to the architecture and deployment of SAS Viya, there are many decisions - and often many such tipping points. The simple rules I've illustrated here are meant to be examples, not to be taken as The Right and Only Way. But I often find that codifying common scenarios with rules like these can help navigate the complex set of interdependent decisions necessary to achieve success with SAS Viya. I hope this works for you, too.

Comments

Very useful article, thank you for sharing it. 
