SAS Cloud Analytic Services (CAS) is designed from the ground up to harness the power of many resources over multiple machines to scale and attack the largest analytic challenges. But scaling CAS is not (yet) something that can happen elastically (that is, automatically). The starting point is to pick a size and shape for your CAS server deployment and get things running. From there, monitor the environment and adjust the resources for CAS accordingly. In particular, we can choose to scale CAS up (by adding more CPU and RAM to the existing host machines) or out (by adding more machines that host the CAS server). Either way, there are follow-on choices and further implications to consider.
Typically we see SAS Viya running in a managed Kubernetes cluster hosted by an infrastructure provider. SAS offers optional tools to help achieve this objective. Two in particular are relevant for this post:
The Infrastructure as Code (IAC) project is intended to guide our customers in provisioning hardware appropriate for SAS Viya. Its use is optional but can be very helpful for getting started quickly with demos, proofs of concept, initial test environments, etc.
The IAC project is available on GitHub in four flavors, depending on a site's infrastructure preference: AWS, Azure, GCP, or open-source Kubernetes.
The Deployment as Code (DAC) project is another offering from the IAC team. It is intended to guide customers through installing SAS Viya. Its use is also optional, but it too is very helpful for getting started quickly.
The DAC project, viya4-deployment, is also available on GitHub.
If you have the IAC and DAC tools in your environment already, then they can streamline the configuration changes discussed here. Alternatively, you could use tools native to your site's infrastructure provider.
The configuration of SAS Viya is extremely flexible to support a variety of infrastructure requirements. And the IAC/DAC projects surface many of those configuration options. By default, an out-of-the-box deployment of SAS Viya might look similar to the following illustration:
For this post, we're primarily interested in the far-left side of the illustration, where CAS is shown. This picture shows CAS deployed in MPP mode (one logical server running across multiple hosts). The CAS pods are labeled for the "cas" workload, and they correspond to a node pool that's been set aside, labeled (and tainted) for "cas" as well.
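If you'd like to verify this in your own cluster, a quick check with kubectl might look like the sketch below, assuming the default workload.sas.com/class=cas label and taint that a standard SAS Viya deployment uses for the CAS node pool:

```bash
# List the nodes set aside for CAS by the workload class label
kubectl get nodes -l workload.sas.com/class=cas

# Show each CAS node alongside its taint keys to confirm the "cas" taint
kubectl get nodes -l workload.sas.com/class=cas \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'
```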
The colorization of the CAS nodes is meant to visualize the min, max, and desired values for the number of active nodes. These values are part of the auto scaling group definition and are used by the cluster autoscaler to bring up (or take down) nodes on demand. In this case, four nodes are green (desired=4) to reflect that we configured the installation of MPP CAS to run with one controller and three workers.
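To see those min, max, and desired values outside of Kubernetes, you can query the auto scaling group directly. Here's a minimal sketch assuming AWS and a group name that contains "cas" (the actual name varies by how the cluster was provisioned):

```bash
# Report min, max, and desired node counts for the CAS auto scaling group
aws autoscaling describe-auto-scaling-groups \
  --query 'AutoScalingGroups[?contains(AutoScalingGroupName, `cas`)].{Name:AutoScalingGroupName,Min:MinSize,Max:MaxSize,Desired:DesiredCapacity}' \
  --output table
```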
As a reminder, the default approach is for each CAS node to run just one CAS pod. That's because CAS is a high-performance, in-memory analytics engine that consumes substantial CPU and RAM to return results as fast as possible.
And finally, this illustration only shows a portion of what your SAS Viya deployment might look like when hosted in a managed cloud environment. Many other components are not shown here. And chances are your site will vary from this simplified example in several ways.
Scaling up means providing more CPU and RAM on the host - that is, running on larger machines with a higher number of CPU cores and more RAM. Accomplishing this in a managed cloud environment means changing the instance type for the CAS node pool, which requires terminating the current hosts and replacing them with new ones of the desired instance type. In other words, scaling up by changing the instance type will necessitate an outage of SAS Viya.
Scaling out occurs when we add more host machines to process the workload. This can be accomplished with very little impact to end users, as it typically means changing the desired number of nodes to a larger value (and the max as well, if needed). While there's no service outage for SAS Viya, it can take several minutes for the new node to provision, containers to download, pods to spin up, and so on.
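For example, after bumping the desired node count, you can watch the new CAS node and worker pod come online. This sketch assumes the default workload.sas.com/class=cas node label, the casoperator.sas.com/server=default pod label, and a SAS Viya namespace named viya (adjust for your site):

```bash
# Watch for the new CAS node to register and become Ready
kubectl get nodes -l workload.sas.com/class=cas --watch

# Watch the new CAS worker pod get scheduled and start up
kubectl -n viya get pods -l casoperator.sas.com/server=default -o wide --watch
```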
For an MPP deployment of the CAS server, both scaling up and scaling out are acceptable. We can add more workers to the CAS server and/or increase the size of each of the CAS server hosts.
Determining the exact instance type and number of nodes suitable for your site's workload is beyond the scope of this blog post. For official sizing guidance for SAS Viya deployments, we recommend working with the World-Wide Sizings Team. To do this well, the sizing effort is necessarily a collaboration between SAS and the customer. The Sizings Team can help by asking and answering some critical questions as well as working with the customer to set expectations and devise a plan of attack to ensure scalability goals are met. For customers looking for help with getting the right hardware to run their SAS Viya workloads, your SAS account representative can begin an engagement with the World-Wide Sizings Team.
For the purpose of this post, we're starting with MPP CAS with three workers, running on instance type m5d.2xlarge (8 vCPU and 32 GB of RAM) in AWS.
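As a quick sanity check on those specs, the AWS CLI can report the vCPU and memory of an instance type:

```bash
# Confirm vCPU and memory for the m5d.2xlarge instance type
aws ec2 describe-instance-types --instance-types m5d.2xlarge \
  --query 'InstanceTypes[].{vCPUs:VCpuInfo.DefaultVCpus,MemoryMiB:MemoryInfo.SizeInMiB}' \
  --output table
```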
Adding more workers to the CAS server can be done without an interruption in service of SAS Viya. This means more analytic computation and data capacity can be rolled out for better overall performance in a live environment.
We can use the DAC project to change the number of CAS workers. It's really easy:
First, in the DAC configuration file (ansible-vars.yaml), set the desired worker count:

```yaml
## Update MPP worker count to 4
V4_CFG_CAS_WORKER_COUNT: '4'
```

Then run the deployment tool to apply the change:

```bash
# Note: the 'baseline' tag is not needed when updating SAS Viya config.
# Also, in my environment "viya4-deployment" is an alias referencing the
# ansible utility with the necessary parameters to run as a container.
$ viya4-deployment --tags "viya,install"
```
Note: if a new node for the CAS worker is not initialized, confirm that your auto scaling group hasn't reached its max value for node count and that the cluster autoscaler is running and behaving properly.
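A couple of quick checks can help here. The exact name of the cluster autoscaler deployment varies by installation method, so treat this as a sketch (cluster-autoscaler in kube-system is a common default):

```bash
# Locate the cluster autoscaler (the deployment name varies by install)
kubectl -n kube-system get deployments | grep -i autoscaler

# Review its recent scaling decisions (assumes the common default name)
kubectl -n kube-system logs deployment/cluster-autoscaler --tail=50
```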
After the change is applied, connecting to CAS from a SAS program (the cas; statement) confirms the session now uses the additional worker:

```
80   cas;
NOTE: The session CASAUTO connected successfully to Cloud Analytic Services sas-cas-server-default-client using port 5570. The UUID is e235fe0f-2b68-0548-bb8b-827242e2f484.
NOTE: The SAS option SESSREF was updated with the value CASAUTO.
NOTE: The SAS macro _SESSREF_ was updated with the value CASAUTO.
NOTE: The session is using 4 workers.
```
Important to note:
Alternative: If your site is deployed or maintained without relying on the DAC, then SAS documentation explains how to manually change the number of workers for MPP CAS.
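For reference, the CAS operator tracks the worker count on the CASDeployment custom resource. A read-only peek at the current value might look like this sketch, assuming a namespace named viya and the default CAS server (adjust for your site):

```bash
# Inspect the current MPP worker count on the CASDeployment resource
# (namespace "viya" is hypothetical; adjust for your environment)
kubectl -n viya get casdeployment default -o jsonpath='{.spec.workers}'
```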
The upcoming Part 2 of this post will discuss scaling up instead of (or in addition to) scaling out, where we'll show how the IAC project can change the instance type of the CAS host machines.