
Exploring Kubernetes Autoscalers for real-time SAS environments


I have recently had a number of conversations around ‘autoscalers’: Cluster Autoscalers and Horizontal Pod Autoscalers (HPAs). There seems to be some misunderstanding about how these are used, so I thought it would be a good time to think about them and what is supported with SAS Viya.

 

In this post we will discuss the difference between Cluster Autoscalers and Horizontal Pod Autoscalers. I will also look at what is required to define an HPA and discuss an example of using an HPA for SAS Micro Analytic Service.

 

But first, some definitions. A Cluster Autoscaler will automatically adjust (grow and shrink) the size of the Kubernetes cluster (the number of underlying nodes) depending on the following conditions:

  • There are pods that fail to run due to insufficient resources (this does not necessarily mean that all nodes are maxed out, as pod scheduling is controlled by many factors; a quick way to check for this is shown after the list).
  • There are nodes in the cluster that have been underutilized for a defined period, and their pods could be placed on other nodes that meet the scheduling criteria.
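
A quick way to check for the first condition is to look for Pending pods. For example, using the same namespace as the commands later in this post:

kubectl get pods -n viya-namespace --field-selector=status.phase=Pending

Describing one of those pods will typically show a FailedScheduling event explaining why it could not be placed, and that unschedulable pod is the signal the Cluster Autoscaler acts on.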

 

Whereas the Horizontal Pod Autoscalers, as the name suggests, apply to the Kubernetes (K8s) pods. In Kubernetes, a HorizontalPodAutoscaler automatically updates a workload resource (such as a Deployment or StatefulSet), with the aim of automatically scaling the workload (pods) to match demand. It defines the conditions for scaling (up and down) the number of pod replicas.
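
As a simple illustration, the kubectl autoscale command creates a basic CPU-based HPA for a Deployment. The deployment name and namespace below are just placeholders, not a SAS Viya example:

kubectl autoscale deployment my-app --cpu-percent=60 --min=2 --max=6 -n my-namespace

The YAML definitions shown later in this post achieve the same thing, but give much finer control over the scaling behavior.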

 

So, the Cluster Autoscaler and Horizontal Pod Autoscaler are two independent features that do have a relationship when we think about the “elasticity” of the Kubernetes cluster, and the infrastructure costs (particularly if running on one of the cloud providers’ platforms). That is, the number of running pods and the HPA definitions can trigger the Cluster Autoscaler.

 

But what does this mean for SAS Viya?

 

If we look at the default deployment of SAS Viya, there is some redundancy, High Availability (HA) if you like, for the Stateful services (Consul, RabbitMQ, Postgres, Cache Locator and Server), with multiple pod replicas being configured for these services. By default, the configurations for CAS (SMP is the default) and OpenSearch are not deployed with redundancy.

 

However, all the Stateless services (including the web applications) have a single pod instance defined.

 

There is the Kubernetes transformer (enable-ha-transformer.yaml) that enables HA for the Stateless microservices. This provides two replicas for the Stateless microservice pods.
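
Enabling it is just a matter of adding the transformer to the transformers block of your kustomization.yaml. A minimal sketch, assuming the transformer sits under sas-bases/overlays/scaling/ha (check the README in your own sas-bases directory for the exact path in your cadence):

transformers:
- sas-bases/overlays/scaling/ha/enable-ha-transformer.yaml

After rebuilding and applying the manifest, the Stateless microservices should run with two replicas.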

 

However, at this point in time, the Viya deployment doesn’t support deploying the microservices using an HPA definition with different values for the ‘Min’ and ‘Max’ number of pod replicas. This is because we do not set (define) the Kubernetes HPA behaviors. More research is required on all our microservices to understand their behaviors before doing this.

 

The SAS documentation states the following “By default, the Horizontal Pod Autoscaler (HPA) setting for all services is set to a replica of 1. If you want to scale up your services or pods to more than 1 replica, then the default HPA setting should be modified.”

 

To help you understand the SAS Viya deployment, below are a couple of handy commands. For example, to get the summary information for an HPA, in this case for MAS (sas-microanalytic-score), you can use the following command:

 

kubectl get hpa sas-microanalytic-score -n viya-namespace


You will see output similar to the following.

 

MG_1_202208_kubectl_get_hpa.png


 

In the image you can see the ‘TARGETS’ field; it shows the current CPU utilization and the target utilization. You can also see that the MIN and MAX number of pods is set to 1, and there is only one MAS pod running.

 

To get more detailed information on the HPA you can use the ‘kubectl describe’ command. For example:

 

kubectl describe hpa sas-microanalytic-score -n viya-namespace

 

Below is the output for my SAS Viya deployment.

 

MG_2_202208_kubectl_describe_hpa.png

 

Here we can see that the CPU resource utilization is expressed as a ‘percentage of the pod requests’. Once again, we can see that the Min replicas and Max replicas are set to one (1).
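
To make that concrete, here is a hedged illustration (the request values below are made up, they are not the Viya defaults). If the sas-microanalytic-score container requested:

resources:
  requests:
    cpu: 750m
    memory: 2Gi

then a pod currently consuming 150m of CPU would be reported by the HPA as 20% utilization (150m / 750m). The percentages are always relative to the pod requests, not to the node capacity or to any limits.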

 

So, when might we use an HPA?

 

At this point I can hear you say, “but I thought you just told us not to use custom HPA definitions!”

 

Well yes, but there might be a limited number of scenarios where this is useful. For example, workloads running in SAS Micro Analytic Service (MAS) and SAS Event Stream Processing (ESP).

 

Let’s explore workloads running on SAS Micro Analytic Service. The key thing to remember here is that all the models and decision flows published to MAS (maslocal) run in the same pod. Unlike SAS Container Runtime, where there is only one model or decision per container image.

 

This affects the resources (CPU and memory) that the MAS (sas-microanalytic-score) pods need to run. The number of models and decision flows published will also affect the start-up time for the sas-microanalytic-score pods and the workload that they are handling.
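
If the metrics-server is running in your cluster, a quick way to see what the MAS pods are actually consuming is kubectl top. The label selector below is an assumption about how the pods are labelled in your deployment, so adjust it (or simply omit it) as needed:

kubectl top pods -n viya-namespace -l app=sas-microanalytic-score

Comparing this with the pod requests gives you a feel for how close the pods run to any utilization threshold you are considering for an HPA.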

 

Hence, this could be a good candidate for defining an HPA. Especially when we think about handling bursts of transactions.

 

But that might be too simplistic a view, as when the models and decision flows are embedded within ‘real-time’ business processes, high availability could be the primary driver, closely followed by latency (performance). Therefore, to meet the HA requirements you might deploy multiple MAS replicas and need multiple nodes for this workload. Remember, by default both MAS and ESP are defined as Stateless services, so they will run with all the other Stateless pods.

 

Which brings me back to the Cluster Autoscaler. Scaling the nodes is not instantaneous; it can take a few minutes to get a new node. This is another key concern when designing the Viya platform to support real-time processing.

 

Another consideration is that MAS is not a standalone service; the sas-microanalytic-score pod(s) are dependent on other SAS Viya services. Therefore, the MAS (or real-time) HA requirements can drive the need for an HA configuration for the SAS Viya environment.

 

Writing a HorizontalPodAutoscaler

In a former life before joining SAS, when modelling IT systems, we had a rule of thumb that burst traffic could be up to 20 times the average transaction rate. Think of your favorite retailer or airline making a “must have” offer that drives unprecedented demand.

 

The use of an HPA for MAS or ESP could be a good way to handle such peaks.

 

But this does drive the need for a deeper understanding of the application pods, including their resource requirements and how long they take to scale up and be ready.

 

You also need to decide on the metric (CPU or memory utilization) and the threshold that you will use to trigger the HPA. This is all defined in the HPA ‘target:’ spec definition. You should also set the scaling behaviors, which define the rules for scaling up and down and should be based on how long it takes for the pod to be ready to accept workload.

 

To put it simply, the HorizontalPodAutoscaler controller operates on the ratio between desired metric value and current metric value.
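
From the Kubernetes documentation, the desired replica count is calculated as:

desiredReplicas = ceil[ currentReplicas * ( currentMetricValue / desiredMetricValue ) ]

For example, if 2 replicas are averaging 90% CPU utilization against a 60% target, the controller wants ceil(2 * 90 / 60) = 3 replicas. If the ratio is sufficiently close to 1.0 (within a configurable tolerance, 10% by default) no scaling takes place.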

 

It is also important to understand that when an average value or average utilization target is specified, the currentMetricValue is computed by taking the average of the given metric across all Pods in the HorizontalPodAutoscaler's scale target. The resulting replica count is then bounded by the ‘minReplicas’ and ‘maxReplicas’ settings (see the MAS example below).

 

When managing the scale of a group of replicas using the HorizontalPodAutoscaler, it is possible that the number of replicas keeps fluctuating frequently due to the dynamic nature of the metrics evaluated. This is sometimes referred to as thrashing, or flapping. This is where the HPA behaviors definition comes into play.

 

Let’s look at some examples

Note: the generic examples below are taken from the Kubernetes documentation (see the references).

 

Any HPA target can be scaled based on the resource usage of the pods in the scaling target. When defining the pod specification, the resource requests for cpu and memory should be specified. These requests are used to determine the resource utilization, which the HPA controller uses to scale the target up or down. For example, to use resource utilization-based scaling, specify a metric source as follows:

 

type: Resource
resource:
  name: cpu
  target:
    type: Utilization
    averageUtilization: 60

 

With this definition (metric) the HPA controller will keep the average utilization of the pods in the scaling target at 60%. This is done by scaling up or down the number of pods within the bounds of the ‘minReplicas’ and ‘maxReplicas’ definitions.
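
For context, this fragment is one entry in the metrics list of the HPA spec. A minimal, generic sketch (the names here are placeholders, not a SAS Viya definition) might look like:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60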

 

Configuring scaling behavior

The ability to define behaviors was introduced with v2 of the HorizontalPodAutoscaler API. The behavior field is used to configure separate scale-up and scale-down behaviors. You specify these behaviors by setting scaleUp and / or scaleDown under the behavior field. Additionally, you can specify a stabilization window that prevents ‘flapping’ of the replica count for a scaling target.

 

The following example shows defining a behavior for scaling down:

 

behavior:
  scaleDown:
    policies:
    - type: Pods
      value: 4
      periodSeconds: 60
    - type: Percent
      value: 10
      periodSeconds: 60

 

The periodSeconds indicates the length of time in the past for which the policy must hold true. The first policy (type: Pods) allows at most 4 replicas to be scaled down in one minute. The second policy (type: Percent) allows at most 10% of the current replicas to be scaled down in one minute.

 

When you define multiple policies like this, by default the policy which allows the highest amount of change is selected. In this example, the second policy will only be used when the number of pod replicas is more than 40. This is because the second policy is specifying 10% of the running pods, this value will only be greater than 4 when there are more than 40 pod replicas.
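
If you want the most restrictive policy to win instead, you can set the selectPolicy field for that direction to Min (or to Disabled to turn scaling in that direction off completely). For example:

behavior:
  scaleDown:
    selectPolicy: Min
    policies:
    - type: Pods
      value: 4
      periodSeconds: 60
    - type: Percent
      value: 10
      periodSeconds: 60

With selectPolicy: Min, the policy allowing the smallest change is chosen: with more than 40 replicas at most 4 pods per minute would be removed, and with fewer than 40 replicas at most 10% of them.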

 

Setting the stabilization window

As previously stated, the stabilization window is used to restrict the ‘flapping’ of the replica count when the metrics used for scaling keep fluctuating. Hence, the stabilization window is used to avoid unwanted changes.

 

For example, the following snippet shows specifying a scale down stabilization window. In this example, all desired states from the past 5 minutes will be considered.

 

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300

 

Pulling this all together, here is a possible example for MAS…

Please note this isn’t a full worked example, which is another way of saying I haven’t tested it. 😊 Perhaps the HPA for MAS might look something like the following.

 

MG_3_202208_HPA_mas_example.png
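
For those who can’t see the image, a text version of roughly what I had in mind follows. To repeat the caveat above, this is an untested sketch, not a recommended configuration; the Deployment name and namespace are simply those from my environment:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sas-microanalytic-score
  namespace: viya-namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sas-microanalytic-score
  minReplicas: 2
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Pods
        value: 1
        periodSeconds: 30
    scaleDown:
      selectPolicy: Min
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
      - type: Percent
        value: 50
        periodSeconds: 60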

 

Let’s unpack this a little to see what I was trying to achieve:

  • I want a minimum of 2 MAS pods for HA reasons, but no more than 6 pods.
  • I want to trigger scaling at 60% CPU utilization.
  • There is no stabilization window for scaleUp events, but only 1 pod is added every 30 seconds. (You need to understand your environment to determine how long it takes for a MAS pod to be fully ready to receive workload.)
  • I have specified two scaleDown policies and the minimum of the two should be used. The first allows no more than 2 pods to be removed in a 60 second period, and the second allows 50% of the pods to be removed in a 60 second period.

 

In reality, I would probably just specify a single scaleDown policy with such a small number of replicas, but I wanted to show an example of using two policies.

 

Hopefully this example highlights the need to understand the MAS workload and how long the MAS pods take to start. Remember, the start-up time will depend on the number of models that have been published.

 

Conclusion

In this post we have only just scratched the surface of understanding HPAs; it is a truly complex subject. But I hope I have highlighted the need for a deep understanding of Kubernetes and how your applications run (behave) to be able to properly specify an HPA.

 

While in the MAS example I have shown defining the HPA based on utilization, it is also possible to set the target based on a value. For example, on the number of milli-cores or cores used for CPU, or the amount of memory used.
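
For example, a CPU target expressed as an average value rather than a utilization percentage might look like the following sketch (the 500m figure is purely illustrative):

type: Resource
resource:
  name: cpu
  target:
    type: AverageValue
    averageValue: 500m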

 

Finally, I would recommend load testing to fine tune the HPA definition.
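
While a load test is running, you can watch the HPA and the replica count react in real time. For example:

kubectl get hpa sas-microanalytic-score -n viya-namespace --watch

The events shown by ‘kubectl describe hpa’ are also useful for seeing exactly when, and why, each scaling decision was taken.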

 

I hope this is useful and thanks for reading.

 

References

 

Kubernetes documentation: Horizontal Pod Autoscaling. The high-level description and examples in this post are based on this Kubernetes documentation.

 

Find more articles from SAS Global Enablement and Learning here.

