The cost of Cloud infrastructure has become a critical factor.
Initially, the standard pricing model in the Cloud was "Pay-As-You-Go" (cloud services are billed for actual usage, typically at a fixed hourly price).
The nice thing about this model is that you only pay for actual usage and can scale down resources when needed.
But Production environments often need to remain available 24/7 (for example, to provide analytics capabilities to users around the world at all times), and Cloud providers now offer more advantageous pricing models for this use case, such as "Prepaid/Fixed Subscriptions" (where cloud customers pay for services upfront) or "Reserved Instances".
However, "Pay-As-You-Go" might still make sense for a demo or lab Viya environment that doesn't necessarily need to stay up overnight or be kept around all the time. You might also want to keep environments for "bursty" and batch scenarios that only run a few days per month.
But in such a case, what would be the best way to scale down the resources and reduce your costs?
In this article we'll look at a nice Azure feature that allows you to stop the entire AKS cluster where Viya is running and restart it later when needed.
If you have a dedicated cluster for SAS Viya and have configured node autoscaling in your Kubernetes cluster, you have already improved the cost efficiency of your environment.
In that case, when you stop the Viya services, your infrastructure should automatically scale down to the minimum number of nodes in each of the node pools.
For example, this is the case if you provisioned your cluster with the viya4-iac GitHub tool and kept the default "node pools" settings.
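If you want to double-check the autoscaling settings of an existing cluster, a quick way is to query the node pools with the az CLI. Here is a minimal sketch, reusing the viya-1-aks cluster and viya-1-rg resource group names that appear in the command later in this article:
# List each node pool with its autoscaling status and min/max node counts
az aks nodepool list --cluster-name viya-1-aks --resource-group viya-1-rg --query "[].{pool:name, autoscaling:enableAutoScaling, min:minCount, max:maxCount}" -o table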
Once you stop all the Viya services (there is a specific Kubernetes cronjob for that), Kubernetes detects that there are no more resource requests from the Viya pods and triggers the Cloud autoscaler to decommission nodes until their number equals the configured min_node value in each node pool (0 in this case for the Viya node pools).
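For reference, here is a minimal sketch of how that stop could be triggered manually, assuming the deployment includes the standard sas-stop-all cronjob and that the SAS Viya namespace is called viya (adjust both to your own deployment):
# Create a one-off job from the sas-stop-all cronjob to stop all the Viya services
kubectl -n viya create job --from=cronjob/sas-stop-all sas-stop-all-manual
# Then watch the autoscaler decommission the now-empty Viya nodes
kubectl get nodes -w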
Several SAS runtimes (like ESP or the compute server) can also benefit from autoscaling, so that only the number of nodes required for the pods corresponding to the users' requests is kept, providing true cloud elasticity.
For example, we could set the compute node range from 2 to 10: the system would start with 2 nodes but could accommodate an increase in demand, with extra SAS compute sessions translating into additional container resource requests (expressed in CPU and memory) in the Kubernetes cluster and triggering the Azure autoscaler.
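As a sketch, assuming the cluster autoscaler is already enabled on the "compute" node pool, and reusing the viya-1-aks and viya-1-rg names from the command shown a bit further below, such a range could be applied like this:
# Change the autoscaler range of the "compute" node pool to 2-10 nodes
az aks nodepool update --update-cluster-autoscaler --cluster-name viya-1-aks --resource-group viya-1-rg --name compute --min-count 2 --max-count 10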
That will already save a lot of Cloud money because we shrink down the computing resources when we don't need them.
Note that in our example we defined the autoscaling settings during the initial provisioning of the AKS cluster (with Terraform), BUT all is not lost for those who did not have autoscaling enabled during the initial setup: it can be enabled after the fact by running a command similar to the following in the Azure Cloud Shell:
az aks nodepool update --enable-cluster-autoscaler --cluster-name viya-1-aks --resource-group viya-1-rg --name compute --min-count 0 --max-count 5
Where viya-1-aks is the cluster name and viya-1-rg its associated resource group. In this example, the command only affects the "compute" node pool.
However, even when autoscaling is enabled and all the nodes are scaled down, the Kubernetes system pool (as well as the associated network components) remains active and continues to generate costs every hour.
A cleaner and more complete way to reduce the Kubernetes cluster bill in Azure during a known period of inactivity is to use the "AKS stop/start" feature.
Using this feature is like "pausing" a video and resuming it sometime later.
Note that the "stop" button is now also available in the Azure portal (however, using the az CLI command makes automating and scheduling the stop/start process much easier).
Behind the scenes, it leverages the fact that AKS already backs up the cluster state for resiliency: the only state in the Kubernetes system is really the contents of etcd.
As noted in the official Azure documentation, the feature comes with some limitations.
However, it remains a very handy feature and, according to our tests, it works well with the Viya environment.
Here is an example of the process (that can easily be automated):
- Run the az aks stop command to pause the cluster.
- Run the az aks start command to restart the cluster later when needed.
Assuming any in-memory data (such as CAS output tables) has been properly saved, there is no need to take extra backup or to stop the Viya services before running the AKS stop command.
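As a minimal sketch, reusing the viya-1-aks cluster and viya-1-rg resource group names from the earlier example, the two commands look like this:
# Pause the whole AKS cluster (the nodes are deallocated, so you stop paying for their compute)
az aks stop --name viya-1-aks --resource-group viya-1-rg
# Later, when the environment is needed again, resume it
az aks start --name viya-1-aks --resource-group viya-1-rg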
Several methods have been tried by our SAS colleagues from various teams to keep AKS costs down when the environment is not used (node autoscaling with a minimum of 0 nodes for all the node pools, automating the stop/start of the VM scale sets, etc.). However, the AKS stop feature seems to be the simplest and most efficient way to do it.
But the Kubernetes cluster is rarely the only Cloud Infrastructure piece used by the Viya environment.
Stopping the AKS cluster will not automatically stop the jump host or NFS server VMs, and "satellite" components like the NetApp storage services or the Azure PostgreSQL database will likely continue to generate significant costs.
So, a good practice would be to implement and test a true CI/CD process that also automates the stop of these services, whenever possible, when the Viya environment is not in use (for example, the standard Azure PostgreSQL database cannot be stopped, but the Flexible Server can, and work is in progress to officially support it in the Viya 4 IaC tool in the future).
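As an illustration, here is what stopping those satellite pieces could look like with the az CLI. The resource names viya-1-jump-vm and viya-1-flexpsql are purely hypothetical, so replace them with your own:
# Deallocate the jump host VM so its compute is no longer billed (hypothetical VM name)
az vm deallocate --resource-group viya-1-rg --name viya-1-jump-vm
# Stop the Azure PostgreSQL Flexible Server (hypothetical server name)
az postgres flexible-server stop --resource-group viya-1-rg --name viya-1-flexpsql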
Finally, this capability is quite unique to Azure. With the other managed Kubernetes services (such as GKE or EKS), you can't really stop the whole Kubernetes cluster like this, as the master nodes / control planes are managed directly by the Cloud providers (GCP or AWS).
Find more articles from SAS Global Enablement and Learning here.
Hi @RPoumarede
Is there anything equivalent on AWS (EKS)?
We have set up SAS Viya 4 on EKS, and when I run the "sas-stop-all" job all SAS pods are killed and the node pools are scaled down (they have a minimum of 0 servers), but the EKS cluster is still running and the "default" node is still running the system pods like the ingress controller and other EKS components, and we are charged for this server... Any ideas?
Thanks,
Eyal