
Detecting and shutting down applications running in Kubernetes

Started ‎11-27-2019
Modified ‎11-27-2019

Editor's note: The SAS Analytics Cloud is a new Software as a Service (SaaS) offering from SAS using containerized technology on the SAS Cloud. You can find out more or take advantage of a SAS Analytics Cloud free trial.


This is one of several related articles the Analytics Cloud team has put together while operating in this new digital realm. These articles address enhancements the support team made to a production Kubernetes cluster in order to meet customers' application needs. They also walk through a couple of technical issues encountered and the solutions developed to solve them.


Articles in the sequence:

How to secure, pre-seed and speed up Kubernetes deployments

Implementing Domain Based Network Access Rules in Kubernetes

Detect and manage idle applications in Kubernetes (current article)

Extending change management into Kubernetes


Detecting and shutting down applications running in Kubernetes

While developing SAS Analytics Cloud, we faced an ongoing challenge of balancing application performance against efficient hardware usage. Under certain conditions, we observed artificial resource starvation in our Kubernetes clusters: with the system executing no real work, Kubernetes was still unable to schedule more tasks. To address this, SAS Analytics Cloud adopted a repeatable pattern: detect idle applications that are reserving resources and shut them down to release those resources.


With the multi-fold advantages of cloud computing, it’s no wonder many businesses are migrating applications en masse. Many companies have started the move by adopting the lift-and-shift approach to free up old infrastructure. While this doesn't magically make an application cloud-native, it allows developers to iteratively improve their applications, adopting cloud-native features and techniques over time.


As Kubernetes cluster operators, we welcome developers to join our utopian world of self-healing, maintenance during business hours, and efficient use of resources. However, we must balance the stability of our clusters with applications not yet fully cloud native. In this article, we’ll describe a pattern we've developed to automatically start and stop applications running in Kubernetes, without requiring modification to the application code.


A problem with no canonical solution

For one of our primary platforms, we provision per-user copies of an application with a substantial footprint. The resource usage is quite low when the application is not active. However, user work causes substantial spikes in CPU and/or memory usage. To ensure performance and stability for each user, we use Kubernetes’ resource requests mechanism with high values. While this ensures stability, it may impede Kubernetes’ ability to schedule a new pod in a cluster doing very little.
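As a sketch, the per-user application's container spec might carry a resource stanza of the following shape (the values shown are illustrative, not the actual figures used by SAS Analytics Cloud):

```yaml
# Illustrative container resources for one per-user application pod.
# High requests guarantee headroom for usage spikes, but Kubernetes
# reserves them at scheduling time even while the pod sits idle.
resources:
  requests:
    cpu: "2"
    memory: 8Gi
  limits:
    cpu: "4"
    memory: 8Gi
```

With a few dozen idle users, requests like these can exhaust a node pool's schedulable capacity while actual utilization stays near zero.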


Unfortunately, most cloud-native approaches to resource management were a poor fit. Lowering the request values runs the risk of poor performance on crowded nodes (i.e. noisy neighbors). Horizontal Pod Autoscaling requires a fully cloud-native application, which is beyond the control of cluster operators. Vertical Pod Autoscaling requires pod restarts which can be disruptive to user workloads and may not work for some resource usage patterns.


With no upstream solutions identified, we developed a pattern to detect idle pods and shut them down. We further detect when traffic streams to a non-existent application and start it dynamically.


Components and requirements

To effectively solve the problem, we first enumerate relevant properties of the components and our operational environment.

  • The application workload is user driven.
  • The application starts up relatively quickly, from the perspective of a user waiting in a browser.
  • The work originates outside the cluster, accessing the application via a Kubernetes Ingress.
  • No application modifications by the operations team.
  • The application is provided to the operations team as a container image.
  • While the operations team cannot modify the code, it can “wrap” the container image with additional layers.
  • The operations team specifies the Ingress Controller owning the application’s Ingress.
  • When traffic flows to an application we’ve shut down, we start it without any explicit out-of-band actions.
  • There is no explicit requirement for a high CPU/memory footprint. However, such applications are high-value targets that justify the additional work and complexity.


By disallowing code modification, we target applications in the early phases of “lift-and-shift", allowing developers to establish their own timelines for updating them. This requirement also allows targeting third-party applications where the code is forever beyond reach. Requiring additional layers may be avoidable, but it affords a desirable level of observation and control to the operations team. Requiring an Ingress heavily informed our approach and is unavoidable in the solution described here.


Detailed workflow

The problem bisects into two independent tasks: shutting down idle applications and starting applications as needed.


Detecting idle applications and shutting them down

The first step to determining idleness is adding metrics about the application. Our targeted application uses Apache, which we used as a hook for metrics generation. In a new Dockerfile, we start with the provided application image as a base, adding /etc/httpd/conf.d/metrics.conf along with a custom script. The configuration file specifies a logging module that pipes formatted content to our script. This approach easily adapts to other web servers and proxies. An alternate approach is tailing existing logs from any source, though that offers less control over the log format and its interpretation.


As the script runs, it parses the logs and serves pertinent Prometheus metrics. After filtering out some noise, a simple gauge-type metric for “seconds since last hit” fit our needs perfectly. Other applications may require more complex metrics.
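The core of such a script can be sketched as follows. This is a simplified stand-in for the actual script, which the article does not publish; the noise filter pattern and metric name are illustrative assumptions:

```python
import re
import time

# Hypothetical noise filter: ignore health probes and static assets
# so they don't reset the idle clock.
NOISE = re.compile(r'GET /(healthz|favicon\.ico)')

class LastHitTracker:
    """Tracks the most recent 'real' request seen in an access-log stream."""

    def __init__(self, now=None):
        self.last_hit = now if now is not None else time.time()

    def observe(self, log_line, now=None):
        # Count the line as a hit unless it matches the noise filter.
        if not NOISE.search(log_line):
            self.last_hit = now if now is not None else time.time()

    def seconds_since_last_hit(self, now=None):
        now = now if now is not None else time.time()
        return max(0.0, now - self.last_hit)

    def prometheus_text(self, now=None):
        # Render the gauge in Prometheus text exposition format.
        return ('# TYPE seconds_since_last_hit gauge\n'
                'seconds_since_last_hit {:.0f}\n'
                .format(self.seconds_since_last_hit(now)))
```

In the real pipeline, the log lines arrive on stdin from Apache's piped log directive and the rendered text is served over HTTP for Prometheus-style scraping.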

With raw metrics available, we need a component to consume them and apply some logic. In addition to the large application we are targeting, we provide a core services pod in the namespace. An additional service is added there, polling the metrics endpoint and comparing the response against a configurable threshold. If the “seconds since last hit” is greater than “seconds until idle”, the service scales the target application to 0 replicas. You could adapt this technique to run as a sidecar or in the metrics-creating script.
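The decision logic of that polling service might look like this. It is a sketch under stated assumptions: the metric name matches the example above, and the scale-down action is injected as a callback (in our cluster it would patch the target Deployment to 0 replicas through the Kubernetes API):

```python
def parse_gauge(metrics_text, name='seconds_since_last_hit'):
    # Pull a single gauge value out of Prometheus text exposition output,
    # skipping '# TYPE' / '# HELP' comment lines.
    for line in metrics_text.splitlines():
        if line.startswith(name):
            return float(line.split()[-1])
    return None

def check_idle(metrics_text, seconds_until_idle, scale_to_zero):
    # scale_to_zero is a callback so the decision logic stays testable;
    # the real implementation calls the Kubernetes API to set replicas=0.
    idle_for = parse_gauge(metrics_text)
    if idle_for is not None and idle_for > seconds_until_idle:
        scale_to_zero()
        return True
    return False
```

Running this check on a timer against each target application's metrics endpoint completes the shutdown half of the pattern.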


Detecting user traffic and starting the application

Our solution relies on NGINX Ingress Controllers managing the application Ingresses. The configuration for the Ingress Controller requires two specific CLI flags: --default-backend-service and --configmap. The first flag specifies a service that functions as a catch-all for traffic that can’t reach its destination. The ConfigMap uses the property custom-http-errors to determine which HTTP response codes to intercept. For this solution, 503 must be in that list of response codes.
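The wiring described above might look like the following (the namespace and service names are illustrative assumptions, not the article's actual values; the flags and the custom-http-errors key are standard NGINX Ingress Controller options):

```yaml
# Illustrative: ConfigMap consumed via the controller's --configmap flag.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration
  namespace: ingress-nginx
data:
  # Intercept 503s and route them to the default backend.
  custom-http-errors: "503"
---
# Corresponding controller container args (names illustrative):
# - --default-backend-service=ingress-nginx/custom-default-backend
# - --configmap=ingress-nginx/nginx-configuration
```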


When ingress traffic streams toward a shut-down application, NGINX determines that the backing service has no available endpoints and would typically respond with a 503. However, with the configuration specified above, the traffic collects additional headers and is forwarded to the custom backend. Using the headers, the backend infers the targeted application, begins a workflow to start it, and redirects the user to a holding page. The holding page polls the back-end service periodically until it is available, and finally redirects the user to the application.
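The poll-until-ready step can be sketched as below. This is a hypothetical helper, not the article's code; in practice the same loop could equally run as client-side JavaScript on the holding page. The probe is injectable so the timing logic can be exercised without a live service:

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(url, timeout, interval=5, probe=None):
    # probe(url) -> bool; by default, issue a real HTTP GET and treat
    # any 200 response as "the application is back up".
    if probe is None:
        def probe(u):
            try:
                return urllib.request.urlopen(u, timeout=3).getcode() == 200
            except urllib.error.URLError:
                return False
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe(url):
            return True   # caller can now redirect the user
        time.sleep(interval)
    return False          # give up; surface an error to the user
```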


For applications that start quickly, it is possible to handle the redirect without the holding page. While not yet explored, it might also be possible to reassemble the original request and let the newly launched application handle it as normal. Both scenarios depend heavily on the nature of the application.


Sample code

The following is a simplified version of the default backend endpoint, within a Python Flask application. For clarity, parameterized elements are included in-line; the label key, header names for the target deployment, and URL components shown here are illustrative placeholders for site-specific values.


import threading

from flask import Flask, redirect, request
from kubernetes import client, config

app = Flask(__name__)
config.load_incluster_config()

# scale_up() (patches the Deployment back to 1 replica) and
# CLUSTER_DOMAIN are site-specific and omitted here.

@app.route('/', methods=['GET', 'POST'])
def index():
    headers = request.headers

    # Only handle "Service Unavailable" scenarios
    if 'X-CODE' in headers and headers['X-CODE'] == '503':
        core_api = client.CoreV1Api()
        apps_api = client.AppsV1Api()

        # Verify we should be working on this namespace
        # ('idling-enabled' is an illustrative label key)
        namespace = core_api.read_namespace(headers['X-Namespace'])
        if namespace.metadata.labels.get('idling-enabled', None) == 'true':

            # Check for target deployment.
            deployment = apps_api.read_namespaced_deployment(
                headers['X-Service-Name'], namespace.metadata.name)
            if deployment.spec.replicas == 0:
                t = threading.Thread(
                    target=scale_up,
                    kwargs={'namespace': namespace, 'deployment': deployment})
                t.start()

            # Redirect user to a holding page
            url = 'https://{}.{}/startup?app={}'.format(
                namespace.metadata.name, CLUSTER_DOMAIN,
                deployment.metadata.name)
            return redirect(url)



By implementing this idling pattern, SAS Analytics Cloud ensures a minimum performance level without excessive risk of artificial resource starvation. The flexibility of the pattern has allowed us to apply similar logic to multiple applications. The ability to automatically relaunch an application when traffic is detected has minimized the visible customer impact. Overall, the technique has proven useful and will continue to be applied to new applications as they become available in SAS Analytics Cloud.


