Assessing SAS Viya Readiness

SAS Viya on Kubernetes is a complicated system composed of more than 140 pods and more than 160 distinct services, and that is for a fairly simple deployment. When an administrator starts a SAS Viya deployment, all of these pods race to initialize and establish communication with the other components they depend on. So how is an administrator supposed to determine when SAS Viya is up and ready for users? This very issue was raised by one of the SAS Viya early preview customers, so before SAS Viya was publicly released, SAS added the sas-readiness service to help assess whether the deployment is ready for work. The primary purpose of this service is to serve as a single contact point for administrators to determine when all of the SAS Viya services are ready to accept traffic.

What does 'ready' mean? (or...How long is a piece of string?)

One of the tough questions that had to be answered was, what does ready mean? Ask ten people and you will probably get eleven opinions. Does it mean that a minimum number of services are up and running so a user can log on? Does it mean a user can log on and run a batch job? Run a Visual Analytics report? Load data into CAS? The problem with approaching the question from a functional standpoint is that no two SAS Viya deployments are used in identical ways, so what ready means for one may not mean the same for another.

For now, the sas-readiness service works under the assumption that if all services in the deployment are accepting traffic, we can presume that Viya is functionally ready and the system should be responsive to user input. Yes, it is a very granular approach to ready but using this standard has some advantages which we will look at shortly.

How does it work?

Every Viya service exposes an /internal/ready endpoint that returns an HTTP response code in the 200s if the service is ready to receive traffic; any other value is interpreted as 'not ready.' The sas-readiness service probes the /internal/ready endpoint using HTTP GET and checks the return code. Nice and tidy...and fast. Because the probe is so lightweight, sas-readiness is able to re-probe each service every 30 seconds without overburdening system resources.
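To make the rule concrete, here is a minimal Python sketch of that kind of probe. It is not the sas-readiness implementation, only an illustration of "GET /internal/ready, 2xx means ready." The function names, the base_url parameter, and the 5-second timeout are my own assumptions.

```python
import urllib.error
import urllib.request


def is_ready_status(status_code: int) -> bool:
    """Any 2xx response counts as ready; everything else is not ready."""
    return 200 <= status_code <= 299


def probe(base_url: str, timeout: float = 5.0) -> bool:
    """GET <base_url>/internal/ready and report whether the service looks ready.

    base_url and timeout are illustrative parameters, not SAS settings.
    """
    url = base_url.rstrip("/") + "/internal/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_ready_status(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the status code is still available.
        return is_ready_status(err.code)
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: treat as not ready.
        return False
```

Keeping the status check separate from the network call makes the "anything outside the 200s is not ready" rule easy to verify on its own.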

If sas-readiness detects any failed requests, it emits a single log message that reports on all services that responded with a failure code. For example, this is a message I captured during startup of one of my test deployments.

{
	"level": "info",
	"version": 1,
	"source": "sas-readiness",
	"messageKey": "readiness-log-icu.check.failed.log",
	"messageParameters": {
		"check": "sas-endpoints-ready",
		"message": "17 endpoints have no available addresses: sas-audit,sas-connect-spawner,sas-data-flows,sas-decision-manager-app,sas-device-management,sas-drive-app,sas-graph-builder-app,sas-job-execution-app,sas-lineage-app,sas-model-manager-app,sas-model-studio-app,sas-report-renderer,sas-score-definitions,sas-score-execution,sas-theme-designer-app,sas-visual-analytics-app,sas-workflow-manager-app"
	},
	"properties": {
		"caller": "checks/aggregate_ready.go:69"
	},
	"attributes": {
		"failedCheck": {
			"version": 0,
			"status": 1,
			"message": "17 endpoints have no available addresses: sas-audit,sas-connect-spawner,sas-data-flows,sas-decision-manager-app,sas-device-management,sas-drive-app,sas-graph-builder-app,sas-job-execution-app,sas-lineage-app,sas-model-manager-app,sas-model-studio-app,sas-report-renderer,sas-score-definitions,sas-score-execution,sas-theme-designer-app,sas-visual-analytics-app,sas-workflow-manager-app",
			"timeStamp": "2021-02-01T17:40:49.715387407Z",
			"name": "sas-endpoints-ready",
			"attributes": {
				"notReadyEndpoints": [
					"sas-audit",
					"sas-connect-spawner",
					"sas-data-flows",
					"sas-decision-manager-app",
					"sas-device-management",
					"sas-drive-app",
					"sas-graph-builder-app",
					"sas-job-execution-app",
					"sas-lineage-app",
					"sas-model-manager-app",
					"sas-model-studio-app",
					"sas-report-renderer",
					"sas-score-definitions",
					"sas-score-execution",
					"sas-theme-designer-app",
					"sas-visual-analytics-app",
					"sas-workflow-manager-app"
				]
			}
		}
	},
	"timeStamp": "2021-02-01T17:40:49.912356+00:00",
	"message": "The check \"sas-endpoints-ready\" failed - 17 endpoints have no available addresses: sas-audit,sas-connect-spawner,sas-data-flows,sas-decision-manager-app,sas-device-management,sas-drive-app,sas-graph-builder-app,sas-job-execution-app,sas-lineage-app,sas-model-manager-app,sas-model-studio-app,sas-report-renderer,sas-score-definitions,sas-score-execution,sas-theme-designer-app,sas-visual-analytics-app,sas-workflow-manager-app"
}

Even though 17 services were not ready, this one log message aggregates the information across all non-responsive services. Not only does this simplify interpreting results, it also helps to reduce the volume of log messages and keeps the sas-readiness response time lightning quick.
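Because the aggregated message carries the failing services as a JSON array, a script can pull them out directly. The sketch below assumes each log record arrives as one JSON document with the same shape as the message above. The function name and the trimmed two-endpoint sample are mine, for illustration only.

```python
import json
from typing import List


def not_ready_endpoints(log_line: str) -> List[str]:
    """Return the endpoint names listed in a failed-check record, or [] if none."""
    record = json.loads(log_line)
    failed_check = record.get("attributes", {}).get("failedCheck", {})
    return failed_check.get("attributes", {}).get("notReadyEndpoints", [])


# Trimmed-down record with the same shape as the message above (two endpoints only).
sample = json.dumps({
    "source": "sas-readiness",
    "messageKey": "readiness-log-icu.check.failed.log",
    "attributes": {
        "failedCheck": {
            "name": "sas-endpoints-ready",
            "attributes": {
                "notReadyEndpoints": ["sas-audit", "sas-connect-spawner"]
            },
        }
    },
})

print(not_ready_endpoints(sample))
```

A record without a failedCheck section, such as an 'All checks passed' message, simply yields an empty list.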

When the sas-readiness probe receives success codes from all known services, it does two things:

- It again emits a single log message stating that all checks passed and the system is marked as ready.
- The sas-readiness service reports its own state as Ready.

{
	"level": "info",
	"version": 1,
	"source": "sas-readiness",
	"messageKey": "readiness-log-icu.checks.all.passed.log",
	"properties": {
		"caller": "checks/aggregate_ready.go:79"
	},
	"timeStamp": "2021-02-01T17:42:20.136612+00:00",
	"message": "All checks passed. Marking as ready."
}

Even though the system has been deemed ready, the sas-readiness service continues to probe every 30 seconds. However, once the 'All checks passed' message has been emitted, subsequent 'ready' results are not noted in the log. In fact, no additional log messages are emitted by the sas-readiness service until a failure is detected. This, again, reduces log volume and prevents redundant 'system ready' messages from appearing.
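The logging behavior described here amounts to a small state machine. The following Python sketch is my own illustration of that idea, not code from sas-readiness. The class and method names are hypothetical, and it assumes that every failing probe cycle produces one aggregated message.

```python
from typing import List, Optional


class ReadinessLogger:
    """Decides what to log for each probe cycle (hypothetical helper, not SAS code)."""

    def __init__(self) -> None:
        self.marked_ready = False

    def record_cycle(self, not_ready: List[str]) -> Optional[str]:
        """Return the message to log for this cycle, or None to stay quiet."""
        if not_ready:
            # Any failure clears the ready state and is reported in one message.
            self.marked_ready = False
            names = ",".join(sorted(not_ready))
            return f"{len(not_ready)} endpoints have no available addresses: {names}"
        if not self.marked_ready:
            # First fully successful cycle after startup or after a failure.
            self.marked_ready = True
            return "All checks passed. Marking as ready."
        # Already marked ready: repeated successes are not logged.
        return None
```

Feeding it a failing cycle, then two passing cycles, yields one failure message, one 'All checks passed' message, and then nothing until the next failure.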

How can you get the readiness status?

As an administrator, you do not really need to go log spelunking to determine whether your Viya deployment is ready or not. Because the state of the sas-readiness pod itself reflects the readiness of the deployment, you can use the following command to have Kubernetes monitor the sas-readiness pod and let you know when the pod's condition has been set to Ready. In this example, I have asked to keep testing the condition for 30 minutes before giving up.

$ kubectl wait --for=condition=Ready pod --selector="app.kubernetes.io/name=sas-readiness" --timeout=1800s

If the timeout threshold is exceeded before the sas-readiness pod returns Ready, I will see this message, which means that I need to decide whether something is wrong or the system simply did not come up within 30 minutes.

error: timed out waiting for the condition on pods/sas-readiness-7d487b9fd-v5rdt

However, the message I want to see is this one, which indicates that the sas-readiness pod is reporting Ready. And since that can only happen when the sas-readiness service successfully marked the system as ready, I can presume that my Viya deployment is ready for users.

pod/sas-readiness-7d487b9fd-v5rdt condition met

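An administrator could also directly query the status of the sas-readiness deployment to see if readiness has been achieved. This technique is a tad more convenient for scripting because the command prints the number of ready replicas: 1 once sas-readiness has marked itself as ready. Before that point, expect either 0 or no output at all, since Kubernetes generally omits the readyReplicas field when no replica is ready, so a script should treat empty output as not ready. Here's an example command: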

kubectl get deployments sas-readiness -o jsonpath='{.status.readyReplicas}'
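For scripting, the output of that query has to be turned into a yes/no answer. Here is a hedged Python sketch of one way to do it. The function names and the namespace parameter are my own additions, and the helper treats empty output as not ready on the assumption that the readyReplicas field may be absent when no replica is ready.

```python
import subprocess


def is_ready(ready_replicas_output: str) -> bool:
    """Interpret the text printed by the jsonpath query."""
    text = ready_replicas_output.strip()
    # Empty output is treated as not ready (assumption: the field may be absent).
    return text.isdigit() and int(text) >= 1


def query_readiness(namespace: str) -> bool:
    """Run the kubectl query shown above; needs kubectl and access to the cluster.

    The namespace argument is an illustrative addition to the original command.
    """
    result = subprocess.run(
        ["kubectl", "get", "deployments", "sas-readiness",
         "--namespace", namespace,
         "-o", "jsonpath={.status.readyReplicas}"],
        capture_output=True, text=True, check=False,
    )
    return result.returncode == 0 and is_ready(result.stdout)
```

Separating the parsing from the kubectl call means the interpretation rule can be checked without a cluster.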

What components are not tested by sas-readiness?

As an administrator attempts to assess the readiness of their Viya deployment, there are a few things that are not addressed by the sas-readiness service.

- The sas-readiness service runs from within the Kubernetes cluster, so it does not go through the ingress controller to communicate with the other Viya services. Therefore, if there is a problem with the ingress controller, it is possible that users cannot access the system even though sas-readiness indicates the system is ready. Knowing this, the administrator can be on the lookout for this symptom and focus on the ingress controller if this occurs.
- The sas-readiness service probes the services that are defined in the Viya namespace but has no way to determine if services that should be there are missing.
- As mentioned earlier, sas-readiness does not perform any functional testing, such as:
  - Can a user log on and obtain an OAuth token?
  - Can a connection to CAS be established?
  - Can a job execute?

Ways to run these and possibly other functional tests are being considered and may appear in future releases. Including functional tests is a challenge because the tests typically take a bit longer to execute and would consume system resources to perform. Both of these issues would require running the suite of readiness tests less frequently.

So to wrap this up, Viya administrators now have a single point of contact to assess the readiness of a Viya deployment to accept traffic. And while there are certainly other facets of the system that may affect the readiness of SAS Viya, the sas-readiness service is a tool administrators can use to gain confidence that at least the Viya services are responding and should be ready for users.

Find more articles from SAS Global Enablement and Learning here.

Bueno · 03-30-2021

Good Call !