In the GEL SAS Viya administration class, we finish the week with a troubleshooting section. In this section, we create problems in a SAS Viya deployment and ask students to fix them. In this post, I will cover the basic troubleshooting steps we follow. Following a general process for identifying and resolving issues can facilitate effective troubleshooting.
The diagram below shows a generic SAS Viya issue identification and resolution method. Initially, we capture the message from the interface or relevant details from the problem process’s logs. The next step involves identifying the servers and services contributing to the problem, which can be challenging without a solid understanding of SAS Viya's architecture. For example, if a report fails to open, potential causes could include issues with report configurations, authorization settings, identity management, or data-related problems.
After pinpointing the servers and services involved, the next task is to collect additional information. Examining messages preceding and following the error is crucial to gain context. We adjust logging levels for affected services to replicate the issue and gather more diagnostic data if necessary. Once sufficient information is collected to address the problem and the problem is fixed, it's important to reset logging levels to defaults to prevent system performance degradation.
This process is fairly generic and could be used for any software problem. Let's examine the specifics of collecting more information about a problem in SAS Viya running on Kubernetes.
Troubleshooting issues with SAS Viya often involves examining Kubernetes resources using the Kubernetes command-line tool kubectl or a graphical tool like Lens. The diagram below shows key kubectl commands to identify the Viya services with the problem and collect more information. These will provide the basic information to start debugging a problem. The kubectl get, describe, and logs commands are shown here for pods but can be used for other Kubernetes resources. For example, you could also do a kubectl get pvc or a kubectl describe job. Let's look at what information you can get from these commands.
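For instance, assuming the SAS Viya namespace is gelcorp (an illustrative name, substitute your own), a few variations might look like this:

# List persistent volume claims in the namespace
kubectl get pvc -n gelcorp
# List jobs, then describe one of them for details
kubectl get jobs -n gelcorp
kubectl describe job <job-name> -n gelcorp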
SAS Viya services run in Kubernetes PODS. One method to identify which PODS to focus on when researching a problem is to use the SAS Viya readiness service. The readiness service checks the status of the SAS Viya platform to determine whether it is ready for use. The service performs all of its checks every 30 seconds. After the software is deployed, the service can be consulted to determine if the deployment is ready for use or if there are PODS with issues. To do that, you can check the readiness POD status, and for more detailed information, you can use the readiness POD log to identify the services with problems.
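For example, a quick check of the readiness service might look like this (using the app=sas-readiness label, as we do again later in this post):

# Check whether the readiness POD reports 1/1 Ready
kubectl get pods -l app=sas-readiness
# Review the readiness log for the list of services that are not yet ready
kubectl logs -l app=sas-readiness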
Kubectl get
The first step in debugging is often to perform a kubectl get pods for all or some of the PODS in the SAS Viya deployment. The kubectl get command lists Kubernetes resources and their status. The output below shows a partial screenshot of a get pods on a Viya deployment.
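The command behind output like this is simply the following (gelcorp is an illustrative namespace name; -o wide also shows the node each POD runs on):

kubectl get pods -n gelcorp
kubectl get pods -n gelcorp -o wide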
Let's look at what information we can get from the output. The Status column of the output shows the POD status. Possible POD status values are:
Pending: The POD has been accepted by the Kubernetes cluster, but one or more of the containers has not been set up and made ready to run.
Running: The POD has been bound to a node, and all of the containers have been created. At least one container is still running or is in the process of starting or restarting.
Succeeded: All containers in the POD have terminated successfully and will not be restarted (the status for a POD that is part of a Job).
Failed: All containers in the POD have terminated, and at least one container has terminated in failure.
PODS with issues may have a status of Failed or Pending. Running is what you would expect from a functioning POD; however, a POD in a Running state is not necessarily functioning properly. In the get pods output, the Ready column shows two numbers for each POD. The second is the total number of containers in the POD, and the first is the number of ready containers. For example, 1/1 indicates that all containers in the POD are ready and running, whereas 0/3 indicates that none of the 3 containers in the POD are ready.
We can see two examples of this in the output above, where sas-cas-server-default-controller has a 0/3 and sas-cas-control has a 0/1. A Ready value where the two numbers do not match can indicate a problem. Notice the status of sas-cas-server-default-controller shows Init:0/2. In Kubernetes, the status Init:0/2 for a POD indicates that the POD has two init containers, and neither of them has completed successfully yet. (Init containers perform startup tasks in PODS, which must be completed before the main application containers start.) The output here indicates there is a problem starting the CAS server. You can also use labels or the grep command to get the status of specific PODS. Here are some examples of targeted get commands:
Get the SAS Folders pod: kubectl get pods -l app=sas-folders
Get all PODS with mining in the name: kubectl get pods | grep mining
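Another handy variation is to list only the PODS that are not in a Running phase (note that this does not catch PODS that are Running but not ready):

kubectl get pods --field-selector=status.phase!=Running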
Kubectl describe
Once you have identified where the problem is, you can use the kubectl describe command to get more details about a POD. This command provides detailed information about Kubernetes resources, including overall status, status of individual containers, configuration, labels, and resource requests. The event history of the POD (at the bottom of the output) is particularly useful for debugging because it displays recent events and problems encountered during the resource’s lifecycle, which can help identify issues. This example does a kubectl describe of the CAS controller POD.
kubectl describe pod sas-cas-server-default-controller
Notice the event history shows a failure to mount a directory from an NFS server.
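If you only want the event history from a long describe output, a simple filter like the one below works; the -A 15 line count is arbitrary:

kubectl describe pod sas-cas-server-default-controller | grep -A 15 Events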
Kubectl logs
Of course, logs are a primary source of information for identifying and debugging problems. The kubectl logs command lets you access logs from your resources on a per-container basis or in aggregate. You can view a snapshot of currently collected logs, continually stream new log lines to your terminal, and access historical log lines emitted by terminated containers. Useful options on the logs command are: -f to stream the logs, -c to select a specific container, --tail to specify the number of lines to view at the end of the log, and --since to specify a time period. Since we have a problem with sas-cas-control, let's look at its logs with the kubectl logs command, using the label to identify the current POD.
kubectl logs -l app=sas-cas-control | gel_log
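For reference, the options mentioned above might be combined as follows; the container name and POD suffix are placeholders, and you can list a POD's containers first if you are unsure:

# List the containers in the POD (useful before choosing -c)
kubectl get pod sas-cas-server-default-controller -o jsonpath='{.spec.containers[*].name}'
# Follow the last 100 lines of one container's log
kubectl logs -f --tail=100 sas-cas-server-default-controller -c <container-name>
# Show the last hour of log from the previous, terminated instance of a container
kubectl logs --since=1h --previous sas-cas-control-<pod-suffix>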
To debug the problem, we may need to raise log levels on Viya services. The process for adjusting log levels depends on the type of service. For most services, we can make the changes in the configuration area of SAS Environment Manager. For example, if we have issues with the identities service, we can set its logging level to "debug" to gather more detailed information. However, the process is slightly different for CAS and the SAS Programming Run-Time. We still make the changes in SAS Environment Manager, but we need to edit the XML for the specific service. If we do change the logging levels, we should be sure to reset them when we are done.
In our example, the problem is clear. The log shows that sas-cas-control is waiting for the CAS server to be ready. The two issues are related: the CAS server cannot start, and sas-cas-control is waiting for a CAS server. The underlying issue, where the CAS server cannot mount the NFS server directory, could have various root causes: networking issues, permissions on the mounted directory, and so on.
Kubectl events
What if we don't get a log? This can happen when containers in a POD fail to start or keep restarting, which makes the issue challenging to debug. In these scenarios, you can use the kubectl events command to gather useful information. The kubectl events command returns information from the Kubernetes event log for the namespace. When Kubernetes components such as nodes, pods, or containers change state, they automatically generate events to document the change. Events provide key information about the health and status of your cluster. They inform you if container creations are failing and pods are being rescheduled. Information in events can help you troubleshoot issues. In the following command, we look for messages relating to CAS:
kubectl get events | grep sas-cas
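A couple of useful variations, assuming a reasonably recent kubectl release for the second command:

# Sort events by time so the most recent appear last
kubectl get events --sort-by=.lastTimestamp
# Show only the events for a specific POD
kubectl events --for pod/sas-cas-server-default-controller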
The kubectl events command confirms the issue is with mounting a directory for the volume sas-viya-gelcorp-volume. Notice the sas-readiness POD is also mentioned. If we look at the readiness POD log, we can see that the services that are not ready all depend on a CAS server.
kubectl logs -l app=sas-readiness | gel_log
Another advanced but useful command we could use to further our troubleshooting steps is kubectl exec. The exec command lets you start a shell session inside a container. Once inside the container, you can perform tasks like inspecting the container's file system, checking the state of the environment, and performing advanced debugging steps. Accessing the container in a shell window can be useful when logs alone don't provide enough information. In our problem with accessing the NFS server, we could debug further by opening a shell in the POD with kubectl exec and running the mount command to collect more information on the failure.
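For example, to open a shell in the CAS controller POD and inspect the mounts, something like this would work (kubectl picks the default container unless you add -c):

# Open an interactive shell inside the CAS controller POD
kubectl exec -it sas-cas-server-default-controller -- bash
# Once inside the container, check what is (or is not) mounted
mount | grep nfs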
Wrap Up
That is a high-level review of some useful troubleshooting steps in SAS Viya 4. In this post, we have focused on using the kubectl to collect information. Of course, you could also use SAS Viya Monitoring for Kubernetes, an open-source solution provided by SAS that supports issue detection, alerting, and investigation. It includes a logging area where users can search, filter, chart, and display log messages and a monitoring and alerting area to collect, display, and analyze metric data and manage alerts.
If you want to test your troubleshooting skills, check out the last chapter of the GEL SAS Viya Administration class. Finally, the SAS Viya documentation has some great content about troubleshooting; I have included links below.
Documentation: SAS Viya Platform: Troubleshooting
Documentation: Readiness Service
Documentation: Logging
Feel free to add any tips and tricks you have found useful in this post's comments.
Find more articles from SAS Global Enablement and Learning here.