
Viya not ready and Kubernetes complains about insufficient resources, but my nodes are idle!


Hi there !

 

I experienced something very interesting this morning with a Viya 4 deployment (something I already knew about, but had never seen in such a visible way).

 

You might already have seen this in your own Viya environment, but I think it is key to understanding how Kubernetes manages resource requests.

 

So, I thought I'd write a very short blog (for once 😊) to share!

 

The situation

 

Yesterday, I was playing with the Deployment Operator for stable 2021.2.1 and started deploying two environments/namespaces in my cluster (the first one from a Custom Resource file with the "user-content" stored in an external GitLab, the second one with an "inline" CR).

 

But this morning I noticed that the second environment was not up yet 😟 (although it had the whole night to complete).

 

rp_1_not_ready.png

 

After some investigation, it turned out that the culprit was the sas-crunchy-postgres pod, which remained in a Pending state (even after several restart attempts 🙂).

 

rp_2_pending.png

 

In the crunchy pod's event log, there was also a message mentioning "Insufficient cpu".
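If you prefer the command line to Lens, you can pull the same events with kubectl (the pod name below is a placeholder, use whatever kubectl get pods returns in your namespace):

  kubectl -n <namespace> describe pod <sas-crunchy-data-postgres-pod>
  kubectl -n <namespace> get events --field-selector reason=FailedScheduling

A typical FailedScheduling event reads something like "0/5 nodes are available: 1 node(s) had taint that the pod didn't tolerate, 4 Insufficient cpu." (the exact wording and numbers will of course differ in your cluster).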

 

rp_3_insufficient_cpu.png

 

However, when I looked at the CPU usage (in Lens or by running the kubectl top nodes command), none of my Kubernetes nodes seemed particularly busy on the CPU side…
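For reference, the command is simply (it requires the metrics-server to be installed in the cluster, and it reports actual usage, not requests):

  kubectl top nodes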

 

rp_4_nodes-not-busy.png

 

There seems to be plenty of room for more CPU utilization, wouldn't you agree?

 

I thought: "OK, maybe the sas-crunchy pod wants a lot of CPU?"

 

So, I checked the resource requests of the sas-crunchy-postgres pod's containers.

 

rp_5_sum-cpu-requests.png

 

But again, that did not seem to be the issue... the pod's individual container requests are pretty low, and the sum of all the containers' CPU requests in the pod is less than 0.2 core.
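If you want to check this without Lens, a small jsonpath query lists the CPU request of every container in the pod (the pod name is again a placeholder):

  kubectl -n <namespace> get pod <sas-crunchy-data-postgres-pod> \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests.cpu}{"\n"}{end}'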

 

So why does Kubernetes keep my sas-crunchy-data-postgres pod pending forever and complain about insufficient CPU????

...

 

....

The answer

 

rp_6_when-i-finally-understood.gif

 

 

Well... the answer appears in the aggregated "cpu" requests reported in the nodes' "Allocated resources" table… when I saw it, I finally understood what my issue was...

 

 

 

rp_7_theanswer.png

 

Look at the CPU requests on the nodes (the command to get a node's allocated resources is simply kubectl describe node <nodename>).
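The interesting part is the "Allocated resources" section near the end of the kubectl describe node output. It looks roughly like this (the numbers below are illustrative, not my actual values):

  Allocated resources:
    (Total limits may be over 100 percent, i.e., overcommitted.)
    Resource           Requests         Limits
    --------           --------         ------
    cpu                7910m (99%)      25 (320%)
    memory             28Gi (45%)       96Gi (150%)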

 

Their values are near or equal to 100% for all the nodes that could accept the new pod.

 

intnode04 is tainted to only accept CAS pods, and the four other nodes (intnode01, 02, 03 and 05) are full in terms of CPU requests… that explains why my sas-crunchy-postgres pod cannot be scheduled by Kubernetes and stays stuck in a PENDING state.

 

If you are using Lens and have deployed the monitoring tool, you should also be able to see that there is a general "undersizing" issue, with the sum of requested CPU almost matching the allocatable CPU capacity of the cluster.
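If you don't have Lens, a quick loop over the nodes gives the same overall picture from the command line:

  for n in $(kubectl get nodes -o name); do
    echo "== $n"
    kubectl describe "$n" | grep -A 6 "Allocated resources:"
  done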

 

rp_8_allnodes-cpurequests.png

 

 

 

 

 

The total of all the CPU requests barely fits in the "allocatable" cluster capacity. We can also notice that much more CPU is requested than actually used (this could of course change as soon as users start working with the Viya products).

 

As soon as I deleted the first namespace (removing the first Viya deployment), the sas-crunchy-postgres pod magically moved from PENDING to RUNNING to READY.

 

So, lesson learned for me: 40 cores is clearly not enough to deploy two Viya environments with a SAS Visual Machine Learning order.

 

 

 

 

Conclusion

So, even if my nodes' CPUs were clearly not busy, Kubernetes was right! The "4 Insufficient cpu" message now makes sense (in terms of resource allocation): four of the nodes did not have enough unreserved CPU left to honor the pod's requests.

 

Just like good old YARN, Kubernetes simply keeps count of all the container resource requests on each node and decides whether to accept or reject new pods based on that.

 

So please keep in mind that the Kubernetes scheduler does not make its decisions based on actual real-time CPU or memory utilization.
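To make that concrete, here is a generic (not Viya-specific) example of the resources block the scheduler looks at; only the requests part counts for pod placement, while the limits are enforced at runtime:

  apiVersion: v1
  kind: Pod
  metadata:
    name: request-demo
  spec:
    containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: 500m        # the scheduler reserves 0.5 core on the chosen node
          memory: 256Mi
        limits:
          cpu: "1"         # runtime ceiling, not used for scheduling decisions
          memory: 512Mi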

 

This interesting troubleshooting adventure reminds us that Kubernetes is really a "reservation system".

 

Container resource requests really matter: if they are too far from the expected or observed real usage, we can end up with substantial underutilization of the infrastructure capacity… It is important to keep this in mind, for example when defining custom resource requests for the CAS and Compute servers.
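As a purely illustrative sketch of that last point (the patch name, target and values below are hypothetical; the SAS-provided examples in sas-bases remain the supported way to size CAS and Compute), this is the kind of kustomize patch that adjusts a container's CPU request:

  apiVersion: builtin
  kind: PatchTransformer
  metadata:
    name: adjust-cpu-requests        # hypothetical patch name
  patch: |-
    - op: replace
      path: /spec/template/spec/containers/0/resources/requests/cpu
      value: 2                       # request 2 cores instead of the default
  target:
    kind: Deployment
    name: some-deployment            # hypothetical target deployment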

 

 

 

 

Comments

Thank you for this article, it is a great explanation, particularly the comparison between Lens and the Kubernetes commands.

