Nowadays, standing up a fresh new Viya environment is quite easy to do (especially with all the automated tools such as the SAS-provided "Infrastructure as Code" and "Deployment as Code" repositories).
What is more complicated, though, is to keep it in a healthy and performant state for weeks or months. This is because of factors like: different types and numbers of users, varying workloads, and increasing amounts of data to be loaded and processed.
However, these are really the kinds of things that our customers and partners will have to do if they are to gain value from SAS technology on a daily basis! They cannot just throw away their environment when there is a problem and restart a new one from scratch 😊 They will need a long-term strategy to keep their environment up and running with happy SAS users!
As more and more long-term internal demo environments have been built here and there (sometimes on-premises, sometimes in the Cloud), and as our team also started to build and maintain longer-term "shared" collections, we've learned a few lessons in this area that we want to share.
There are various challenges with these Viya LTEs ("Long Term Environments" as I call them), such as: dealing with frequent updates to remain supported, scaling the environment up/down to match the workload and optimize the Cloud costs, defining the best backup strategy, etc.
BUT one of the major concerns that surfaced is backend storage management. For example, in our ad-hoc "short-lived" collections we never ask ourselves how to deal with situations where the NFS or local file systems fill up – but that's just because our collections don't stay around long enough 😊
In the LTEs, running out of space has become a very common issue, and it can lead to significant outages with broken and unusable Viya environments.
So, here we'll discuss and explore those kinds of very "real-life" scenarios that will very likely happen at customer sites after a few weeks or months of a Viya environment's life.
This blog is broken down into two parts.
In this first part, we will review the storage locations in a Viya environment that often are the first ones to cause disk space issues. We'll explain why (and see what can be done to prevent issues).
In the second part of the blog, we will see what to do either when you detect that some storage areas are nearly full or when you want to extend the space capacity of your environment (either because it was initially undersized or, for example, to accommodate more data and/or users).
In a Viya environment, there are several storage locations that are bound to grow.
For some of them, the growth depends on the business activities and requirements, and we can assume that due diligence has been applied during the Architecture phase to size their capacity appropriately, based on the users' data and expected growth.
But for others, the disk space growth is mostly driven by the standard behavior of the software applications and components.
Let's review some of them (it's not an exhaustive list!) and see how to prevent them from eventually breaking the environment.
One of the most frequent causes of the "disk full" error that we have seen in Viya environments (where the default OOB internal PostgreSQL, aka "SAS Crunchy", was used) is the fast-growing PostgreSQL WAL archives.
As explained by a SAS R&D colleague working on the PostgreSQL integration: “It is natural that, as time goes, the database size grows, and the underlying storage needs to be resized. However, the growth rate of WAL archive is MUCH MUCH faster than the database size growth because WAL archive is basically the flush of all the modified pages. WAL archive needs to be periodically truncated which is done by taking backups.”
If you are affected by this problem you can run the example commands below to list and estimate the overall size of these WAL archives:
# set the namespace
NS=gelenv-stable
# get the PGBACKREST pod name
PGBACKREST=$(kubectl -n $NS get pod -l "name=sas-crunchy-data-postgres-backrest-shared-repo" --no-headers=true -o name)
echo $PGBACKREST
# see all the archive files
kubectl -n $NS exec $PGBACKREST -- ls -l /backrestrepo/sas-crunchy-data-postgres-backrest-shared-repo/archive/db/12-1
# get cumulated size of all the archive files
kubectl -n $NS exec $PGBACKREST -- du -sh /backrestrepo/sas-crunchy-data-postgres-backrest-shared-repo/archive/db/12-1
Usually, the underlying storage space is shared between the "pgbackrest" repository (for archives and backups) and the PostgreSQL nodes (for user data). So if the "disk full" error happens, it causes all kinds of issues, and the PostgreSQL cluster gets into a state that is hard to recover from without data loss.
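To get a quick feel for how full that shared volume is, you can reuse the $NS and $PGBACKREST variables defined above and run a simple df on the repository mount point (a quick check only, using the same path as in the commands above):
# check space used vs. available on the volume backing the pgBackRest repository
kubectl -n $NS exec $PGBACKREST -- df -h /backrestrepo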
This problem can be prevented by scheduling a "pgbackrest" backup with a retention option.
It is documented in SAS’s Viya 4 operations guide (see the PostgreSQL Server Maintenance section).
While the current process to implement the schedule is a bit complex (you need to download and configure a client utility called "pgo", the PostgreSQL Operator client, then forward the PostgreSQL Operator port to the machine where you run the "pgo" client tool, and finally you can submit the command that schedules the archive clean-up task in your PostgreSQL environment), it exists, and there are plans to have it implemented automatically out of the box in future Viya versions.
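To make this more concrete, here is a rough sketch of what those steps can look like once the "pgo" client is configured. Treat it as an illustration only: the service and cluster names below are assumptions based on the default internal "SAS Crunchy" deployment, and the authoritative step-by-step procedure is the one in the SAS Viya operations guide.
# 1. forward the PostgreSQL Operator API port to the machine running the "pgo" client
#    (the service name is an assumption - check your namespace for the actual operator service)
kubectl -n $NS port-forward svc/sas-crunchy-data-postgres-operator 8443:8443 &
# 2. schedule a weekly full pgBackRest backup (Sundays at 01:00 UTC); taking backups is
#    what allows the WAL archives to be truncated
pgo create schedule sas-crunchy-data-postgres --schedule="0 1 * * 0" --schedule-type=pgbackrest --pgbackrest-backup-type=full -n $NS
# 3. verify that the schedule has been registered
pgo show schedule sas-crunchy-data-postgres -n $NS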
A final note regarding the scheduling times for the PostgreSQL backup: the time in the pods is always in UTC, so be careful and take that into account!
For example, if your server is configured to run on Eastern Time and your schedule looks like it will run every Sunday at 1am ("0 1 * * 0"), it will actually run on Saturday evening at 9pm (because of the time difference with UTC).
In addition, something like 1am every Sunday is quite a common choice for backups. That is fine, but just make sure this timing does not conflict with any other scheduled activities (such as a scheduled reboot of the nodes, the sas-viya backup, or user batch jobs).
Finally, keep in mind that when a customer experiences such "Disk Full Error" issues with PostgreSQL, they should immediately contact SAS Tech Support, because things can become a lot more complicated when the wrong actions are taken.
The Viya backup's location is another area to monitor, especially if the content is growing, because it means the backup size will also grow!
Backup files are written to the Kubernetes persistent volumes sas-cas-backup-data and sas-common-backup-data, with default volume claims of 8GB and 25GB respectively.
Note that with these default values, you could quickly exceed the default claim capacity (the CAS tables created as part of a standard deployment already take almost 3GB). So it might be a good idea (depending on the expected size of the CAS data stored in the PV) to change the default PVC capacity values.
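A quick way to see the capacity currently requested (and bound) for these backup volumes is to query the PVCs directly, reusing the $NS variable from before:
# show the capacity and status of the backup persistent volume claims
kubectl -n $NS get pvc | grep backup-data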
The good news, though, is that (unlike with the PostgreSQL automatic archiving) there is an automated "purge" job deployed out of the box.
The "Backup" and "Backups purge" tasks are implemented as Kubernetes cronjobs and by default:
So, by default, we will usually see a maximum of 3 sets of backup files, but customers can tune the cronjobs as they want to meet their backup strategy requirements.
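You can list these cronjobs (and their cron schedules) directly from the namespace to see what is currently configured:
# list the backup-related cronjobs and their schedules
kubectl -n $NS get cronjobs | grep -i backup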
From what we've seen in long-term environments, the CAS tables will usually take most of the backup space. For example, if you have about 20GB of CAS tables stored in your CAS persistent volumes, you can expect to use about 60GB of disk space with the default backup cronjob settings.
You might have installed the "SAS Viya Monitoring for Kubernetes" project in your Viya environment, which is great because it helps you monitor the cluster resources utilization and availability. However, the third-party components (Prometheus, Grafana, Fluent Bit, Elasticsearch, Kibana) that support the monitoring and logging functions also need and consume significant amounts of disk space...
By default, the "SAS Viya Monitoring for Kubernetes" is installed with its own Storage Class and dedicated PVC, but it is still important to estimate and configure the retention and size of the volumes associated to the various components.
For example, Prometheus, which is the main Viya monitoring component, is configured by default to keep 7 days of history, up to 20GiB of disk. However, it is something that you can easily customize by changing the default values in the user environment configuration file.
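If you are not sure what is currently in effect, you can read the retention settings straight from the Prometheus custom resource and check the associated volumes. This is a quick check only; it assumes the monitoring stack was deployed in a namespace called "monitoring" (the project's default), so adjust as needed:
# current retention settings of the Prometheus instance
kubectl -n monitoring get prometheus -o jsonpath='{.items[0].spec.retention} {.items[0].spec.retentionSize}{"\n"}'
# size of the persistent volumes used by the monitoring components
kubectl -n monitoring get pvc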
On the other hand, the indexes created by Elasticsearch for the Viya logging (explored through Kibana) can also quickly use a significant amount of space. By default, the retention period is 3 days for the SAS Viya logs indexes, but it can be changed. Refer to this page to see how to set the log retention policy.
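Here again, a quick look at the PVCs gives you an idea of how much space the Elasticsearch data nodes are using (assuming the logging stack was deployed in a namespace called "logging", the project's default):
# size of the persistent volumes used by the logging components (Elasticsearch data, etc.)
kubectl -n logging get pvc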
Something else that will consume disk space are the SASWORK files (created by the SAS Compute Servers). They are temporary files that are created either implicitly or explicitly by the end users, when running SAS programs or when using a software component that runs SAS programs behind the scenes (like SAS Model Manager when you run an analytics pipeline).
The lifetime of these files is driven by the SAS Compute session duration. As soon as you cleanly close your Compute session (for example by signing out from "SAS Studio"), they are automatically deleted.
In addition, at the Kubernetes level, the SASWORK files are directed to ephemeral storage by default (an emptyDir volume is used by default in the SAS Compute pod). So when the pod's life ends, the associated files in emptyDir are removed automatically.
However, when SASWORK is configured to point to persistent storage (like nfs or hostPath), there can be many situations where the SASWORK files remain…for example when the launched Compute pod crashed (maybe it exceeded its memory resource limit and was removed by the kubelet) or was manually removed.
It can eventually result in filling up the non-ephemeral storage space with orphaned directories, and it becomes another burden on the Viya administrator to monitor and manage the non-ephemeral space used by temporary SAS files.
This issue has been reported, and while "ad-hoc clean-up tools" have been implemented in the field by consultants, work is in progress to provide an official "cleanwork" utility in the next Viya versions.
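As an illustration of what such an ad-hoc clean-up could look like, the sketch below lists SAS Work directories that have not been modified for 2 days. The mount path and the SAS_work* naming pattern are assumptions: adjust them to your own SASWORK configuration, and make sure no active Compute session still owns a directory before deleting it.
# list candidate orphaned SAS Work directories older than 2 days
# (the /sasviyawork path and the SAS_work* pattern are assumptions - adjust to your setup)
find /sasviyawork -maxdepth 1 -type d -name "SAS_work*" -mtime +2 -exec du -sh {} \;
# once confirmed as orphaned, they could be removed with a similar find command ending in -exec rm -rf {} \;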
Finally, something else that can take a lot of space in a Kubernetes environment is the accumulation of container images.
The container images are pulled just before the containers are started, and they are stored locally on each Kubernetes node.
As an example, in our GEL shared environment, since we had the NFS server also located on node01 (of our 11-node collection), we were regularly running out of space because the NFS PVs (common backup, cas backup, cas-data, etc…) and the docker images were fighting for the disk space on the local node (which was quickly filled up) 🙁
To quickly release disk space, it could be tempting to run some "docker prune" commands to get rid of unnecessary docker files. However, this is risky and strongly discouraged, as by doing that you would "bypass" the kubelet agent running on the node, which is supposed to manage the container images in a Kubernetes cluster.
Also, by default Kubernetes comes with an integrated garbage collector that automatically cleans up unused container images after a certain threshold of disk utilization has been reached in /var/lib/kubelet.
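If you suspect that a node is getting close to that threshold, the node conditions will tell you: the kubelet reports "DiskPressure" when it starts running out of space (node01 below is just the example node name from our environment):
# check whether any node is reporting disk pressure
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'
# or look at a specific node in detail
kubectl describe node node01 | grep -i -A 2 "DiskPressure"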
The lesson learned here is that the local container image space utilization on the Kubernetes nodes is NOT something you should have to take care of, provided that: 1) you have provisioned enough space to hold the Viya images (see the hardware requirements), and 2) you are not using the local file system of the Kubernetes nodes to store anything other than the default Kubernetes internal files (which is the recommendation).
So, as you've seen in this blog, there are various storage areas to watch out for in an "LTE" to avoid facing a situation where your Viya environment would not be "healthy" anymore.
However, even if you carefully planned the various storage location capacities, you could always face an unexpected increase of the disk space used by the software for various reasons.
Also, even if the space consumption behavior is normal but steadily increasing, you will need to know how to expand the available space for it.
These are the two points we discuss in Part 2: Purge and expand.
Thanks for reading !
Find more articles from SAS Global Enablement and Learning here.