
Take care of your Viya storage before it takes care of you – Part 2: purge and expand


In the first part of this blog, we described the main storage areas of the Viya platform that are bound to grow, so that you can implement proper retention settings and automated clean-up/purge processes.

 

But even if you carefully planned the various storage location capacities, you can still face an unexpected increase in the disk space used by the software, for various reasons.

 

Also, even if the space consumption is normal but steadily increasing, you will need to know how to expand the available space.

 

If you detect disk pressure on one of your storage devices (whether it is an NFS server, Azure Disks, Azure Files or Google Filestore), there are basically two things that you can do: purge unnecessary files and/or expand your storage capacity.

 

First, we'll discuss the “storage expansion” possibilities in Kubernetes, then we will look at some “real life” examples of manual clean-ups and “last minute” storage size extensions.

 

File System expansion with Kubernetes

 

With Cloud Storage

 

As explained in this Kubernetes blog post: "Block storage volume types such as GCE-PD, AWS-EBS, Azure Disk typically require a file system expansion before the additional space of an expanded volume is usable by pods. Kubernetes takes care of this automatically whenever the pod(s) referencing your volume are restarted.

 

Network attached file systems (like Glusterfs and Azure File) can be expanded without having to restart the referencing Pod, because these systems do not require special file system expansion."

 

This means that if your Viya PVCs use Cloud "block storage", such as Cloud managed disks (typically used for our RWO PVs), you can expand the underlying managed disk size, but you will have to restart the pod(s) using it so that they take the new size into account.

On the other hand, when using Cloud file storage services (such as Azure Files), restarting the pod should not be required, and the available space can be extended just by changing the PVC storage value.

 

While the considerations above are generally true, it is always advised to look at your own specific Storage Class implementation to know what happens when the PVC claimed size is exceeded.
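
For example, a quick way to check whether your storage classes support volume expansion at all is to look at the allowVolumeExpansion field (a minimal sketch; the class names will differ in your cluster):

# List the storage classes and whether they allow volume expansion
kubectl get storageclass -o custom-columns=NAME:.metadata.name,PROVISIONER:.provisioner,ALLOWEXPANSION:.allowVolumeExpansion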

 

With an NFS Server

 

There is one important difference when you use an NFS server as the backend of your PersistentVolumeClaims (PVCs): the claimed size is not enforced as a limit.

 

For example, even though you have a PVC with an 8GB capacity for “sas-cas-backup-data”, the files stored in the underlying NFS share directory can use much more space than this without showing any issue or triggering a failure of the viya-scheduled-backup pod.

 

It is quite easy to see this difference: just run a "df -h" command inside the CAS container.
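
For instance, something like the command below (the pod and container names shown are the usual defaults, so adjust them to your deployment):

# Run df -h inside the CAS controller container
kubectl -n <viya namespace> exec sas-cas-server-default-controller -c cas -- df -h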

 

  • When using "Azure Files-based" storage class for all the PVCs:

 

[Image: df -h output when using the Azure Files-based storage classes]


 

You can see that, for each mounted file system that corresponds to one of the PVs (sas-cas-backup-data, cas-default-permstore, cas-default-data, sas-commonfiles and sas-quality-knowledge-base), the available space in the Azure file share matches the capacity claimed by the PVC.

 

  • Now, when using an "NFS Server-based" storage class for all the PVCs:

[Image: df -h output when using the NFS server-based storage class]

 

For each mounted FS that corresponds to the PVs, the available space is the same and corresponds to the total size available in the NFS server share.

However, the disk capacity of your NFS server is NOT infinite.

 

So, when the NFS share is full it's "game over": any pod needing more space (no matter which Persistent Volume it is associated with) will fail, and it will be the beginning of the end for your environment 😊.

 

If you have been using the NFS server for all your Viya PVCs and there is no more space left in the NFS backend server, then the whole environment will be impacted (instead of only a specific set of pods relying on a specific PVC).
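
If you want to see which PVC directories are actually consuming the NFS share, a quick check from the NFS server VM itself can help. This is only a sketch: the export path below is an assumption based on the IaC defaults, so adjust it to your own NFS export location.

# On the NFS server VM, show the largest PV directories in the export
sudo du -sh /export/* | sort -rh | head -20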

 

So how do we increase the size of the NFS server? Well, it depends on the kind of NFS server and on the underlying disk capacity…but we'll see an example in the next section, for the case where the NFS server has been automatically provisioned by the IaC for Azure tool.

 

 

Some real life experiences

[Image: file system full error message]

 

Disclaimer: In the next sections, I'll share some of our experience with storage issues and how we were able to resolve them (or not).

 

Please do not consider them as official instructions; they might work in your case, but they also might not. Ideally you will have anticipated and planned your storage management strategy upfront (see Part 1 of this series) and you will not find yourself in those situations…but you never know 😊.

 

 

Manual clean-up of the Postgres archive files

 

If you haven't implemented the automatic postgres archive cleanup scheduled job (as explained in the first post), you might discover, one morning, that the sas-crunchy PostgreSQL server is broken (which means that your Viya environment is broken…) and, after some investigation, understand that it's because your underlying storage has been completely filled by the postgres archive files.

 

For the cleanup of the Postgres archives, you must be careful since the old archives can only be purged AFTER a full postgres backup has been executed. So, if you are already very short on storage, the solution can be worse than the problem… That's why it is much better to anticipate this problem with a scheduled cronjob (as explained in the first part of the post).

 

You can find the steps to “expire pgbackrest backups and WAL archives to release more space” in the SAS Administration guide.

Here is an example of how to do it.

 

Again, be careful because you need to have enough remaining space to run the first backup (and it can take a while), so use these instructions at your own risk!

 

NS=<viya namespace>

# Get the postgres-backrest-shared-repo pod name
PGBACKREST=$(kubectl -n $NS get pod -l "name=sas-crunchy-data-postgres-backrest-shared-repo" --no-headers=true -o name)
echo "pgbr pod: $PGBACKREST"

# Show the postgres backup info and the current space usage
kubectl -n $NS exec $PGBACKREST -- pgbackrest info
kubectl -n $NS exec $PGBACKREST -- df -h | grep "/backrestrepo"

# Run a full backup. BE CAREFUL: MAKE SURE YOU HAVE ENOUGH REMAINING SPACE.
# IMPORTANT: THE ARCHIVE CLEANUP ONLY HAPPENS IF YOU RUN A FULL BACKUP AND THEN EXPIRE.
kubectl -n $NS exec $PGBACKREST -- pgbackrest backup --type full

# Expire the old backups/archives and check the reclaimed space
kubectl -n $NS exec $PGBACKREST -- pgbackrest expire --repo1-retention-full=1
kubectl -n $NS exec $PGBACKREST -- df -h | grep "/backrestrepo"

 

Important: after several issues with PostgreSQL disk space utilization, detailed steps to better maintain the PostgreSQL cluster have recently been added to the SAS documentation. It is strongly recommended that all customers implement a scheduled backup with a retention policy to avoid unexpected disk-full errors.

 

 

Manual clean-up of the Prometheus files

 

Here is an example of instructions that you can follow in Lens to release some of the space used (courtesy of my colleague @RobCollum).

 

  • In Lens, connect to the Viya cluster
  • Under Custom Resource Definitions, expand monitoring.coreos.com
  • Edit the Prometheus custom resource
  • Find retention.time=7d and change it to 2d
  • Find --storage.tsdb.retention.size=20GiB and change it to 5GiB
  • Save and close
  • The pod prometheus-v4m-prometheus-0 will automatically restart
  • Validate by confirming those parameters have the new values in the StatefulSet prometheus-v4m-prometheus
  • Validate by confirming those parameters have the new values in https://prometheus.HOSTNAME.race.sas.com/flags

 

The key thing here is to change the "retention" specifications of the Prometheus operator CRD (changing from 7 to 2 days of retention significantly reduces the disk space utilization...but also the monitoring history).

It can be done with just two commands:

 

kubectl -n v4mmon patch prometheus v4m-prometheus --type merge --patch '{ "spec": { "retention": "2d" }}'
kubectl -n v4mmon patch prometheus v4m-prometheus --type merge --patch '{ "spec": { "retentionSize": "5GiB" }}'

 

Thanks to the operator, the change takes place immediately and releases the associated storage space.
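
To confirm that the operator has picked up the new values, you can also read them back from the Prometheus custom resource (same namespace and resource name as in the commands above):

kubectl -n v4mmon get prometheus v4m-prometheus -o jsonpath='{.spec.retention}{" "}{.spec.retentionSize}{"\n"}'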

 

Manual clean-up of the Kibana Elasticsearch indexes

 

If you have deployed the SAS Viya monitoring and logging tools, you might notice at some point that the volumes associated with the Elasticsearch database (behind Kibana) are taking up a lot of space.

 

For example, it happened to us in our GEL shared environments. Usually, a maximum of around 10GB was used by the Elasticsearch persistent volume, but one morning we detected that, in one of our environments, much more space was being used, with a risk of filling the backend NFS server.

 

Fortunately, there is a way to clean up and reclaim some space immediately. You can use the Kibana console to see which indices are using the most space and manually remove them by submitting REST API requests. The detailed instructions to do it can be found in my personal GitHub repo.
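
As a rough illustration only (the service name, port, credentials and index name below are assumptions that depend on your SAS Viya Monitoring for Kubernetes deployment, so refer to the detailed instructions for the exact steps):

# Port-forward to the Elasticsearch client service (name and namespace may differ in your deployment)
kubectl -n logging port-forward svc/v4m-es-client-service 9200:9200 &

# List the indices sorted by size (adjust the credentials)
curl -k -u admin:<password> "https://localhost:9200/_cat/indices?v&s=store.size:desc"

# Delete a specific (old) index - the index name here is just an example
curl -k -u admin:<password> -X DELETE "https://localhost:9200/viya_logs-<namespace>-2022.05.01"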

 

Expand the Azure IaC provided NFS Server based storage

 

A colleague from the Risk Division contacted me a few months ago asking for help, as the long-term Viya environment he had set up in Azure for his teammates was a little too popular and was running out of space.

 

He had used an automated deployment tool that leverages the IaC GitHub project with the “standard” storage option. The “standard” option means that, for the SAS Viya PVCs, he was using a dedicated NFS VM with several Azure Disks mounted and striped in a RAID 5 array to provide the storage space.

 

By default, the IaC “standard” storage option gives you four 128 GB disks in a RAID 5 configuration (which means you'll get around 380GB of usable disk space for your SAS Viya Persistent Volumes). It might seem like a lot…but trust me, if you leave your environment running for a little while, it will quickly be consumed 😊
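
If you want to check that layout on the NFS VM itself, a few generic Linux commands will show you the disks, the array and the usable space (the export mount point is an assumption, so adjust it to your setup):

# On the NFS server VM
lsblk                # list the attached disks and how they are assembled
cat /proc/mdstat     # state of the software RAID array (if mdadm is used)
df -h /export        # usable space on the exported file system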

 

To increase the space available in the IaC provisioned NFS server, we first stopped the Viya environment, then followed the Azure documentation steps to de-allocate the NFS VM and increase the size of the attached Azure disks. The trickiest part was finding the proper commands to release the RAID stack and rebuild it with LVM (Logical Volume Manager).

You can find an example with all the commands we used in this GitHub repo.

Note that this operation is manual and requires an outage of the Viya environment.

Make sure you take all the appropriate backups before running them. You might also need to restart the NFS server and/or re-install the NFS provisioner, to ensure the Viya pods pick up the changes.
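
For the Azure part of the operation, the sequence looked roughly like the sketch below (the resource group, VM and disk names are placeholders, the resize must be repeated for each data disk of the array, and the RAID/LVM rebuild steps are those from the GitHub repo mentioned above):

# Stop and deallocate the NFS VM so its disks can be resized
az vm deallocate --resource-group <resource-group> --name <nfs-vm-name>

# Increase the size of each attached data disk (repeat for every disk in the array)
az disk update --resource-group <resource-group> --name <disk-name> --size-gb 256

# Start the NFS VM again
az vm start --resource-group <resource-group> --name <nfs-vm-name>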

 

 

Expand your Azure Cloud storage

 

It is also possible to expand the size of a PersistentVolume, whether it is backed by an Azure managed disk or an Azure file share. While I haven't done it myself, I think the same is true with the other major Cloud providers (GCP and AWS).

 

During the last SAS Hackathon, my colleague Frederik contacted me because, in one of the environments, the sas-crunchy Postgres server was broken (/pgdata was full) and he wanted to know if I had experience with this problem. After a bit of review and a few tries, we were able to increase the size of the Azure disk and restart Postgres so it could use it.

 

First, we had to scale down the SAS Infrastructure Data Server (so the disk was no longer mounted), then we resized the disk in the Azure Portal UI, changed the associated PVC (Persistent Volume Claim) size accordingly, and then we simply scaled the SAS Infrastructure Data Server back up.

For Azure Files, it is even easier. As explained in the Azure Files CSI documentation, “You can request a larger volume for a PVC. Edit the PVC object and specify a larger size. This change triggers the expansion of the underlying volume that backs the PV.”

 

You can expand the PVC by increasing the spec.resources.requests.storage field.

 

For example, if your backup is failing because there is no more available space in the sas-cas-backup-data volume, you can run the command below to increase the PVC size.

 

kubectl -n gelenv patch pvc sas-cas-backup-data -p '{"spec": {"resources": {"requests": {"storage": "16Gi"}}}}'

 

If you look at the mount points from inside the pod, you can see that the change is instantly reflected in the mounted volume, and you can run the backup again (this time with success 😊).
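
You can also verify that the new size has been taken into account at the PVC level, for example:

# Check the reported capacity of the resized PVC
kubectl -n gelenv get pvc sas-cas-backup-data -o jsonpath='{.status.capacity.storage}{"\n"}'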

 

 

Detect and clean orphaned PVC

 

 

In a Kubernetes environment, you might see “orphaned” Persistent Volume Claims (with their associated Persistent Volumes) that remain even though they are no longer attached to any pod.

 

For example, it can happen when you have an internal instance of PostgreSQL and you update your SAS Viya software. When the SAS Viya software is updated, the PostgreSQL cluster is terminated and re-created, but the PVCs for the previous PostgreSQL nodes are not deleted.

 

If you are using a Network File System (NFS) provisioner and all your PVCs are sharing the same underlying storage, too many orphaned PVCs can lead to a disk-full event and unexpected issues impacting the whole platform (for the reasons explained above).

 

SAS published an official SAS note with instructions to detect and remove the “orphaned” PVCs.
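
The SAS note describes the supported procedure, but as a quick illustration you can compare the list of existing PVCs with the list of PVCs actually referenced by pods; any claim present in the first list but missing from the second is an “orphan” candidate:

# All PVCs in the namespace
kubectl -n <viya namespace> get pvc --no-headers -o custom-columns=:.metadata.name | sort

# PVCs referenced by the pods currently defined in the namespace
kubectl -n <viya namespace> get pods -o jsonpath='{.items[*].spec.volumes[*].persistentVolumeClaim.claimName}' | tr ' ' '\n' | sort -u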

 

Conclusion

 

As you probably know, the GEL team provides a lot of ad-hoc collections to support our various workshops (deployment, administration, data management, etc.). We use an internally developed tool to ensure that, during the bootstrap of the collection's machines, everything (Docker, Kubernetes, monitoring, SAS Viya, etc.) gets automatically installed.

 

Those collections generally have a lifetime of a few days (sometimes a whole week), which is great for our purpose (providing ad-hoc, pre-installed Viya learning environments), and if there is a problem with a Viya environment we can always start a new one from scratch.

 

But what we generally don't see in those collections are all the challenges of keeping them working well for the longer term, fixing them and maintaining them (whatever happens in the environment).

 

And in real life, at a customer site, restarting from scratch with a fresh Kubernetes cluster will not be an option. That's why it is very important to understand and know how to address those long-term environment challenges.

 

What we discussed in this series might look like a pure administration/maintenance topic; however, it ties back to the architecture design: within the D30 document we can remind customers that some of the storage can be increased if the size exceeds predicted volumes. We can then point them to detailed steps on how this can be done. That way, we can give customers the confidence to build Viya environments and focus on using them.

 

I hope you've learned a few things here, thanks for reading!

 

Find more articles from SAS Global Enablement and Learning here.
