In SAS Viya Monitoring for Kubernetes, metric data is gathered by Prometheus and displayed in Grafana. As this blog post explains, Prometheus keeps metric data until either the amount of data exceeds a specified retention size, at which point it begins deleting the oldest data to make way for new metric data, or the metric data becomes older than the specified retention time. If the retention time limit is reached first, metric data older than that limit is deleted even if the total size of the retained data is smaller than the retention size.
In this post, we'll see how to calculate how much space is being used to store Prometheus metric data in your cluster, and how to change both the retention period and retention size.
Incidentally, my colleague Raphaël Poumarede (@RPoumarede) has written here recently about managing many aspects of storage, in his posts Take care of your Viya storage before it takes care of you – Part 1: Planning ahead and anticipating and Take care of your Viya storage before it takes care of you – Part 2: purge and expand. They are fantastic posts - do read them!
The method for seeing how much storage is available and how much is actually being used depends on the type of Storage Class used for your Kubernetes Persistent Volume Claims (PVCs). In our internal GEL workshop environments we use a simple NFS share, but in a production environment you should use something better, like Azure Files or S3.
Note: NFS storage is not ideal in a cloud environment, and we don't recommend it for production deployments. However, it is easy (and free) to use in our RACE-based classroom and lab environments, and it is therefore the only persistent Storage Class currently available in them.
In our workshop environments, the default (and only available) storage class is nfs-client. The files written to PVCs are ultimately written to a directory structure in the filesystem on the sasnode01 host in each collection, since that is where the cluster's NFS server runs. From a shell on the sasnode01 host, we can browse that filesystem and find the data under /srv/nfs/kubedata. This is highly implementation-specific. There is little chance a customer deployment would be set up like this. Talk to your architect or Kubernetes administrator, and they may be able to suggest something similar to the following that makes sense in your environment.
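If you are not sure where your NFS-backed Persistent Volumes actually land on disk, the PersistentVolume objects themselves usually record the NFS server and export path. Here is a minimal sketch of how you might list them; it assumes your PVs are plain NFS volumes (other volume types will simply show empty fields):

kubectl get pv -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nfs.server}{"\t"}{.spec.nfs.path}{"\n"}{end}'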
You can find the size of the PVCs in the monitoring namespace using kubectl or Lens. In kubectl, try something like this (where monitoring is the namespace where the monitoring components of SAS Viya Monitoring for Kubernetes are deployed):
kubectl get pvc -n monitoring
Look for the size of the PVC for Prometheus. In our workshop deployments, the monitoring namespace is v4mmon, so here's what that looks like (the CAPACITY column shows each PVC's size):
[cloud-user@rext03-0272 ~]$ kubectl get pvc -n v4mmon
NAME                                                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alertmanager-v4m-alertmanager-db-alertmanager-v4m-alertmanager-0   Bound    pvc-5dfcb9bf-984d-4d19-911d-a573cd6390b0   10Gi       RWO            nfs-client     17h
prometheus-v4m-prometheus-db-prometheus-v4m-prometheus-0           Bound    pvc-5538cdf0-a591-462c-bfb1-dc7d5b37b12c   25Gi       RWO            nfs-client     17h
v4m-grafana                                                        Bound    pvc-512d42ac-ae98-4813-a8d0-c377f0fb3738   5Gi        RWO            nfs-client     17h
From this we can see that the size of the Prometheus PVC is 25Gi. The same information is also visible in Lens.
Either of these ways of seeing the PVC size should work in any environment, provided you have kubectl or Lens, a kubeconfig file that lets you access the cluster, and you know the name of your monitoring namespace.
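If you only need the capacity of the Prometheus PVC, a jsonpath query can pull out just that value. This is a hedged one-liner, assuming our workshop namespace v4mmon and the PVC name shown in the output above:

kubectl -n v4mmon get pvc prometheus-v4m-prometheus-db-prometheus-v4m-prometheus-0 -o jsonpath='{.status.capacity.storage}{"\n"}'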
So we know how big the PVC is. But how much of it is actually being used, and once the system has been running for as long as the metric data retention period (which we will come to in a moment), how much of that space is likely to be in use?
This is more implementation-specific. Since we are using NFS for our PVCs, and as explained above the data ends up on sasnode01 under /srv/nfs/kubedata, this command will show how big the data in the Prometheus PVC actually is:
for f in /srv/nfs/kubedata/v4mmon-prometheus* ; do sudo du -xsh $f; done
Substitute the path to the kubedata directory in your NFS shared volume in place of /srv/nfs above. Here is some example output from a very lightly-used workshop environment:
[cloud-user@rext03-0272 monitoring]$ for f in /srv/nfs/kubedata/v4mmon-prometheus* ; do sudo du -xsh $f; done
1.9G    /srv/nfs/kubedata/v4mmon-prometheus-v4m-prometheus-db-prometheus-v4m-prometheus-0-pvc-5538cdf0-a591-462c-bfb1-dc7d5b37b12c
So in this environment, roughly 18 hours after it started up, there is currently 1.9Gi of metric data in a PVC with a nominal capacity of 25Gi - something like 8% used.
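If you cannot browse the NFS server's filesystem, a less implementation-specific alternative is to check disk usage from inside the Prometheus container itself. This is only a sketch: it assumes the default pod name prometheus-v4m-prometheus-0, the container name prometheus and the usual /prometheus data mount point, any of which could differ in your deployment:

kubectl -n v4mmon exec prometheus-v4m-prometheus-0 -c prometheus -- df -h /prometheus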
The rate at which metric data is collected is fairly constant over time, so we can extrapolate that after 7 days we would have roughly (2 / 17) * 24 * 7 ≈ 20 Gi of data, and the PVC might reach about 80% full.
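If you want to repeat that back-of-the-envelope projection with your own numbers, a one-line awk calculation does the arithmetic (the figures here are just the example values from above):

# (GiB used so far / hours running) * 24 hours * 7 days
awk 'BEGIN { printf "Projected 7-day usage: %.1f Gi\n", (1.9 / 17) * 24 * 7 }'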
However, as Raphaël explains in his post, when you use an NFS server as the backend of your PersistentVolumeClaims (PVCs), the claimed size is not enforced as a limit. So if the stored metric data were to grow beyond 25Gi, Kubernetes would do nothing to prevent it from growing until it fills the disk. This is one of several reasons why it is not optimal to use a single shared NFS mount for all our PVCs, or really to use NFS at all!
Fortunately, Prometheus has a feature called retention size that limits how much metric data is kept; in SAS Viya Monitoring for Kubernetes deployments it defaults to a maximum of 20GiB, and the limit can easily be changed. This means that even when it is not practical to change the Prometheus PVC size, you can still control quite effectively how much storage space Prometheus uses.
To see the current metric data retention period and retention size, follow these steps:
1. Point your web browser at the Prometheus web interface. In our workshop environments, the URL is https://prometheus.ingress_controller_hostname/, where ingress_controller_hostname is the full hostname of the sasnode01 host in our Kubernetes cluster. In other deployments it may be http://ingress_controller_hostname/prometheus, where ingress_controller_hostname is the hostname of your Kubernetes cluster's ingress controller.
2. If you are not sure of the hostname, run kubectl get ingress -n monitoring, where monitoring is the namespace in which the SAS Viya Monitoring for Kubernetes monitoring components are deployed. Look for an ingress named v4m-prometheus; the hostname you need is the one shown for the v4m-prometheus ingress.
3. In the Prometheus web interface, open https://prometheus.ingress_controller_hostname/flags and look for the flags whose names begin with --storage.tsdb.retention.
In our workshop environment, the value of --storage.tsdb.retention.time shown on the flags page is 1w, meaning 1 week, and the value of --storage.tsdb.retention.size is 20GiB. Those are the defaults, but they may have other values in your deployment.
Aside: the flag --storage.tsdb.retention has been deprecated since Prometheus 2.8; the very earliest release of SAS Viya Monitoring for Kubernetes used Prometheus 2.21.0 (or thereabouts - in ops4viya version 0.1.0), so this flag was already deprecated long before SAS Viya Monitoring for Kubernetes was first released.
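Incidentally, if you prefer the command line to the web interface, the same retention settings can be read from the Prometheus custom resource that the operator manages. A minimal sketch, assuming our workshop namespace v4mmon and the resource name v4m-prometheus:

kubectl -n v4mmon get prometheus v4m-prometheus -o jsonpath='{.spec.retention}{" "}{.spec.retentionSize}{"\n"}'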
As Raphaël explains in his blog post (and this is very similar to a process described in the Logging stack's Log_Retention.md), you can quickly change the Prometheus storage retention time and size by running kubectl patch commands something like these, specifying your monitoring namespace in place of our v4mmon:
kubectl -n v4mmon patch prometheus v4m-prometheus --type merge --patch '{ "spec": { "retention": "2d" }}'
kubectl -n v4mmon patch prometheus v4m-prometheus --type merge --patch '{ "spec": { "retentionSize": "5GiB" }}'
Thanks to the Prometheus operator, this change is noticed and the pods are restarted with the new values. However, if you do not also change the retention period and retention size in your ${USER_DIR}/monitoring/user-values-prom-operator.yaml file as described in the next section, they will revert to the values defined in that file if you ever undeploy and redeploy the monitoring stack, for example when you scale your SAS Viya deployment down to save resources when you are not using it.
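If you want to see the operator do its work, you can watch the pods in the monitoring namespace being restarted after you run the patch commands; again, this assumes our workshop namespace v4mmon:

kubectl -n v4mmon get pods -w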
To change the values in a way which will persist across an undeployment and redeployment, follow the process below as well as (or instead of) patching the Prometheus custom resource.
To change the metric data retention period and/or retention size (note: not the PVC storage size) so that the changes persist if you undeploy and redeploy the monitoring stack, follow these steps:

1. These steps assume that you deployed SAS Viya Monitoring for Kubernetes by running monitoring/bin/deploy_monitoring_cluster.sh.
2. You need to know the location of your viya4-monitoring-kubernetes directory, and the location of your local customization files, otherwise known as your USER_DIR.
3. Find both the viya4-monitoring-kubernetes directory and the USER_DIR directory on the host machine from which you are deploying SAS Viya Monitoring for Kubernetes. These tips may help:
- In our workshop environments, we deploy from sasnode01. The viya4-monitoring-kubernetes directory is /home/cloud-user/viya4-monitoring-kubernetes, and the USER_DIR directory is /home/cloud-user/.v4m on sasnode01.
- The USER_DIR location is usually defined by setting an environment variable called USER_DIR and giving it a value of a filesystem path.
- Try running locate user.env on the machine you are using to deploy SAS Viya Monitoring for Kubernetes. Of the many directories this may reveal, one might be your USER_DIR, and you might recognize it when you see it.
4. In your USER_DIR directory, look for a file at ${USER_DIR}/monitoring/user-values-prom-operator.yaml. If you don't see one, copy viya4-monitoring-kubernetes/monitoring/user-values-prom-operator.yaml to ${USER_DIR}/monitoring/user-values-prom-operator.yaml.
5. In the ${USER_DIR}/monitoring/user-values-prom-operator.yaml file, find an uncommented section that looks roughly like this - after previous customizations, the values and perhaps even the structure will differ from this:

prometheus:
  enabled: true
  prometheusSpec:
    externalUrl: http://host.mycluster.example.com:31090
    retention: 7d
    retentionSize: 20GiB
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: myCustomStorageClass
          resources:
            requests:
              storage: 25Gi
6. If there is no such uncommented section in your ${USER_DIR}/monitoring/user-values-prom-operator.yaml file, copy the existing commented-out section, then uncomment and keep only the retention and retentionSize lines, and their 'parent' lines, e.g.:

prometheus:
  prometheusSpec:
    retention: 7d
    retentionSize: 20GiB

7. Edit the value of retention and, if you wish, retentionSize.
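For example, to keep metric data for 14 days up to a maximum of 40GiB (values chosen purely for illustration), the edited section might look like this:

prometheus:
  prometheusSpec:
    retention: 14d
    retentionSize: 40GiB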
Note that the units for retentionSize are indicated with a suffix of GiB (Gibibytes, meaning 2^30 bytes), but the units for storage are indicated with a suffix of Gi (which also means Gibibytes!). Thanks to my colleague Rob Collum for setting me straight on this point: the two suffixes here (GiB and Gi) are both power-of-two indicators and both have the same meaning: both indicate Gibibytes. If Gigabytes had been meant (and they aren't), the unit would be indicated as GB (and it is not). The point is, pay attention to the unit suffixes - they differ, but mean the same thing.

Note also that to expand a PVC, besides a storage class with AllowVolumeExpansion = true, you also need a storage provisioner plug-in in your cluster that can resize the volumes. In our workshop environment, with the nfs-client storage class, just specifying a new volume size like this did not work:

kubectl -n v4mmon patch pvc prometheus-v4m-prometheus-db-prometheus-v4m-prometheus-0 -p '{"spec":{"resources":{"requests":{"storage":"35Gi"}}}}'

In our workshop environment, the nfs-client storage class has AllowVolumeExpansion = true, and when I try to expand the Prometheus PVC by specifying a larger size, Lens reports the new, larger size. But in fact the PVC does not actually change size, and we get a Kubernetes error event for the PVC saying "Ignoring the PVC: didn't find a plugin capable of expanding the volume; waiting for an external controller to process this PVC." Therefore, at present the only method I have been able to use successfully for changing the size of the Prometheus PVC is to completely uninstall the entire monitoring stack, delete ALL of the monitoring namespace PVCs (in fact I prefer to just delete the whole namespace because it is simpler), and re-deploy the monitoring stack with the new Prometheus storage size. Obviously, deleting the PVCs destroys any previously collected metric data, and any other user-generated content such as custom Grafana dashboards, so I do not recommend it unless you are happy to lose that data.
8. Save your changes to the ${USER_DIR}/monitoring/user-values-prom-operator.yaml file.
9. Re-deploy the monitoring stack, with commands something like these:

cd ~/viya4-monitoring-kubernetes/
export USER_DIR=/path/to/your/user_dir/directory # Used by v4m scripts
./monitoring/bin/deploy_monitoring_cluster.sh
Follow the steps under 'See the Current Metric Data Retention Period and Retention Size' above to see if the retention period and retention size changed as you intended. Hopefully, they did! If not, review the steps above and make sure you followed them correctly.
By following the steps in this post, you should be able to fully manage the storage space used by your SAS Viya Monitoring for Kubernetes instance of Prometheus, and have it retain metric data for the optimum length of time that your available storage permits.
Many thanks to my colleague Raphaël Poumarede ( @RPoumarede ) for his help with parts of this post, and to my colleague Rob Collum ( @RobCollum ) for correcting an error in an earlier version of the note on GiB vs Gi (which are both Gibibytes, 2^30 bytes or 1024^3 bytes) vs GB (which is Gigabytes, where 1GB = 1000 MB = 1000^3 bytes = 10^9 bytes).
See you next time!