In SAS Viya Monitoring for Kubernetes, metric data is gathered by Prometheus and displayed in Grafana. As this blog post explains, Prometheus keeps metric data until either the amount of data exceeds a specified retention size, at which point it begins deleting the oldest data to make way for new metric data, or the metric data becomes older than the specified retention time. If the retention time limit is reached first, metric data older than that limit is deleted even if the total size of the retained data is smaller than the retention size.
In this post, we'll see how to calculate how much space is being used to store Prometheus metric data in your cluster, and how to change both the retention period and retention size.
Incidentally, my colleague Raphaël Poumarede (@RPoumarede) has written here recently about managing many aspects of storage, in his posts Take care of your Viya storage before it takes care of you – Part 1: Planning ahead and anticipating and Take care of your Viya storage before it takes care of you – Part 2: purge and expand. They are fantastic posts - do read them!
The method for seeing how much storage is available and how much is actually being used depends on the type of Storage Class used for your Kubernetes Persistent Volume Claims (PVCs). In our internal GEL workshop environments we use a simple NFS share, but in a production environment you should use something better, like Azure Files or S3.
Note: NFS storage is not ideal in a cloud environment, and we don't recommend it for production deployments. However, it is easy (and free) to use in our RACE-based classroom and lab environments, and it is therefore the only persistent Storage Class currently available in them.
In our workshop environments, the default (and only available) storage class is nfs-client. The files written to PVCs are ultimately written to a directory structure in the filesystem on the sasnode01 host in each collection, since that is where the cluster's NFS server runs. From a shell on the sasnode01 host, we can browse that filesystem and find the data under /srv/nfs/kubedata. This is highly implementation-specific. There is little chance a customer deployment would be set up like this. Talk to your architect or Kubernetes administrator, and they may be able to suggest something similar to the following that makes sense in your environment.
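If you are not sure where your NFS-backed Persistent Volumes actually land on disk, the PersistentVolume objects themselves usually record the NFS server and export path. Here is a minimal sketch of how you might list them; it assumes your PVs are plain NFS volumes (other volume types will simply show empty fields):

kubectl get pv -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nfs.server}{"\t"}{.spec.nfs.path}{"\n"}{end}'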
You can find the size of the PVCs in the monitoring namespace using kubectl or Lens. In kubectl, try something like this (where monitoring is the namespace where the monitoring components of SAS Viya Monitoring for Kubernetes are deployed):
kubectl get pvc -n monitoring
Look for the size of the PVC for Prometheus. In our workshop deployments, the monitoring namespace is v4mmon, so here's what that looks like (the CAPACITY column shows each PVC's size):
[cloud-user@rext03-0272 ~]$ kubectl get pvc -n v4mmon
NAME                                                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alertmanager-v4m-alertmanager-db-alertmanager-v4m-alertmanager-0   Bound    pvc-5dfcb9bf-984d-4d19-911d-a573cd6390b0   10Gi       RWO            nfs-client     17h
prometheus-v4m-prometheus-db-prometheus-v4m-prometheus-0           Bound    pvc-5538cdf0-a591-462c-bfb1-dc7d5b37b12c   25Gi       RWO            nfs-client     17h
v4m-grafana                                                        Bound    pvc-512d42ac-ae98-4813-a8d0-c377f0fb3738   5Gi        RWO            nfs-client     17h
From this we can see that the size of the Prometheus PVC is 25Gi. The same information is also visible in Lens.
Either of these ways of seeing the PVC size should work in any environment, provided you have kubectl or Lens, a kubeconfig file that lets you access the cluster, and you know the name of your monitoring namespace.
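If you only need the capacity of the Prometheus PVC, a jsonpath query can pull out just that value. This is a hedged one-liner, assuming our workshop namespace v4mmon and the PVC name shown in the output above:

kubectl -n v4mmon get pvc prometheus-v4m-prometheus-db-prometheus-v4m-prometheus-0 -o jsonpath='{.status.capacity.storage}{"\n"}'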
So we know how big the PVC is. But how much of it is actually being used, and once the system has been running for as long as the metric data retention period (which we will come to in a moment), how much of that space is likely to be in use?
This is more implementation-specific. Since we are using NFS for our PVCs, and as explained above the data ends up on sasnode01 under /srv/nfs/kubedata, this command will show how big the data in the Prometheus PVC actually is:
for f in /srv/nfs/kubedata/v4mmon-prometheus* ; do sudo du -xsh $f; done
Substitute the path to the kubedata directory in your NFS shared volume in place of /srv/nfs above. Here is some example output from a very lightly-used workshop environment:
[cloud-user@rext03-0272 monitoring]$ for f in /srv/nfs/kubedata/v4mmon-prometheus* ; do sudo du -xsh $f; done
1.9G    /srv/nfs/kubedata/v4mmon-prometheus-v4m-prometheus-db-prometheus-v4m-prometheus-0-pvc-5538cdf0-a591-462c-bfb1-dc7d5b37b12c
So in this environment, roughly 18 hours after it started up, there is currently 1.9Gi of metric data in a PVC with a nominal capacity of 25Gi - something like 8% used.
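If you cannot browse the NFS server's filesystem, a less implementation-specific alternative is to check disk usage from inside the Prometheus container itself. This is only a sketch: it assumes the default pod name prometheus-v4m-prometheus-0, the container name prometheus and the usual /prometheus data mount point, any of which could differ in your deployment:

kubectl -n v4mmon exec prometheus-v4m-prometheus-0 -c prometheus -- df -h /prometheus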
The rate at which metric data is collected is fairly constant over time, so we can extrapolate that after 7 days we would have roughly (2 / 17) * 24 * 7 ≈ 20 Gi of data, and the PVC might reach about 80% full.
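If you want to repeat that back-of-the-envelope projection with your own numbers, a one-line awk calculation does the arithmetic (the figures here are just the example values from above):

# (GiB used so far / hours running) * 24 hours * 7 days
awk 'BEGIN { printf "Projected 7-day usage: %.1f Gi\n", (1.9 / 17) * 24 * 7 }'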
However, as Raphaël explains in his post, when you use an NFS server as the backend of your PersistentVolumeClaims (PVCs), the claimed size is not enforced as a limit. So if the stored metric data were to grow beyond 25Gi, Kubernetes would do nothing to prevent it from growing until it fills the disk. This is one of several reasons why it is not optimal to use a single shared NFS mount for all our PVCs, or really to use NFS at all!
Fortunately, Prometheus has a feature called retention size that limits how much metric data is kept; in SAS Viya Monitoring for Kubernetes deployments it defaults to a maximum of 20GiB, and the limit can easily be changed. This means that even when it is not practical to change the Prometheus PVC size, you can still control quite effectively how much storage space Prometheus uses.
To see the current metric data retention period and retention size, follow these steps:
1. Point your web browser at the Prometheus web interface. In our workshop environments, the URL is https://prometheus.ingress_controller_hostname/, where ingress_controller_hostname is the full hostname of the sasnode01 host in our Kubernetes cluster. In other deployments it may be http://ingress_controller_hostname/prometheus, where ingress_controller_hostname is the hostname of your Kubernetes cluster's ingress controller.
2. If you are not sure of the hostname, run kubectl get ingress -n monitoring, where monitoring is the namespace in which the SAS Viya Monitoring for Kubernetes monitoring components are deployed. Look for an ingress named v4m-prometheus; the hostname you need is the one shown for the v4m-prometheus ingress.
3. In the Prometheus web interface, open https://prometheus.ingress_controller_hostname/flags and look for the flags whose names begin with --storage.tsdb.retention.
In our workshop environment, the value of --storage.tsdb.retention.time shown on the flags page is 1w, meaning 1 week, and the value of --storage.tsdb.retention.size is 20GiB. Those are the defaults, but they may have other values in your deployment.
Aside: the flag --storage.tsdb.retention has been deprecated since Prometheus 2.8; the very earliest release of SAS Viya Monitoring for Kubernetes used Prometheus 2.21.0 (or thereabouts - in ops4viya version 0.1.0), so this flag was already deprecated long before SAS Viya Monitoring for Kubernetes was first released.
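Incidentally, if you prefer the command line to the web interface, the same retention settings can be read from the Prometheus custom resource that the operator manages. A minimal sketch, assuming our workshop namespace v4mmon and the resource name v4m-prometheus:

kubectl -n v4mmon get prometheus v4m-prometheus -o jsonpath='{.spec.retention}{" "}{.spec.retentionSize}{"\n"}'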
As Raphaël explains in his blog post (and this is very similar to a process described in the Logging stack's Log_Retention.md), you can quickly change the Prometheus storage retention time and size by running kubectl patch commands something like these, specifying your monitoring namespace in place of our v4mmon:
kubectl -n v4mmon patch prometheus v4m-prometheus --type merge --patch '{ "spec": { "retention": "2d" }}'
kubectl -n v4mmon patch prometheus v4m-prometheus --type merge --patch '{ "spec": { "retentionSize": "5GiB" }}'
Thanks to the Prometheus operator, this change is noticed and the pods are restarted with the new values. However, if you do not also change the retention period and retention size in your ${USER_DIR}/monitoring/user-values-prom-operator.yaml file as described in the next section, they will revert to the values defined in that file if you ever undeploy and redeploy the monitoring stack, for example when you scale your SAS Viya deployment down to save resources when you are not using it.
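If you want to see the operator do its work, you can watch the pods in the monitoring namespace being restarted after you run the patch commands; again, this assumes our workshop namespace v4mmon:

kubectl -n v4mmon get pods -w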
To change the values in a way which will persist across an undeployment and redeployment, follow the process below as well as (or instead of) patching the Prometheus custom resource.
To change the metric data retention period and/or retention size (note: not the PVC storage size) so that the changes persist if you undeploy and redeploy the monitoring stack, follow these steps:

1. These steps assume that you deployed SAS Viya Monitoring for Kubernetes by running monitoring/bin/deploy_monitoring_cluster.sh.
2. You need to know the location of your viya4-monitoring-kubernetes directory, and the location of your local customization files, otherwise known as your USER_DIR.
3. Find both the viya4-monitoring-kubernetes directory and the USER_DIR directory on the host machine from which you are deploying SAS Viya Monitoring for Kubernetes. These tips may help:
- In our workshop environments, we deploy from sasnode01. The viya4-monitoring-kubernetes directory is /home/cloud-user/viya4-monitoring-kubernetes, and the USER_DIR directory is /home/cloud-user/.v4m on sasnode01.
- The USER_DIR location is usually defined by setting an environment variable called USER_DIR and giving it a value of a filesystem path.
- Try running locate user.env on the machine you are using to deploy SAS Viya Monitoring for Kubernetes. Of the many directories this may reveal, one might be your USER_DIR, and you might recognize it when you see it.
4. In your USER_DIR directory, look for a file at ${USER_DIR}/monitoring/user-values-prom-operator.yaml. If you don't see one, copy viya4-monitoring-kubernetes/monitoring/user-values-prom-operator.yaml to ${USER_DIR}/monitoring/user-values-prom-operator.yaml.
5. In the ${USER_DIR}/monitoring/user-values-prom-operator.yaml file, find an uncommented section that looks roughly like this - after previous customizations, the values and perhaps even the structure will differ from this:

prometheus:
  enabled: true
  prometheusSpec:
    externalUrl: http://host.mycluster.example.com:31090
    retention: 7d
    retentionSize: 20GiB
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: myCustomStorageClass
          resources:
            requests:
              storage: 25Gi
6. If there is no such uncommented section in your ${USER_DIR}/monitoring/user-values-prom-operator.yaml file, copy the existing commented-out section, then uncomment and keep only the retention and retentionSize lines, and their 'parent' lines, e.g.:

prometheus:
  prometheusSpec:
    retention: 7d
    retentionSize: 20GiB

7. Edit the value of retention and, if you wish, retentionSize.
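For example, to keep metric data for 14 days up to a maximum of 40GiB (values chosen purely for illustration), the edited section might look like this:

prometheus:
  prometheusSpec:
    retention: 14d
    retentionSize: 40GiB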
Note that the units for retentionSize are indicated with a suffix of GiB (Gibibytes, meaning 2^30 bytes), but the units for storage are indicated with a suffix of Gi (which also means Gibibytes!). Thanks to my colleague Rob Collum for setting me straight on this point: the two suffixes here (GiB and Gi) are both power-of-two indicators and both have the same meaning: both indicate Gibibytes. If Gigabytes had been meant (and they aren't), the unit would be indicated as GB (and it is not). The point is, pay attention to the unit suffixes - they differ, but mean the same thing.

Note also that to expand a PVC, besides a storage class with AllowVolumeExpansion = true, you also need a storage provisioner plug-in in your cluster that can resize the volumes. In our workshop environment, with the nfs-client storage class, just specifying a new volume size like this did not work:

kubectl -n v4mmon patch pvc prometheus-v4m-prometheus-db-prometheus-v4m-prometheus-0 -p '{"spec":{"resources":{"requests":{"storage":"35Gi"}}}}'

In our workshop environment, the nfs-client storage class has AllowVolumeExpansion = true, and when I try to expand the Prometheus PVC by specifying a larger size, Lens reports the new, larger size. But in fact the PVC does not actually change size, and we get a Kubernetes error event for the PVC saying "Ignoring the PVC: didn't find a plugin capable of expanding the volume; waiting for an external controller to process this PVC." Therefore, at present the only method I have been able to use successfully for changing the size of the Prometheus PVC is to completely uninstall the entire monitoring stack, delete ALL of the monitoring namespace PVCs (in fact I prefer to just delete the whole namespace because it is simpler), and re-deploy the monitoring stack with the new Prometheus storage size. Obviously, deleting the PVCs destroys any previously collected metric data, and any other user-generated content such as custom Grafana dashboards, so I do not recommend it unless you are happy to lose that data.
8. Save your changes to the ${USER_DIR}/monitoring/user-values-prom-operator.yaml file.
9. Re-deploy the monitoring stack, with commands something like these:

cd ~/viya4-monitoring-kubernetes/
export USER_DIR=/path/to/your/user_dir/directory # Used by v4m scripts
./monitoring/bin/deploy_monitoring_cluster.sh
Follow the steps under 'See the Current Metric Data Retention Period and Retention Size' above to see if the retention period and retention size changed as you intended. Hopefully, they did! If not, review the steps above and make sure you followed them correctly.
By following the steps in this post, you should be able to fully manage the storage space used by your SAS Viya Monitoring for Kubernetes instance of Prometheus, and have it retain metric data for the optimum length of time that your available storage permits.
Many thanks to my colleague Raphaël Poumarede ( @RPoumarede ) for his help with parts of this post, and to my colleague Rob Collum ( @RobCollum ) for correcting an error in an earlier version of the note on GiB vs Gi (which are both Gibibytes, 2^30 bytes or 1024^3 bytes) vs GB (which is Gigabytes, where 1GB = 1000 MB = 1000^3 bytes = 10^9 bytes).
See you next time!