In my last post, we looked at Managing monitoring data retention period and size in SAS Viya Monitoring for Kubernetes. In this post, let's find out how much log data is currently being stored by OpenSearch or ElasticSearch, how big the PVCs used to store it are individually and all together, and how to change the log data retention period.
In SAS Viya Monitoring for Kubernetes (v4m for short), log messages and their contextual data are gathered by Fluent Bit, streamed to OpenSearch and displayed in OpenSearch Dashboards. Versions of v4m earlier than version 1.2.0 used ElasticSearch and Kibana, which were essentially the same things. As we will see, both the new and old versions of these tools manage log data retention and size in the same way, so we can explain them together.
Incidentally, log retention is covered by recently updated documentation in SAS Help Center here, in place of the README files that were part of the SAS Viya Monitoring for Kubernetes project before. I think this is a big improvement.
Contrary to what you might expect, old log data is not deleted as soon as it becomes more than 3 days old. Rather, a whole day's worth of data is deleted at a time, once per day. Here's how it works in detail.
SAS Viya Monitoring for Kubernetes configures OpenSearch to keep log data in many separate indices, a bit like table partitions in other databases. Among other things, a new set of indices is created for each UTC day, and that 'one set of indices per UTC day' behaviour is the part most relevant for this post.
Each day at 00:00 UTC, OpenSearch stops writing log data to the current set of indices and starts writing it to a new set. Separately, a job runs every few minutes in OpenSearch and checks whether the creation date of each index is more than 3 days ago. If it is, the whole index is deleted. OpenSearch and ElasticSearch don't delete individual log messages from indices as they become older than 3 days; they delete an entire day's worth of indices, and thus that day's entire set of log messages, more or less all at once.
With the default log data retention period of 3 days, let's imagine you look at the system at any time on 4th January UTC. OpenSearch will be writing log data as it arrives into the set of indices for 4th January. (There is a date in the index name).
The sets of indices which were created moments after midnight on 3rd and 2nd January, each containing all the log data from one of those dates, will still be stored in the PVC and loaded into memory.
However, just after midnight UTC on 4th January, the set of indices containing all the log data collected on 1st January became more than 3 days old, so the maintenance job deleted it. This means that in practice, the 3-day retention period results in log data being available for the past 2 days plus the time since the most recent midnight UTC.
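If you would like to see this daily-index pattern for yourself, one option is to query the _cat/indices API. The following is only a sketch, not the documented v4m method: it assumes the log monitoring namespace is v4mlog (more on namespaces later in this post), that the OpenSearch REST API is exposed inside the cluster by a service called v4m-search on port 9200, and that you substitute your own OpenSearch admin credentials - check the real service name with kubectl get svc in your log monitoring namespace first.
# Forward the OpenSearch REST API to your workstation (the service name here is an assumption - verify it first)
kubectl -n v4mlog port-forward svc/v4m-search 9200:9200 &
# List the viya_logs indices with their creation dates, document counts and sizes
curl -s -k -u admin:yourpassword "https://localhost:9200/_cat/indices/viya_logs-*?v&h=index,creation.date.string,docs.count,store.size"
If the mechanism works as described above, you should see one set of indices per UTC day, each with its date in the index name, and nothing whose creation date is older than the retention period.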
One more thing: in contrast to Prometheus's metric data, there is no enforced limit to log data size, other than the size of the PVC used to store the log data. It is quite possible to fill that PVC, which causes OpenSearch to stop working properly, so we need to know how to avoid that happening.
The method for seeing how much storage is available and how much is actually being used depends on the type of Storage Class used for your Persistent Volume Claims. In our workshop environments we use a simple NFS share, but in a production environment you should use something better, like Azure Files or S3.
Note: NFS storage is not ideal in a cloud environment, and we don't recommend it for production deployments. However, it is easy (and free) to use in our RACE-based classroom and lab environments, and it is therefore the only persistent Storage Class currently available in them.
In our workshop environments, the default (and only) storage class is nfs-client. The files written to PVCs ultimately end up in a directory structure in the filesystem on the sasnode01 host in each collection, since that is where the cluster's NFS server runs. From a shell on the sasnode01 host, we can browse that filesystem and find the data under /srv/nfs/kubedata. This is highly implementation-specific, and there is little chance a customer deployment would be set up like this. Talk to your architect or Kubernetes administrator, and they may be able to suggest something similar to the following that makes sense in your environment.
By default, our SAS Viya Monitoring for Kubernetes project configures and deploys OpenSearch or ElasticSearch with three pods in its v4m-search statefulset (OpenSearch, v4m 1.2.0 and later) or its v4m-es-data statefulset (ElasticSearch, v4m 1.1.8 and earlier).
Each of these three pods has its own PVC. The aggregate storage size of these three PVCs is the total storage available to OpenSearch or ElasticSearch. Exactly what data is stored in these PVCs varies slightly between the two versions - OpenSearch uses just one set of PVCs for all of its data, whereas ElasticSearch has a second, smaller set of PVCs for 'master data', which I presume stores provided and user-created objects and other management data, but our focus here will be on the main data PVCs.
Our SAS Viya Monitoring for Kubernetes project is configured to request each of those PVCs to be 30 GiB (30 gibibytes = 30 x 2^30 bytes). So across the 3 pods, by default OpenSearch or ElasticSearch requests 3 x 30 GiB = 90 GiB of storage in total.
You can see how this is specified either in the user-values-opensearch.yaml file (OpenSearch) or the user-values-elasticsearch-open.yaml file (ElasticSearch) in the v4m USER_DIR/logging directory, or in the default Helm charts referenced at the top of each of those files, if those files don't override the defaults. Look for a replicas value giving how many pods there should be in the statefulset, and (often in quite a separate place) a persistence section containing a size value which may be something like 30Gi. So, in theory, that's what we should actually have.
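If you would rather check these settings from the command line than open the files in an editor, something along these lines may help. It assumes USER_DIR is set as an environment variable pointing to your v4m customization directory, as in the standard v4m layout; adjust the paths if yours differs.
# Show any replica count and persistence settings overridden in the OpenSearch user values file
grep -n -A3 -E 'replicas|persistence' $USER_DIR/logging/user-values-opensearch.yaml
# For v4m 1.1.8 and earlier, check the ElasticSearch equivalent instead
grep -n -A3 -E 'replicas|persistence' $USER_DIR/logging/user-values-elasticsearch-open.yaml
If grep finds nothing, the defaults from the Helm chart referenced at the top of the file apply.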
In theory, there's no difference between theory and practice. But in practice, there is. You can find the actual size of the PVCs in the log monitoring namespace using kubectl or Lens. In kubectl, try something like this (where logging is the namespace where the log monitoring components of SAS Viya Monitoring for Kubernetes are deployed):
kubectl get pvc -n logging
Look for the number and size of the v4m-search-v4m-search-* PVCs (OpenSearch) or the data-v4m-es-data-* PVCs (ElasticSearch).
In one of our workshop deployments with v4m 1.2.1, the logging (or if you prefer, log monitoring) namespace is v4mlog, so here's what that looks like for OpenSearch:
[cloud-user@hostname logging]$ kubectl get pvc -n v4mlog
NAME                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
v4m-search-v4m-search-0   Bound    pvc-82514968-1728-4bc8-95b7-708b08043af1   30Gi       RWO            nfs-client     25h
v4m-search-v4m-search-1   Bound    pvc-82ffb122-1730-4780-825c-ce8887e0b01f   30Gi       RWO            nfs-client     25h
v4m-search-v4m-search-2   Bound    pvc-47e35a8e-423c-4490-bbb7-a17e5e0def17   30Gi       RWO            nfs-client     25h
In another of our workshop deployments with v4m 1.1.8, this is the equivalent for ElasticSearch:
[cloud-user@hostname ~]$ kubectl get pvc -n v4mlog
NAME                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-v4m-es-data-0     Bound    pvc-ed9314dc-06b4-471a-9a37-499cf42ddcc9   30Gi       RWO            nfs-client     19d
data-v4m-es-data-1     Bound    pvc-4127753a-528b-40ec-93c8-3826c6397d5c   30Gi       RWO            nfs-client     19d
data-v4m-es-data-2     Bound    pvc-804a632c-97a4-4e7b-942d-55ba6fd6828c   30Gi       RWO            nfs-client     19d
data-v4m-es-master-0   Bound    pvc-18252229-6e88-4e44-9037-15ccd68664f0   8Gi        RWO            nfs-client     19d
data-v4m-es-master-1   Bound    pvc-2047fd01-0c15-4f90-842a-a86129647580   8Gi        RWO            nfs-client     19d
data-v4m-es-master-2   Bound    pvc-a68baa6b-fde8-47d4-9de7-2970d0be9d48   8Gi        RWO            nfs-client     19d
From this we can see that the total data storage for OpenSearch or ElasticSearch is 3 x 30Gi = 90Gi.
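If you would rather not add the capacities up by hand, a rough one-liner can do it for you. This is a sketch rather than a robust script: it assumes your log monitoring namespace is v4mlog and that all of the relevant PVCs report their capacity in Gi.
# Sum the capacity of the OpenSearch data PVCs (for ElasticSearch, change the grep prefix to data-v4m-es-data-)
kubectl get pvc -n v4mlog -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.storage}{"\n"}{end}' \
  | grep '^v4m-search-v4m-search-' \
  | awk '{gsub("Gi","",$2); total+=$2} END {print total "Gi"}'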
Here is the OpenSearch PVC set in Lens:
Either of these ways of seeing the PVC size should work in any environment, as long as you have kubectl or Lens, a kube config file that lets you access the cluster, and you know your log monitoring namespace name.
So we know how big the PVCs are. How much of that space is being used?
This is more implementation-specific. Since we are using NFS for our PVCs, as explained above the data ends up on sasnode01 under /srv/nfs/kubedata. Let's look at how to find the amount of data stored on our NFS share - the actual values below are not typical; they are likely much smaller than you should expect in a real production environment.
If you have v4m version 1.2.0 or later, something along the lines of this command might show how big the data in the three OpenSearch PVCs actually is:
for f in /srv/nfs/kubedata/v4mlog-v4m-search-v4m-search-* ; do sudo du -xsh $f; done
Substitute the path to the kubedata directory in your NFS shared volume in place of /srv/nfs above. Here is some example output from a very lightly-used workshop environment:
[cloud-user@hostname logging]$ for f in /srv/nfs/kubedata/v4mlog-v4m-search-v4m-search-* ; do sudo du -xsh $f; done
4.5G    /srv/nfs/kubedata/v4mlog-v4m-search-v4m-search-0-pvc-82514968-1728-4bc8-95b7-708b08043af1
3.1G    /srv/nfs/kubedata/v4mlog-v4m-search-v4m-search-1-pvc-82ffb122-1730-4780-825c-ce8887e0b01f
5.7G    /srv/nfs/kubedata/v4mlog-v4m-search-v4m-search-2-pvc-47e35a8e-423c-4490-bbb7-a17e5e0def17
So in this environment, there is currently 4.5G + 3.1G + 5.7G = 13.3GiB of log data. Well, actually, slightly less than that. These PVC directories also contain a few MB of other stored data. Remember this example is intended to show how to calculate disk usage on our NFS storage, not to give estimates of typical usage.
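Incidentally, du can produce that total for you: adding the -c flag to the same command appends a grand total line, so there is no arithmetic to do. The same assumptions about the NFS path and namespace apply, and the same trick works for the ElasticSearch directories shown below.
# -c appends a 'total' line covering all three PVC directories
sudo du -xcsh /srv/nfs/kubedata/v4mlog-v4m-search-v4m-search-*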
If you have v4m version 1.1.8 or earlier, something along the lines of this command might show how big the data in the three ElasticSearch data PVCs actually is:
for f in /srv/nfs/kubedata/v4mlog-data-v4m-es-data-* ; do sudo du -xsh $f; done
Substitute the path to the kubedata directory in your NFS shared volume in place of /srv/nfs above. Here is some example output from a very lightly-used workshop environment:
[cloud-user@hostname ~]$ for f in /srv/nfs/kubedata/v4mlog-data-v4m-es-data-* ; do sudo du -xsh $f; done
1.3G    /srv/nfs/kubedata/v4mlog-data-v4m-es-data-0-pvc-ed9314dc-06b4-471a-9a37-499cf42ddcc9
511M    /srv/nfs/kubedata/v4mlog-data-v4m-es-data-1-pvc-4127753a-528b-40ec-93c8-3826c6397d5c
854M    /srv/nfs/kubedata/v4mlog-data-v4m-es-data-2-pvc-804a632c-97a4-4e7b-942d-55ba6fd6828c
So in this environment, there is currently 1.3G + (511/1024)G + (854/1024)G = 2.63GiB of log data. Again, this is meant to show the method for calculating usage, not to give estimates of typical usage.
Another way to roughly estimate log data size - and one which does not require command-line access to the servers or a kube config file - is to open the OpenSearch Dashboards (or Kibana) Index Management page and switch to the Indices tab. This tab shows a table of all the indices currently held in OpenSearch (or ElasticSearch), with statistics for each index including its total size.
With a bit of patience, you could manually add up the 'Total size' values of each row. The table is paged, so make sure you include each page. It is quite possible that there is an API or command-line way to do this; I have not explored the OpenSearch or ElasticSearch APIs or command-line tools in detail, but here is one idea.
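Reusing the port-forward and _cat/indices sketch from earlier in this post (the same caveats about service name and credentials apply), the cat API can report index sizes as raw byte counts, which awk can then total:
# bytes=b makes store.size a plain byte count, which is easy to sum
curl -s -k -u admin:yourpassword "https://localhost:9200/_cat/indices/viya_logs-*?h=index,store.size&bytes=b" \
  | awk '{total += $2} END {printf "Total: %.2f GiB\n", total/1024/1024/1024}'
The store.size column counts primary and replica shards together, so the total should be broadly comparable with the du figures above.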
The rate at which log data is collected can be quite volatile over time, as it depends very much on how heavily SAS Viya and other applications in the Kubernetes cluster are used, as well as on whether (and for how long) logging thresholds are changed to, for example, increase log detail. Extrapolating log data growth over time is therefore more of an art than a science, but it is worth estimating (or guesstimating) based on what you know about your collection's current level of activity, historical log data size, and how you are likely to change logging levels. Then monitor log data size closely. This may reveal that your early estimates were quite inaccurate, but it's better than not estimating and monitoring at all.
To see the current log data retention period, follow these steps:
1. Open OpenSearch Dashboards (or Kibana) in a web browser:
   - In our workshop environments with v4m 1.2.0 or later, OpenSearch Dashboards is at https://osd.ingress_controller_hostname/ where ingress_controller_hostname is the full hostname of the sasnode01 host in our Kubernetes cluster. For example, in one RACE collection I happen to have running, this is http://osd.pdcesx02020.race.sas.com. We are considering changing the hostname prefix from osd to something else to avoid confusion with SAS/ODS, so if you don't find it this way, check the workshop instructions for where to find OpenSearch Dashboards.
   - In our workshop environments with v4m 1.1.8 or earlier, Kibana is at https://kibana.ingress_controller_hostname/ where ingress_controller_hostname is the full hostname of the sasnode01 host in our Kubernetes cluster.
   - In other deployments, OpenSearch Dashboards may be at http://ingress_controller_hostname/dashboards and Kibana may be at http://ingress_controller_hostname/kibana, where ingress_controller_hostname is the hostname of your Kubernetes cluster's ingress controller.
   - If in doubt, run kubectl get ingress -n logging where logging is the namespace in which the SAS Viya Monitoring for Kubernetes logging (or log monitoring) components are deployed. Look for an ingress named v4m-osd or v4m-es-kibana-ing, depending on whether you have v4m 1.2.0 or later, or 1.1.8 or earlier, and browse to the host shown for that v4m-osd or v4m-es-kibana-ing ingress.
2. Log in as admin.
3. If prompted to select a tenant, choose an appropriate one (such as cluster_admins), check the checkbox for 'Remember my selection next time I log in from this device' if there is one, and click Confirm.
4. Open the Index Management page and find the viya_log_idxmgmt_policy index management policy. In its policy definition, look for the min_index_age value, which by default is "3d".
Here, '3d' means 3 days. The only units that are sensible in the context of our SAS Viya Monitoring for Kubernetes configuration are whole days.
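If you prefer the command line, the same retention setting can be read from the Index State Management API. This again reuses the port-forward sketch from earlier, and the details are assumptions to verify rather than documented v4m behaviour: in particular, on the older ElasticSearch/Open Distro stack the URL prefix is _opendistro rather than _plugins.
# Fetch the policy definition and pick out the retention setting
curl -s -k -u admin:yourpassword "https://localhost:9200/_plugins/_ism/policies/viya_log_idxmgmt_policy" \
  | grep -o '"min_index_age": *"[^"]*"'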
There are two steps to changing the log data retention period: first, change the index management policy; second (optionally), re-apply that policy to existing indices.
Let's start by changing the policy. From the page where you see the index management policy details, follow these steps:
1. Edit the policy, and in its definition find "min_index_age": "3d".
2. Change it to a new value, e.g. "min_index_age": "4d". Then click Update.
Read the OpenSearch or ElasticSearch documentation if you want to know more about making your own index management policy; that is beyond the scope of this post. We are only trying to change the retention period in an existing policy.
Next, apply the changed policy to existing SAS Viya indices (a command-line alternative is sketched after these steps):
1. On the Indices tab of the Index Management page, identify the SAS Viya log indices: start typing viya_logs-, and accept the suggested value of viya_logs-*.
2. Apply the policy to those indices, selecting viya_log_idxmgmt_policy from the dropdown list.
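For completeness, the Index State Management API offers a command-line equivalent, again via the earlier port-forward. This is based on my understanding of the ISM plugin rather than on the v4m documentation: the add endpoint attaches a policy to indices that are not yet managed by one, and there is a corresponding change_policy endpoint for indices that already are; on the older ElasticSearch/Open Distro stack the URL prefix is _opendistro rather than _plugins.
# Attach the named policy to all matching indices that do not already have one
curl -s -k -u admin:yourpassword -X POST "https://localhost:9200/_plugins/_ism/add/viya_logs-*" \
  -H 'Content-Type: application/json' \
  -d '{"policy_id": "viya_log_idxmgmt_policy"}'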
When you change the log retention period interactively like this, it is recommended to also update the LOG_RETENTION_PERIOD value in your USER_DIR/logging/user.env file, to keep it consistent with this change. This is sensible because, if you later remove and redeploy the v4m logging stack, your new configuration (which in this case keeps indices for e.g. 4 days) is preserved, instead of reverting to the original configuration (which keeps indices for 3 days). We made a similar recommendation for metric data in my previous post.
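For example, to match the 4-day retention period used above, the relevant line in USER_DIR/logging/user.env would look something like this (the setting name comes from the recommendation above; as I understand it, the value is a whole number of days):
# Keep the deployed configuration in line with the policy change made in OpenSearch Dashboards
LOG_RETENTION_PERIOD=4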
Hopefully this gives a reasonably complete explanation of finding the PVC size and usage, and of finding and changing the log data retention period. I have intentionally not covered changing the PVC size, in either this post or my previous one. That is a sufficiently complex topic that it deserves a post of its own, mostly because the type of storage class used for your PVCs greatly affects the method for changing its size, and also greatly affects whether you can change the PVC size without dropping and re-creating all the data currently stored in it. See you next time!
Find more articles from SAS Global Enablement and Learning here.