A New Tool to Monitor CAS_DISK_CACHE Usage

SAS Cloud Analytic Services is - first and foremost - a high-performance in-memory analytics engine. In the ideal, when operating with everything it needs in memory, CAS operates as fast as the system's electrons can pass through it. However, we don't often enjoy the ideal work environment. SAS Viya and CAS must often support uncoordinated workloads as a mix of users perform ad hoc tasks of varying complexity on data of undefined sizes. In such scenarios, having a safety mechanism to fall back on when RAM alone isn't sufficient can mean the difference between getting the job done and a crashed system. That's where the CAS_DISK_CACHE comes in…

CAS_DISK_CACHE is disk storage space primarily relied upon to act as part of a virtual memory scheme for CAS tables. It has other uses as well, such as block replication to ensure data availability if a CAS worker goes offline. Effectively, it acts as a buffer to relieve pressure on RAM and improve CAS's resiliency and availability in dealing with unanticipated short-term challenges.

Because CAS_DISK_CACHE is an important part of CAS operations, provision it adequately and then monitor and maintain it over time. And in Viya 3.5, we have a new tool available to help with monitoring: the builtins.getCacheInfo action.

SAS Program Code

The builtins.getCacheInfo action is, as the name implies, built into CAS as of Viya 3.5. You don't need to define or load it. It's already there, just waiting for you to use it. So, let's see it in action:

/* if needed */
*options cashost="cashost.sas.com" casport=5570 authinfo="/path/to/.authinfo"; 
*title2 "<cas server instance> -- cashost.sas.com:5570";

cas sess1; 

proc cas; 
    session sess1; 
    accessControl.assumeRole / adminRole="superuser";       
    builtins.getCacheInfo result=results; 
    describe results; 
run;
    print results.diskCacheInfo; 
run;
quit; 

cas sess1 terminate;

This code example shows how to call builtins.getCacheInfo:

If needed, define which host the CAS controller is running on, the port it's listening at, and the location of an .authinfo file with administrator-level credentials that have the ability to assume the superuser role.
Connect to a new CAS session.
Run PROC CAS, assuming the superuser role, invoking the builtins.getCacheInfo action, and capturing the resulting data.
Print out the results.

The results that are returned in the SAS listing output will look similar to:

diskCacheInfo: Results from builtins.getCacheInfo

Node	                    NodePath  Capacity	Free_Mem  %_Used
    cashost.sas.com    /cas/cctemp01	100 Gb	   86 Gb    13.9
    cashost.sas.com    /cas/cctemp02	100 Gb	   86 Gb    13.9
casworker01.sas.com	/cas/cache01	191 Gb	  161 Gb    15.6
casworker01.sas.com	/cas/cache02	191 Gb	  161 Gb    15.6
casworker01.sas.com	/cas/cache03	191 Gb	  161 Gb    15.6
casworker02.sas.com	/cas/cache01	191 Gb	  162 Gb    15.2
casworker02.sas.com	/cas/cache02	191 Gb	  162 Gb    15.2
casworker02.sas.com	/cas/cache03	191 Gb	  162 Gb    15.2

Of immediate note:

The disk space usage of CAS_CONTROLLER_TEMP is also reported. It only applies to the MPP CAS Controller host (here, that's cashost.sas.com).
CAS_DISK_CACHE (and CAS_CONTROLLER_TEMP) can be defined to include multiple directory paths.
Results show the incorrect units. A lowercase "b" would indicate "bits", but these numbers are indeed for "bytes" which should be shown as an uppercase "B". That is, GB not Gb. You can confirm this for yourself by querying the OS directly. This is a known issue with a scheduled fix.
The usage numbers are for the entire file system(s) where CAS_DISK_CACHE is located - not only the data blocks that CAS actually stores there. The two match only when the file system(s) is dedicated exclusively to CAS cache (which is considered the ideal approach).

The results are arranged by the CAS Controller's CAS_CONTROLLER_TEMP space followed by the CAS Workers' CAS_DISK_CACHE space as defined per directory. The %_Used column is what you're most likely to want to monitor closely as an indicator of whether the cache space is becoming overloaded. I recommend watching very closely once it reaches 70% or more. But keep in mind that depending on the available space, you might just be one large table load away from filling it all the way.
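As a minimal sketch of that kind of watch, the following Python takes rows shaped like the report above and flags any path at or above a threshold. The row layout, the flag_busy_paths name, and the sample rows are illustrative assumptions, not part of the action itself. The percentage is converted with float() because the action appears to return it as character data.

```python
# Hypothetical rows copied from the sample report: (node, path, percent used).
SAMPLE_ROWS = [
    ("cashost.sas.com", "/cas/cctemp01", 13.9),
    ("casworker01.sas.com", "/cas/cache01", 15.6),
    ("casworker02.sas.com", "/cas/cache01", 15.2),
]


def flag_busy_paths(rows, threshold=70.0):
    """Return the rows whose percent-used value is at or above the threshold."""
    return [row for row in rows if float(row[2]) >= threshold]


if __name__ == "__main__":
    busy = flag_busy_paths(SAMPLE_ROWS)
    if busy:
        for node, path, pct in busy:
            print(f"WARNING: {node} {path} is {pct}% used")
    else:
        print("All cache paths are below the threshold")
```

The 70% default simply mirrors the rule of thumb above; pick whatever level suits the size of the tables you expect to load.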

Note that the builtins.getCacheInfo action makes no assumptions about any underlying file systems or storage arrangements. It's up to you to bring that intimate knowledge of the underlying physical system to bear in reading these results. For example, my test environment is very basic and very cheap. It does not provide high-performance storage with multiple redundant I/O channels to an array of many physical disks striped to act as a single logical volume. Instead I just created some simple subdirectories. And so the cache space on each of the CAS workers is just 191 GB total - not 573 GB (=3 × 191 GB). That is, casworker01 has 191 GB of file system space where the CAS_DISK_CACHE is specified to use three sub-directories. And the same is true for casworker02. So then the workers of my small MPP CAS Server have a combined space of 382 GB for CAS_DISK_CACHE.
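That arithmetic can be sketched in Python. The sketch assumes each row can be tagged with an identifier for its underlying file system, so that directories sharing one file system are counted only once. The device names and the total_capacity_gb helper are made up for illustration.

```python
# Hypothetical rows: (node, file system id, path, capacity in GB).
# The device names are invented for illustration.
ROWS = [
    ("casworker01.sas.com", "/dev/sdb1", "/cas/cache01", 191),
    ("casworker01.sas.com", "/dev/sdb1", "/cas/cache02", 191),
    ("casworker01.sas.com", "/dev/sdb1", "/cas/cache03", 191),
    ("casworker02.sas.com", "/dev/sdb1", "/cas/cache01", 191),
    ("casworker02.sas.com", "/dev/sdb1", "/cas/cache02", 191),
    ("casworker02.sas.com", "/dev/sdb1", "/cas/cache03", 191),
]


def total_capacity_gb(rows):
    """Sum capacity once per distinct (node, file system) pair."""
    seen = {}
    for node, fs, _path, capacity in rows:
        seen[(node, fs)] = capacity
    return sum(seen.values())


if __name__ == "__main__":
    print(total_capacity_gb(ROWS))  # 382, not 1146
```

Keying on the node as well as the file system matters here because both workers could easily use the same device name for different physical storage.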

Let's consider what this might look like for a real, production environment. To that end, imagine each CAS Worker has 16 physical CPU cores (not virtual cores or hyper-threading). Depending on the storage technology available, there might be 16 physical disks (1 per CPU core) for CAS_DISK_CACHE to achieve good performance, with each disk formatted as a single file system (xfs or ext4) and directly attached to the host (not via NFS, NAS, or SAN). The paths to each of those would then be listed in env.CAS_DISK_CACHE. In that scenario, we'd expect to see 16 different lines for each CAS worker's cache in the report. And because each one would represent a unique file system space, you could add them all up together (unlike my very basic illustration).

But wait, we can get really hardcore about this. For a high number of concurrent users in SAS Visual Analytics (hundreds or more), there is evidence that defining two sub-directories per vCPU can provide some benefit as well (i.e., 4 sub-directories per disk/file system for CAS_DISK_CACHE). In that case, each CAS worker would have 64 directories for its cache. The total cache space wouldn't have changed from that of the 16 disks first imagined, but you'd have to take care in reading the report results to understand that.
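To make the counting concrete, here is a rough Python sketch of that layout. It assumes hyper-threading gives 2 vCPUs per physical core (which is what the 64-directory figure implies), and the /cas/diskNN/cacheNN mount points are purely hypothetical names. Joining the paths with a colon is likewise an assumption about how the list would be written for env.CAS_DISK_CACHE; check the SAS documentation for your release.

```python
PHYSICAL_CORES = 16        # from the scenario above
VCPUS_PER_CORE = 2         # assumption: hyper-threading enabled
DIRS_PER_VCPU = 2          # from the scenario above
DISKS = 16                 # one file system per physical core


def cache_dirs(disks=DISKS,
               total_dirs=PHYSICAL_CORES * VCPUS_PER_CORE * DIRS_PER_VCPU):
    """Spread the directories evenly across hypothetical mount points."""
    per_disk = total_dirs // disks
    return [
        f"/cas/disk{d:02d}/cache{n:02d}"
        for d in range(1, disks + 1)
        for n in range(1, per_disk + 1)
    ]


if __name__ == "__main__":
    dirs = cache_dirs()
    print(len(dirs))            # 64 directories in total
    print(len(dirs) // DISKS)   # 4 directories per disk
    print(":".join(dirs[:4]))   # the first disk's share of the path list
```

The report would then show 64 rows per worker, but only 16 distinct file systems' worth of capacity.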

SPECIAL NOTE: Instead of pointing CAS_DISK_CACHE to physical disks, there is an alternative: /dev/shm. That's a shared memory space utility provided by Linux systems - effectively placing the contents of CAS_DISK_CACHE in RAM.

The key benefit to this approach is that CAS_DISK_CACHE can perform at RAM speeds - something you're unlikely to approximate with any physical disk-based solution. The drawback is that you might consume RAM capacity much faster than when using physical disk if COPIES=1 (the default) or more, because the replicated blocks are kept in CAS_DISK_CACHE and so now occupy RAM as well. The idea behind COPIES is that replicating blocks will ensure table availability in case one or more CAS workers goes offline. You could set COPIES=0 to prevent that replication, but then you're disabling a significant enterprise feature of CAS.

So watch out - if you specify env.CAS_DISK_CACHE=/dev/shm and inadvertently overcommit the RAM capacity, then the OS will revert to its system paging file and everything on the host will slow e x c e s s i v e l y.

My recommendation is to use this approach carefully. It's ideal for well-behaved systems which consistently run within well-known parameters so that they can perform at the fastest speed possible. But for systems where the activities are mostly ad-hoc and/or require CAS' availability features, then look to physical disk for CAS_DISK_CACHE instead.

pyViyaTools

Viya supports having more than just one CAS server in the environment. You can have multiple CAS servers of different sizes, SMP or MPP, dedicated to specific tasks/users, etc. if needed. So then you might want to create multiple copies of that SAS program to get the cache statistics for each one. You could simply copy & paste the code over and over, changing the parameters to specify each CAS server. Or you could write some SAS Macro code to read from a parameter list, then generate and submit the code for each automatically.

But I thought it'd be nice to have an approach which can get those cache statistics for all CAS servers … and for it to do so automatically without having to enter any manual parameter information.

My GEL colleague Gerry Nelson introduced the GEL pyViyaTools to leverage Viya's REST APIs and extend them to glean new information to help with system monitoring and management. So I looked at the utilities already there, opened up a couple program files to get familiar with their coding practices, and then wrote my own pyViyaTool called listallcasservercachestatus.py.

The code to do this in Python follows many of the same steps as in SAS:

# Other preliminary tasks not shown here
# Identifies CAS servers and loops to connect to each

   # connect to each CAS server
   s = swat.CAS(serverhost, serverport)                
    
   # get CAS_DISK_CACHE usage
   s.accessControl.assumeRole(adminRole="superuser")  # superuser role reqd  
   results = s.builtins.getCacheInfo                  # returns dictionary, table       
    
   # display table with CAS_DISK_CACHE usage stats
   print(tabulate(results,headers='firstrow'))

When you run listallcasservercachestatus.py, you get familiar looking results:

[cloud-user@viyahost pyviyatools]$ ./loginviauthinfo.py 
Logging in with profile:  rocoll
Enter credentials for https://viyahost.sas.com:
Login succeeded. Token saved.

[cloud-user@viyahost pyviyatools]$ ./listallcasservercachestatus.py 
server,host-or-ip,port,restPort
cas-shared-default,cashost.sas.com,5570,8777
(u'diskCacheInfo', Result table containing CAS_DISK_CACHE information

                   node          path   FS_size   FS_free FS_usage
0       cashost.sas.com  /cas/cache01    191 Gb    144 Gb     24.5
1       cashost.sas.com  /cas/cache02    191 Gb    144 Gb     24.5
2       cashost.sas.com  /cas/cache03    191 Gb    144 Gb     24.5
3   casworker01.sas.com  /cas/cache01    191 Gb    160 Gb     15.9
4   casworker01.sas.com  /cas/cache02    191 Gb    160 Gb     15.9
5   casworker01.sas.com  /cas/cache03    191 Gb    160 Gb     15.9
6   casworker02.sas.com  /cas/cache01    191 Gb    161 Gb     15.6
7   casworker02.sas.com  /cas/cache02    191 Gb    161 Gb     15.6
8   casworker02.sas.com  /cas/cache03    191 Gb    161 Gb     15.6

cas-shared-2NDSERVER,cas2nd.sas.com,5570,8777
(u'diskCacheInfo', Result table containing CAS_DISK_CACHE information

                    node          path   FS_size   FS_free FS_usage
0         cas2nd.sas.com  /cas/cache01    101 Gb     94 Gb      6.9
1         cas2nd.sas.com  /cas/cache02    101 Gb     94 Gb      6.9
2   cas2worker01.sas.com  /cas/cache01    101 Gb     89 Gb     11.9
3   cas2worker01.sas.com  /cas/cache02    101 Gb     89 Gb     11.9
4   cas2worker02.sas.com  /cas/cache01    101 Gb     91 Gb     10.0
5   cas2worker02.sas.com  /cas/cache02    101 Gb     91 Gb     10.0

Notice now that there are two different CAS servers listed, each with its own set of CAS_DISK_CACHE utilization statistics. That's because my Viya deployment has a second CAS server that I installed. The listallcasservercachestatus.py utility is coded to get the list of CAS servers from the Viya REST API and then loop through them, connecting to each one to get its cache usage statistics.

As an aside, I think it's interesting that the Python-based code yields slightly different formatting than the SAS-based invocation of the builtins.getCacheInfo action, showing an obs count (but weirdly zero-based) and different column labels.
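The difference in headings looks like column names versus column labels. The mapping below is taken from the "describe results" log output reproduced in the comments on this post; the relabel helper itself is only a hypothetical illustration of how you might line the two views up.

```python
# Column name -> label pairs as reported by "describe results".
LABELS = {
    "node": "Node",
    "FS": "File System",
    "FS_size": "Capacity",
    "FS_free": "Free_Mem",
    "FS_usage": "%_Used",
    "path": "NodePath",
}


def relabel(row):
    """Return a copy of a row dict keyed by label instead of column name."""
    return {LABELS.get(name, name): value for name, value in row.items()}


if __name__ == "__main__":
    python_row = {"node": "cashost.sas.com", "path": "/cas/cache01",
                  "FS_size": "191 Gb", "FS_free": "144 Gb", "FS_usage": "24.5"}
    print(relabel(python_row))
```

Unknown column names pass through unchanged, so the helper won't break if a later release adds columns.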

Running the builtins.getCacheInfo action

In the SAS program code sample shown above, I was able to run the builtins.getCacheInfo action after successfully establishing a user session on the CAS server. Depending on your CAS client, you might need to establish the antecedent authentication and server identification parameters yourself (e.g. from batch code submission) or it might already be done for you (e.g. submitting code interactively from SAS Studio). So, from a SAS programmer's perspective, this is well understood and the example given should be sufficient to get you going.

Running the pyViyaTools, however, takes just a little more effort and consideration.

First of all, read the INSTALL.md file for instructions on how to clone (i.e. download) the pyViyaTools files to your Viya installation followed by the steps required to authenticate to Viya and get a login token. This is a crucial step so don't move on until you're successful.

At this point, you might be able to successfully run the listallcasservercachestatus.py tool. Or you might not. It depends on how your environment was set up and the topology of the Viya software deployment.

So if you can't, I've got a couple of troubleshooting tips to help:

If you see a complaint about a package not being available, then use pip to install it. In particular, the listallcasservercachestatus.py tool relies on "tabulate" and "swat" packages for Python which are not included by default:
- pip install tabulate
- pip install swat
Another failure you might see depending on your Viya deployment topology is an error about being unable to connect to the CAS controller host, specifically stating, "ERROR: SSL Error: Missing CA trust list". To correct this, there's a third environment variable you should define (similar to what was necessary for loginviauthinfo.py):
- export CAS_CLIENT_SSL_CA_LIST=/opt/sas/spre/config/etc/SASSecurityCertificateFramework/cacerts/trustedcerts.pem
  # See SAS documentation for more information.

The pip install steps should only be necessary once. But the environment variables and acquiring a login token are steps that you should expect to perform pretty regularly when working with the pyViyaTools, so consider placing those in a shell script to run all together.
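A shell script is the obvious choice, but the same idea can be sketched in Python to match the tools themselves. This version only sets the CA trust list variable shown above; any other variables your environment needs would have to be added. The build_env and run_all names are hypothetical, and nothing runs until you call run_all yourself.

```python
import os
import subprocess

# Trust list path taken from the tip above; adjust for your deployment.
CA_LIST = ("/opt/sas/spre/config/etc/SASSecurityCertificateFramework"
           "/cacerts/trustedcerts.pem")

# Run order matters: log in first, then query the CAS servers.
COMMANDS = [["./loginviauthinfo.py"], ["./listallcasservercachestatus.py"]]


def build_env(base=None):
    """Copy the environment and add the CAS client trust list variable."""
    env = dict(os.environ if base is None else base)
    env["CAS_CLIENT_SSL_CA_LIST"] = CA_LIST
    return env


def run_all(tool_dir):
    """Run each tool from the pyviyatools directory, stopping on failure."""
    env = build_env()
    for cmd in COMMANDS:
        subprocess.run(cmd, cwd=tool_dir, env=env, check=True)
```

You would call it as run_all("/path/to/pyviyatools"), substituting wherever you cloned the tools.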

Under the covers

If you've administered UNIX or Linux environments before, then the output from the builtins.getCacheInfo action might look familiar to you. It's very similar to the output provided from the df command-line utility. That's because they both use the statvfs() system call to get their information.

The point to this is just to say that builtins.getCacheInfo is returning the statistics for the entire file system where the CAS_DISK_CACHE is located ... not only the files that CAS has cached in its own sub-directories. This is a Very Good Thing because it means you're provided with information about the file system's total capacity and utilization, which is the significant bit to monitor.
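If you'd like to see the same kind of numbers straight from the OS, Python exposes that system call as os.statvfs. This is only a sketch of the idea: how CAS derives its own percentages isn't stated here, so the calculation below follows the df convention, and fs_usage is a hypothetical helper name.

```python
import os


def fs_usage(path):
    """Return (capacity_bytes, free_bytes, percent_used) for path's file system."""
    st = os.statvfs(path)
    capacity = st.f_frsize * st.f_blocks
    free = st.f_frsize * st.f_bavail      # space available to unprivileged users
    used = st.f_frsize * (st.f_blocks - st.f_bfree)
    pct_used = 100.0 * used / (used + free) if (used + free) else 0.0
    return capacity, free, pct_used


if __name__ == "__main__":
    size, free, pct = fs_usage("/")
    print(f"/  {size / 1e9:.0f} GB  {free / 1e9:.0f} GB free  {pct:.1f}% used")
```

Point it at one of your cache directories and every directory on the same file system will report identical values, just as in the action's output.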

Wrapping up

Keeping CAS well maintained and humming along is vital to ensuring happy user experiences with Viya. One task in doing so is ensuring that the file system(s) hosting CAS_DISK_CACHE are adequate in size (as well as I/O throughput). In addition to OS utilities, SAS provides internal tools, like the builtins action set and the getCacheInfo action, to aid in reporting on that space. And we can build on that using the pyViyaTools to automate that monitoring task across multiple CAS servers.

For more information about the related topics in this blog, refer to:

4 Rules to Understand CAS Management of In-Memory Data by Rob Collum
A New Path for CAS_DISK_CACHE by Rob Collum
Introducing the GEL pyviyatools by Gerry Nelson

And SAS documentation:

SAS Viya 3.5 System Programming Guide > List Cache Information for a CAS Session (CASL)

Don_Hayes · ‎04-27-2021

Awesome article, Rob. Thanks!

paterd2 · ‎01-03-2022

Great tool to see the cache details.

Thank you for that.

Dik

Bour9 · ‎06-26-2023

There's an error in the SAS Code

cas sess1; 

proc cas; 
    session sess1; 
    accessControl.assumeRole / adminRole="superuser";       
    builtins.getCacheInfo result=results; 
    describe results; 

/* You should not put a "run;" statement here. */
run;
    print results.diskCacheInfo; 
run;
quit; 

cas sess1 terminate;

RobCollum · ‎06-26-2023

@Bour9 ,

Thanks for the feedback and suggestion. I think this is just a stylistic difference - the run; statement in the middle of PROC CAS provides a slight separation in the log output. I agree with you that it's not needed syntactically, but I think it helps to clarify the source of some info in the log.

Without run; in the middle (notice line 87 is blank):

82 proc cas;
83 session sess1;
84 accessControl.assumeRole / adminRole="superuser";
85 builtins.getCacheInfo result=results;
86 describe results;
87
88 print results.diskCacheInfo;
89 run;
NOTE: Active Session now sess1.
dictionary ( 1 entries, 1 used);
[diskCacheInfo] Table ( [4] Rows [6] columns
Column Names:
[1] node [Node ] (varchar)
[2] FS [File System ] (varchar)
[3] FS_size [Capacity ] (char)
[4] FS_free [Free_Mem ] (char)
[5] FS_usage [%_Used ] (char)
[6] path [NodePath ] (varchar)
90 quit;
NOTE: The PROCEDURE CAS printed page 2.
NOTE: PROCEDURE CAS used (Total process time):
real time 0.17 seconds
cpu time 0.04 seconds
 
91
92 cas sess1 terminate;

With run; in the middle (see line 87):

82 proc cas;
83 session sess1;
84 accessControl.assumeRole / adminRole="superuser";
85 builtins.getCacheInfo result=results;
86 describe results;
87 run;
NOTE: Active Session now sess1.
dictionary ( 1 entries, 1 used);
[diskCacheInfo] Table ( [4] Rows [6] columns
Column Names:
[1] node [Node ] (varchar)
[2] FS [File System ] (varchar)
[3] FS_size [Capacity ] (char)
[4] FS_free [Free_Mem ] (char)
[5] FS_usage [%_Used ] (char)
[6] path [NodePath ] (varchar)
88 print results.diskCacheInfo;
89 run;
90 quit;
NOTE: The PROCEDURE CAS printed page 3.
NOTE: PROCEDURE CAS used (Total process time):
real time 0.15 seconds
cpu time 0.05 seconds
 
91
92 cas sess1 terminate;

This helps to clarify the log output to demonstrate that the "[diskcacheinfo]" (follows line 87, before line 88) is produced by the "describe results" statement - and not from the "print results.diskCacheInfo;" statement.

This usage of the run; statement is explained in the SAS® Viya® Platform Programming Documentation > CASL Reference > CAS Procedure > Run Statement.

Thanks again for sharing your suggestion!

Rob

paterd2 · ‎06-27-2023

Thanks for your code.

I made a studio job..

http://dikpater.blogspot.com/2023/05/sas-viya-cas-cache-cascache-disk-usage.html