SAS Cloud Analytic Services is - first and foremost - a high-performance in-memory analytics engine. In the ideal case, when operating with everything it needs in memory, CAS runs as fast as the system's electrons can pass through it. However, we don't often enjoy that ideal working environment. SAS Viya and CAS must often support uncoordinated workloads as a mix of users perform ad hoc tasks of varying complexity on data of undefined sizes. In such scenarios, having a safety mechanism to fall back on when RAM alone isn't sufficient can mean the difference between getting the job done and a crashed system. That's where the CAS_DISK_CACHE comes in…
CAS_DISK_CACHE is disk storage space primarily relied upon to act as part of a virtual memory scheme for CAS tables. It has other uses as well, such as block replication to ensure data availability if a CAS worker goes offline. Effectively, it acts as a buffer to relieve pressure on RAM and improve CAS's resiliency and availability in dealing with unanticipated short-term challenges.
Because CAS_DISK_CACHE is an important part of CAS operations, provision it adequately and then monitor and maintain it over time. And in Viya 3.5, we have a new tool available to help with monitoring: the builtins.getCacheInfo action.
The builtins.getCacheInfo action is, as the name implies, built into CAS as of Viya 3.5. You don't need to define or load it. It's already there, just waiting for you to use it. So, let's see it in action:
/* if needed */
*options cashost="cashost.sas.com" casport=5570 authinfo="/path/to/.authinfo";
*title2 "<cas server instance> -- cashost.sas.com:5570";

cas sess1;

proc cas;
   session sess1;
   accessControl.assumeRole / adminRole="superuser";
   builtins.getCacheInfo result=results;
   describe results;
run;
   print results.diskCacheInfo;
run;
quit;

cas sess1 terminate;
This code example shows how to call builtins.getCacheInfo:

- Optionally specifying the CAS server connection options, including an .authinfo file with administrator-level credentials that have the ability to assume the superuser role.
- Establishing a CAS session (cas sess1).
- Running PROC CAS, assuming the superuser privilege, invoking the builtins.getCacheInfo action, and capturing the resulting data.
The results that are returned in the SAS listing output will look similar to:
diskCacheInfo: Results from builtins.getCacheInfo

Node                  NodePath        Capacity  Free_Mem  %_Used
cashost.sas.com       /cas/cctemp01   100 Gb    86 Gb     13.9
cashost.sas.com       /cas/cctemp02   100 Gb    86 Gb     13.9
casworker01.sas.com   /cas/cache01    191 Gb    161 Gb    15.6
casworker01.sas.com   /cas/cache02    191 Gb    161 Gb    15.6
casworker01.sas.com   /cas/cache03    191 Gb    161 Gb    15.6
casworker02.sas.com   /cas/cache01    191 Gb    162 Gb    15.2
casworker02.sas.com   /cas/cache02    191 Gb    162 Gb    15.2
casworker02.sas.com   /cas/cache03    191 Gb    162 Gb    15.2
Of immediate note:

- The results are arranged by the CAS Controller's CAS_CONTROLLER_TEMP space followed by the CAS Workers' CAS_DISK_CACHE space, as defined per directory.
- The %_Used column is what you're most likely to want to monitor closely as an indicator that the cache space is becoming overloaded. I recommend watching very closely at 70% or more (see the monitoring sketch below). But keep in mind that, depending on the available space, you might be just one large table load away from filling it all the way.
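If you want to automate that watch, below is a minimal monitoring sketch using the SWAT package for Python (the same package used by the pyViyaTool later in this post). The host, port, .authinfo path, and the 70% threshold are illustrative values to replace with your own:

import swat

THRESHOLD = 70.0  # percent used - the "watch very closely" level suggested above

# connect to CAS and assume the superuser role (required for getCacheInfo)
s = swat.CAS("cashost.sas.com", 5570, authinfo="/path/to/.authinfo")
s.accessControl.assumeRole(adminRole="superuser")

# getCacheInfo returns a dictionary containing the diskCacheInfo table
cache = s.builtins.getCacheInfo()["diskCacheInfo"]

for _, row in cache.iterrows():
    # FS_usage comes back as a character value (e.g. "13.9"), so convert it
    if float(row["FS_usage"]) >= THRESHOLD:
        print("WARNING: {}:{} is {}% used".format(row["node"], row["path"], row["FS_usage"]))

s.close()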
Note that the builtins.getCacheInfo action makes no assumptions about any underlying file systems or storage arrangements. It's up to you to bring that intimate knowledge of the underlying physical system to bear in reading these results. For example, my test environment is very basic and very cheap. It does not provide high-performance storage with multiple redundant I/O channels to an array of many physical disks striped to act as a single logical volume. Instead, I just created some simple subdirectories. And so the cache space on each of the CAS workers is just 191 GB total - not 573 GB (= 3 × 191 GB). That is, casworker01 has 191 GB of file system space where the CAS_DISK_CACHE is specified to use three subdirectories. And the same is true for casworker02. So then the workers of my small MPP CAS server have a combined space of 382 GB for CAS_DISK_CACHE.
Let's consider what this might look like for a real, production environment. To that end, imagine each CAS Worker has 16 physical CPU cores (not virtual cores or hyperthreading). Depending on the storage technology available, there might then be 16 physical disks (1 per CPU core) for CAS_DISK_CACHE to achieve good performance, with each disk formatted as a single file system (xfs or ext4) and directly attached to the host (not via NFS, NAS, or SAN) - and the paths to each of those listed in env.CAS_DISK_CACHE. In that scenario, we'd expect to see 16 different lines for each CAS worker's cache in the report. And because each one would represent a unique file system space, you could add them all up together (unlike in my very basic illustration).
But wait, we can get really hardcore about this. For a high number of concurrent users in SAS Visual Analytics (hundreds or more), there is evidence that defining two subdirectories per vCPU can provide some benefit as well (i.e., 4 subdirectories per disk/file system for CAS_DISK_CACHE). In that case, each CAS worker would have 64 directories for its cache. The total cache space wouldn't have changed from the 16-disk scenario first imagined, but you'd have to take care in reading the report results to understand that - see the sketch below.
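Here's one way to handle that, continuing the SWAT-based Python sketch from above. Besides the columns shown in the report, the action also returns an FS (File System) column identifying the underlying file system for each path, which lets us deduplicate before summing. This assumes the cache SASDataFrame from the earlier sketch:

# rows with the same node + FS describe the same underlying file system,
# so a naive per-row sum of FS_size would overcount the capacity
unique_fs = cache.drop_duplicates(subset=["node", "FS"])

# FS_size is a character value such as "191 Gb" - parse out the numeric part
total_gb = sum(float(size.split()[0]) for size in unique_fs["FS_size"])
print("Total CAS_DISK_CACHE capacity: {:.0f} Gb".format(total_gb))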
SPECIAL NOTE: Instead of pointing CAS_DISK_CACHE to physical disks, there is an alternative: pointing it at a RAM-backed location (such as a tmpfs file system). The key benefit to this approach is that CAS_DISK_CACHE can perform at RAM speeds - something you're unlikely to approximate with any physical disk-based solution. The drawback to this approach is that you might consume RAM capacity much faster compared to using physical disk, since the cache and the in-memory tables now draw from the same pool. So watch out - if you specify a RAM-backed location, make sure there's enough physical memory to cover both. My recommendation is to use this approach carefully. It's ideal for well-behaved systems which consistently run within well-known parameters so that they can perform at the fastest speed possible. But for systems where the activities are mostly ad hoc and/or require CAS' availability features, look to physical disk for CAS_DISK_CACHE instead.
Viya supports having more than just one CAS server in the environment. You can have multiple CAS servers of different sizes, SMP or MPP, dedicated to specific tasks/users, etc. if needed. So then you might want to create multiple copies of that SAS program to get the cache statistics for each one. You could simply copy & paste the code over and over, changing the parameters to specify each CAS server. Or you could write some SAS Macro code to read from a parameter list, then generate and submit the code for each automatically.
But I thought it'd be nice to have an approach which can get those cache statistics for all CAS servers … and for it to do so automatically without having to enter any manual parameter information.
My GEL colleague Gerry Nelson introduced the GEL pyViyaTools to leverage Viya's REST APIs and extend them to glean new information to help with system monitoring and management. So I looked at the utilities already there, opened up a couple of program files to get familiar with their coding practices, and then wrote my own pyViyaTool called listallcasservercachestatus.py.
The code to do this in Python follows many of the same steps as in SAS:
# Other preliminary tasks not shown here
# (identify the CAS servers and loop to connect to each)
import swat
from tabulate import tabulate

# connect to a CAS server
s = swat.CAS(serverhost, serverport)

# get CAS_DISK_CACHE usage
s.accessControl.assumeRole(adminRole="superuser")   # superuser role reqd
results = s.builtins.getCacheInfo()                 # returns a dictionary containing a table

# display table with CAS_DISK_CACHE usage stats
print(tabulate(results["diskCacheInfo"], headers="keys"))
When you run listallcasservercachestatus.py, you get familiar-looking results:
[cloud-user@viyahost pyviyatools]$ ./loginviauthinfo.py
Logging in with profile: rocoll
Enter credentials for https://viyahost.sas.com:
Login succeeded. Token saved.

[cloud-user@viyahost pyviyatools]$ ./listallcasservercachestatus.py
server,host-or-ip,port,restPort
cas-shared-default,cashost.sas.com,5570,8777
(u'diskCacheInfo', Result table containing CAS_DISK_CACHE information

                  node          path FS_size FS_free FS_usage
0      cashost.sas.com  /cas/cache01  191 Gb  144 Gb     24.5
1      cashost.sas.com  /cas/cache02  191 Gb  144 Gb     24.5
2      cashost.sas.com  /cas/cache03  191 Gb  144 Gb     24.5
3  casworker01.sas.com  /cas/cache01  191 Gb  160 Gb     15.9
4  casworker01.sas.com  /cas/cache02  191 Gb  160 Gb     15.9
5  casworker01.sas.com  /cas/cache03  191 Gb  160 Gb     15.9
6  casworker02.sas.com  /cas/cache01  191 Gb  161 Gb     15.6
7  casworker02.sas.com  /cas/cache02  191 Gb  161 Gb     15.6
8  casworker02.sas.com  /cas/cache03  191 Gb  161 Gb     15.6

cas-shared-2NDSERVER,cas2nd.sas.com,5570,8777
(u'diskCacheInfo', Result table containing CAS_DISK_CACHE information

                   node          path FS_size FS_free FS_usage
0        cas2nd.sas.com  /cas/cache01  101 Gb   94 Gb      6.9
1        cas2nd.sas.com  /cas/cache02  101 Gb   94 Gb      6.9
2  cas2worker01.sas.com  /cas/cache01  101 Gb   89 Gb     11.9
3  cas2worker01.sas.com  /cas/cache02  101 Gb   89 Gb     11.9
4  cas2worker02.sas.com  /cas/cache01  101 Gb   91 Gb     10.0
5  cas2worker02.sas.com  /cas/cache02  101 Gb   91 Gb     10.0
Notice now that there are two different CAS servers listed, each with its own set of CAS_DISK_CACHE utilization statistics. That's because my Viya deployment has a second CAS server that I installed, and the listallcasservercachestatus.py utility is coded to get the list of CAS servers from the Viya REST API and then loop through connecting to all of them to get their cache usage statistics. A simplified sketch of that loop follows below.
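For the curious, here's a minimal sketch of that pattern. It assumes the Viya casManagement REST endpoint (/casManagement/servers) and an OAuth token like the one saved by loginviauthinfo.py; the real tool adds profile handling, SSL configuration, and tidier output:

import requests
import swat
from tabulate import tabulate

VIYA_URL = "https://viyahost.sas.com"   # illustrative host
TOKEN = "..."                           # e.g. the token saved by loginviauthinfo.py

# ask Viya for the list of registered CAS servers
resp = requests.get(VIYA_URL + "/casManagement/servers",
                    headers={"Authorization": "Bearer " + TOKEN})
resp.raise_for_status()

# connect to each CAS server in turn and pull its cache statistics
for server in resp.json()["items"]:
    print(server["name"], server["host"], server["port"])
    s = swat.CAS(server["host"], server["port"])
    s.accessControl.assumeRole(adminRole="superuser")
    print(tabulate(s.builtins.getCacheInfo()["diskCacheInfo"], headers="keys"))
    s.close()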
As an aside, I think it's interesting that the Python-based code yields slightly different formatting than the SAS-based invocation of the builtins.getCacheInfo action, showing an obs count (but weirdly zero-based) and different column labels.
In the SAS program code sample shown above, I was able to run the builtins.getCacheInfo action after successfully establishing a user session on the CAS server. Depending on your CAS client, you might need to establish the antecedent authentication and server identification parameters yourself (e.g. from batch code submission) or it might already be done for you (e.g. submitting code interactively from SAS Studio). So, from a SAS programmer's perspective, this is well understood and the example given should be sufficient to get you going.
Running the pyViyaTools, however, takes just a little more effort and consideration.
First of all, read the INSTALL.md file for instructions on how to clone (i.e. download) the pyViyaTools files to your Viya installation, followed by the steps required to authenticate to Viya and get a login token. This is a crucial step, so don't move on until you're successful.
At this point, you might be able to successfully run the listallcasservercachestatus.py tool. Or you might not. It depends on how your environment was set up and the topology of the Viya software deployment.
So if you can't, I've got a couple of troubleshooting tips to help:

- The listallcasservercachestatus.py tool relies on the "tabulate" and "swat" packages for Python, which are not included by default. Install them with:

  pip install tabulate
  pip install swat

- You might see the message "ERROR: SSL Error: Missing CA trust list". To correct this, there's a third environment variable you should define (similar to what was necessary for loginviauthinfo.py):

  export CAS_CLIENT_SSL_CA_LIST=/opt/sas/spre/config/etc/SASSecurityCertificateFramework/cacerts/trustedcerts.pem
The pip install steps should only be necessary once. But the environment variables and acquiring a login token are steps that you should expect to perform pretty regularly when working with the pyViyaTools, so consider placing those in a shell script to run all together.
If you've administered UNIX or Linux environments before, then the output from the builtins.getCacheInfo action might look familiar to you. It's very similar to the output provided by the df command-line utility. That's because they both use the statvfs() system call to get their information.
The point of this is just to say that builtins.getCacheInfo is returning the statistics for the entire file system where the CAS_DISK_CACHE is located ... not only the files that CAS has cached in its own sub-directories. This is a Very Good Thing because it means you're provided with information about the file system's total capacity and utilization, which is the significant bit to monitor.
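You can see the same numbers for yourself with Python's standard-library wrapper around that very system call. A quick sketch - /cas/cache01 is just an example path from the reports above, and the simplified percentage math here may round slightly differently than df does:

import os

st = os.statvfs("/cas/cache01")
total = st.f_frsize * st.f_blocks   # file system capacity in bytes
free = st.f_frsize * st.f_bavail    # bytes available to unprivileged users

print("Capacity: {:.0f} Gb, Free: {:.0f} Gb, Used: {:.1f}%".format(
    total / 2**30, free / 2**30, 100.0 * (1 - free / total)))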
Keeping CAS well maintained and humming along is vital to ensuring happy user experiences with Viya. One task in doing so is ensuring that the file system(s) hosting CAS_DISK_CACHE are adequate in size (as well as I/O throughput). In addition to OS utilities, SAS provides internal tools, like the builtins action set and the getCacheInfo action, to aid in reporting on that space. And we can extend on that using the pyViyaTools to automate that monitoring task across multiple CAS servers.
Awesome article, Rob. Thanks!
Great tool to see the cache details.
Thank you for that.
Dik
There's an error in the SAS Code
cas sess1;
proc cas;
   session sess1;
   accessControl.assumeRole / adminRole="superuser";
   builtins.getCacheInfo result=results;
   describe results;
/* You should not put a "run;" statement here. */
run;
   print results.diskCacheInfo;
run;
quit;
cas sess1 terminate;
@Bour9 ,
Thanks for the feedback and suggestion. I think this is just a stylistic difference - the run; statement in the middle of PROC CAS provides a slight separation in the log output. I agree with you that it's not needed syntactically, but I think it helps to clarify the source of some info in the log.
Without run; in the middle (notice line 87 is blank):
82 proc cas;
83 session sess1;
84 accessControl.assumeRole / adminRole="superuser";
85 builtins.getCacheInfo result=results;
86 describe results;
87
88 print results.diskCacheInfo;
89 run;
NOTE: Active Session now sess1.
dictionary ( 1 entries, 1 used);
[diskCacheInfo] Table ( [4] Rows [6] columns
Column Names:
[1] node [Node ] (varchar)
[2] FS [File System ] (varchar)
[3] FS_size [Capacity ] (char)
[4] FS_free [Free_Mem ] (char)
[5] FS_usage [%_Used ] (char)
[6] path [NodePath ] (varchar)
90 quit;
NOTE: The PROCEDURE CAS printed page 2.
NOTE: PROCEDURE CAS used (Total process time):
real time 0.17 seconds
cpu time 0.04 seconds
91
92 cas sess1 terminate;
With run; in the middle (see line 87):
82 proc cas;
83 session sess1;
84 accessControl.assumeRole / adminRole="superuser";
85 builtins.getCacheInfo result=results;
86 describe results;
87 run;
NOTE: Active Session now sess1.
dictionary ( 1 entries, 1 used);
[diskCacheInfo] Table ( [4] Rows [6] columns
Column Names:
[1] node [Node ] (varchar)
[2] FS [File System ] (varchar)
[3] FS_size [Capacity ] (char)
[4] FS_free [Free_Mem ] (char)
[5] FS_usage [%_Used ] (char)
[6] path [NodePath ] (varchar)
88 print results.diskCacheInfo;
89 run;
90 quit;
NOTE: The PROCEDURE CAS printed page 3.
NOTE: PROCEDURE CAS used (Total process time):
real time 0.15 seconds
cpu time 0.05 seconds
91
92 cas sess1 terminate;
This helps to clarify the log output by demonstrating that the "[diskCacheInfo]" description (following line 87, before line 88) is produced by the "describe results;" statement - and not by the "print results.diskCacheInfo;" statement.
This usage of the run; statement is explained in the SAS® Viya® Platform Programming Documentation > CASL Reference > CAS Procedure > Run Statement.
Thanks again for sharing your suggestion!
Rob
Thanks for your code.
I made a Studio job:
http://dikpater.blogspot.com/2023/05/sas-viya-cas-cache-cascache-disk-usage.html