
When is CAS_DISK_CACHE used?


As Rob Collum said in his post ("Provisioning CAS_DISK_CACHE for SAS Viya"), CAS_DISK_CACHE (or the physical paths behind it) represents the disk-based backing store of CAS. How much disk space should you dedicate to CAS_DISK_CACHE? It’s complicated to estimate. Let’s look at all the different scenarios where CAS stores data in CAS_DISK_CACHE to get an idea of just how big it can get.

 

CAS_DISK_CACHE usage scenarios:

 

1 – Loading data/creating tables in CAS

 

Basically, any operation that loads or creates data directly in CAS uses the CAS_DISK_CACHE location for on-disk caching. This could be:

  • Client-side load using the LOAD DATA= (or FILE=) statement of PROC CASUTIL or the addTable CAS action
  • Server-side load using the LOAD CASDATA= statement of PROC CASUTIL or the loadTable CAS action
  • Any SAS step that creates tables: DATA step, FedSQL, DS2, any procedure that outputs data
  • Any tool that outputs data, like Data Studio

The blocks in CAS_DISK_CACHE are the same as the ones in memory (one-to-one). They support the on-demand movement of data blocks in and out of memory.

This applies to all non-SASHDAT source files (legacy SAS7BDAT, CSV, DBMS, etc.). But some situations involving SASHDAT files also use CAS_DISK_CACHE.
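For illustration, here is a minimal sketch of a server-side load (the session, caslib, and file names are hypothetical); the resulting in-memory table is backed by blocks written to CAS_DISK_CACHE:

cas mySession;                                   /* start a CAS session                              */

proc casutil sessref=mySession;
   /* server-side load of a CSV file from a pre-defined PATH caslib ("landing" is hypothetical): */
   /* the table's blocks are cached in CAS_DISK_CACHE                                            */
   load casdata="sales.csv" incaslib="landing"
        outcaslib="landing" casout="sales";
quit;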

So, what are the situations where CAS_DISK_CACHE is not used during a load process? When loading data from:

  • SASHDAT files in a PATH CASLIB on an SMP CAS environment
  • SASHDAT files in a co-located HDFS CASLIB on an MPP CAS environment
  • SASHDAT files in a DNFS CASLIB on an MPP CAS environment

Why? Those SASHDAT files can be directly memory-mapped by CAS and thus represent the backing store. There is no need to copy blocks to the CAS_DISK_CACHE location in that case.

Still with me? Good! Now that we understand the rule (CAS uses CAS_DISK_CACHE) and the exception to the rule (except when SASHDAT source), let’s dive into the exceptions to the exceptions to the rule…

Loading SASHDAT files from a PATH CASLIB on an MPP environment, or from a remote HDFS CASLIB on an MPP environment, DOES use CAS_DISK_CACHE. This is because the blocks cannot be directly memory-mapped by the CAS workers.

 

1a – Loading from SASHDAT using row filters or column selection

 

Even when you think you are NOT going to use CAS_DISK_CACHE, CAS still might use it.

When you load data into CAS, you might not want all the observations or all the columns of a data source: maybe only the records that match a specific filter, or only a subset of the variables. You can do that by specifying a WHERE clause or a VARS (variable list) option in the loading process (for example in CASUTIL). In that case, even if CAS_DISK_CACHE is not supposed to be used (SASHDAT files on co-located HDFS, DNFS, or SMP-only PATH locations), it is used to cache the data source subset. Because CAS has to evaluate the WHERE clause on each row, and because the output table differs from the original SASHDAT source file, CAS_DISK_CACHE is used as the backing store for that table: CAS cannot rely on the original SASHDAT file to reload missing blocks.

The space taken in CAS_DISK_CACHE depends on the number of records that pass the filter and on the number of columns selected, and so it is smaller than the original SASHDAT file.
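As a sketch (the caslib, table, and column names are hypothetical, and the exact WHERE=/VARS= syntax may vary by release), this kind of filtered load from a SASHDAT file forces the subset to be cached in CAS_DISK_CACHE:

proc casutil sessref=mySession;
   /* load only the EMEA rows and three columns of the SASHDAT file:    */
   /* the resulting subset is backed by CAS_DISK_CACHE, not the source  */
   load casdata="sales.sashdat" incaslib="dnfsdata"
        outcaslib="casuser" casout="sales_emea"
        where="region='EMEA'"
        vars=("region" "product" "revenue");
quit;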

 

1b – Loading from SASHDAT using encryption

 

You might want to save data as encrypted SASHDAT files on co-located HDFS or DNFS or SMP-only PATH locations (focusing on the cases where CAS_DISK_CACHE is not used during loading processes).

In these situations, loading an encrypted SASHDAT file uses CAS_DISK_CACHE. The data blocks are not loaded into memory in their encrypted form: they are decrypted into CAS_DISK_CACHE.

 

2 – Failover (COPIES)

 

The backup copies of the data blocks that are needed for failover are also stored in CAS_DISK_CACHE.

This applies to all tables that use CAS_DISK_CACHE as the backing store in MPP environments (there are no COPIES in SMP). Tables that don't use CAS_DISK_CACHE as the primary backing store (remember: SASHDAT files in co-located HDFS on MPP, SASHDAT files in DNFS) don't use it for copies either. They rely on the HDFS or DNFS capabilities for retrieving and reloading missing blocks.
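For example, requesting extra copies at load time looks like the sketch below (caslib and file names are hypothetical); each redundant copy of the blocks also lands in CAS_DISK_CACHE:

proc casutil sessref=mySession;
   /* copies=2 keeps 2 redundant copies of each block for failover (MPP only) */
   load casdata="transactions.csv" incaslib="landing"
        outcaslib="landing" casout="transactions"
        copies=2;
quit;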

The following figure depicts normal CAS_DISK_CACHE usage for copies:

nir_post22_01_copies.png

 

3 – Appending data

 

Appending data to a CAS table is supported in CAS.

Appending data behaves like loading data or creating tables in CAS from a CAS_DISK_CACHE perspective. The data blocks that hold the new records are written to disk in the CAS_DISK_CACHE location.

When appending, CAS_DISK_CACHE is used regardless of the target table’s backing store. For CAS tables that already use CAS_DISK_CACHE, new blocks are created next to the existing ones. For CAS tables that don't use CAS_DISK_CACHE (SASHDAT files on co-located HDFS, DNFS, or SMP-only PATH locations), the new (appended) blocks are written to CAS_DISK_CACHE. In this “hybrid map” case, unmodified blocks remain mapped to their source (i.e. co-located HDFS, DNFS, or SMP-only PATH) while the new appended blocks are mapped to their newly created locations in CAS_DISK_CACHE.
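A minimal sketch of an append through the CAS engine (library and table names are hypothetical); the appended blocks go to CAS_DISK_CACHE even if the target table is memory-mapped from SASHDAT:

libname mycas cas sessref=mySession;   /* CAS engine libref                         */

data mycas.sales(append=yes);          /* append=yes adds the new rows to the       */
   set work.new_sales;                 /* existing CAS table instead of replacing it */
run;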

The following figure depicts this "hybrid-map":

nir_post22_02_append.png

 

4 – Updating data

 

Updating data in place is limited to the table.update CAS action. It updates certain columns of the rows of a CAS table that match a given WHERE clause.

To keep things simple: each data block that contains rows to be updated (according to the WHERE clause) is replicated in memory, updated, and then replaces the original block.

If the CAS table already uses CAS_DISK_CACHE, then the new blocks are written to CAS_DISK_CACHE and, right after, the old blocks are removed from it. So, for a limited period of time, you can observe overhead in CAS_DISK_CACHE usage.

If the CAS table does not use CAS_DISK_CACHE (SASHDAT on co-located HDFS, DNFS or SMP-only PATH), then, right after the update, all the table blocks are persisted in CAS_DISK_CACHE. So, the table switches from no CAS_DISK_CACHE usage to CAS_DISK_CACHE usage.
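As a sketch of the table.update action (the caslib, table, column names, and WHERE clause are hypothetical):

proc cas;
   /* set the discount column to 0.1 on every row matching the WHERE clause; */
   /* the updated blocks are written to CAS_DISK_CACHE                       */
   table.update /
      table={caslib="casuser", name="sales", where="region='EMEA'"},
      set={
         {var="discount", value="0.1"}
      };
quit;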

 

5 – Partitioning data

 

Partitioning is the process of organizing a CAS table in memory across multiple nodes according to variables identified as key variables. This facilitates and accelerates operations like joins, group-by queries, or any CAS action that requires BY variables.

Partitioning can be run persistently (you decide to organize a CAS table in a partitioned way) or CAS can do it on the fly (this is called "by-group processing") when it needs to.

Persistent partitioning is simply creating or overwriting a CAS table, so it naturally uses CAS_DISK_CACHE. If the CAS table already uses CAS_DISK_CACHE, the partitioning creates a new set of blocks in CAS_DISK_CACHE with the data partitioned accordingly and, once finished, removes the old blocks (assuming you replace the original table with its partitioned version; you can obviously create a second CAS table instead). If the original unpartitioned CAS table does not use CAS_DISK_CACHE, its new partitioned version does, because it is a brand new CAS table: there is no longer a mapping between the CAS table and the SASHDAT source file in this case.
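A minimal sketch of persistent partitioning with the table.partition action (names are hypothetical); the partitioned output table is backed by CAS_DISK_CACHE:

proc cas;
   /* create a new CAS table partitioned by the region column */
   table.partition /
      table={caslib="casuser", name="sales", groupBy={"region"}},
      casout={caslib="casuser", name="sales_by_region", replace=true};
quit;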

 

5b – On-the-fly partitioning (aka BY-GROUP processing or auto-partitioning)

 

Some CAS processes and/or VA analyses may create temporary partitioned copies of CAS tables prior to execution for BY-GROUP optimization (e.g. a FedSQL join). On-the-fly partitioning uses CAS_DISK_CACHE temporarily to hold the partitioned copy required by the running CAS action. The corresponding data blocks are removed from CAS_DISK_CACHE immediately after the CAS action ends.

 

6 – Full table replication

 

Some operations in CAS work better with a single server operating against a complete copy of the data table. To optimize these scenarios, CAS offers full table replication.

 

6a – Explicit replication

 

One can explicitly and persistently replicate a table in CAS using the REPEAT/DUPLICATE options. The resulting tables are called "repeated" tables. As these options cannot be used for server-side loading, repeated tables always use CAS_DISK_CACHE. As an example, a 100MB CAS table on a 100-worker CAS environment will use 10,000MB (100 x 100MB) of space.
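One way to create a repeated table from the client side is sketched below, assuming the DUPLICATE= data set option of the CAS engine (library and table names are hypothetical):

libname mycas cas sessref=mySession;

data mycas.region_dim(duplicate=yes);   /* duplicate=yes keeps a full copy of the     */
   set work.region_dim;                 /* table on every CAS worker (repeated table) */
run;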

 

6b - Replicating the smallest table (join optimization)

 

For join optimization, CAS will often replicate the smaller of the two tables being joined. This “replicating the smallest table” is the second type of data movement that can happen behind the scenes during some SQL join operations or special analytics processing. On-the-fly replication uses CAS_DISK_CACHE temporarily to hold the replicated copy required by the running CAS action.

As with on-the-fly partitioning, the data blocks in CAS_DISK_CACHE are removed immediately after the end of the CAS action.
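For instance, a FedSQL join like the sketch below (table and column names are hypothetical) may cause CAS to replicate the small dimension table behind the scenes, temporarily consuming CAS_DISK_CACHE:

proc fedsql sessref=mySession;
   create table casuser.sales_enriched as
   select s.*, d.region_name
   from casuser.sales s
   inner join casuser.region_dim d      /* small table: a candidate for replication */
      on s.region_id = d.region_id;
quit;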

 

6c – Analytics processing / VA analysis

 

Viya and CAS provide many analytical methods to help solve various business problems. While some of these procedures completely replicate their input tables on every worker node, many are more complicated in how they process data. Depending on the analytical algorithm, CAS can behave differently (extracted from the documentation):

  • Some algorithms run only in SMP mode, even if you have an MPP CAS environment. A particular CAS worker is chosen randomly to be the processing machine and the data from all other machines is moved over to it. CAS_DISK_CACHE is then used on this CAS worker to hold the cached copy of the CAS table, even if the original table doesn't use CAS_DISK_CACHE. An example is the MINSPANTREE statement in the OPTNETWORK procedure.
  • Other algorithms run in MPP mode on a portion of the data, but the data needs to be organized appropriately first. This is the on-the-fly by-group processing described earlier, and it uses CAS_DISK_CACHE.
  • Finally, some algorithms run concurrently in MPP mode: the analytical algorithm tries to solve the same problem on each grid node with different algorithm options. Consequently, the CAS table is replicated on all the nodes, which makes heavy use of CAS_DISK_CACHE. An example is the OPTMILP procedure in CONCURRENT mode.

SAS Optimization, where those examples come from, is a good example of various algorithms using CAS in different ways. There are probably many additional examples.

 

Additional considerations

 

Loading data into CAS from compressed SASHDAT files located on co-located HDFS, DNFS, or SMP-only PATH does not use CAS_DISK_CACHE. In these cases, SASHDAT files are lifted into memory as they are. But if you use a filter while loading, then you fall back into the filtering behavior (#1a).

Note also that each time you create an output table in CAS to store your results, you use CAS_DISK_CACHE. I already mentioned that in #1, but it's worth mentioning again.

 

Concluding thoughts

 

What does all this mean? How much CAS_DISK_CACHE do you need? Well, let’s look at an example to get an idea.

Say you have a 100MB table.

  • When you load it, it will take up ~100MB of space in CAS_DISK_CACHE.
  • Say you keep 2 additional copies of the data blocks for failover (the COPIES option). Now, you’ll have ~300MB.
  • The table will get used in various joins and other processes, so it might be auto-partitioned by a few different join keys (say 2). Assuming these processes run concurrently, that adds another ~200MB (~100MB for each auto-partition scheme). So now we’re up to ~500MB peak.
  • The table will also be used by a few different analytical procedures and/or joins that fully replicate it across every worker. If we have 10 workers, that's ~100MB x 10 = ~1000MB. Assuming concurrency of different analytic processes, the auto-partitioning might occur at the same time as the replication. So, adding this to our running total, we get a ~1500MB peak.

While extreme, the above example shows how CAS_DISK_CACHE can get a lot bigger than we might expect. Remember to always contact SAS for sizing.

 

Summary

 

Use of CAS_DISK_CACHE? (Y/N)

Scenario                               SASHDAT file in SMP-only PATH,    All other cases
                                       MPP co-located HDFS, or
                                       MPP DNFS CASLIB
-------------------------------------  --------------------------------  ---------------
Load data w/o changes                  N                                 Y
Load subset of data (WHERE or VARS)    Y                                 Y
Encryption                             Y                                 Y
Compression                            N                                 Y
Copies (MPP-only)                      N                                 Y
Append                                 Y                                 Y
Update                                 Y                                 Y
Partition/By-group                     Y                                 Y
Replication                            Y                                 Y

 

Notice that a few options may affect how CAS_DISK_CACHE is used: block size (MAXTABLEMEM), CAS table scope (session or global), and COPIES.

Comments

I'm having a hard time finding the CAS cache. For my experiment I am loading a srctype="PATH" 100 GB SASHDAT file. It was created with compression set. I bumped copies up to 5, and see in gridmon that there's 500 GB of owned disk space. I found where we have CAS_DISK_CACHE set, checking from the vars.yaml config all the way down to the individual worker node_usermods.lua files. But I've logged on to all the machines in our Viya environment, checked the CAS_DISK_CACHE locations, as well as /tmp to be safe, and I can only find a few MB here and there. Pretty sure we have MPP because we have multiple CAS worker nodes. Any thoughts please? Thanks...

Hello @hrczeglo 

 

How do you check the contents of the CAS_DISK_CACHE? Because CAS_DISK_CACHE files are hidden, you cannot see them using standard commands. You can start checking the filesystem usage before and after the load. Also, check if you really are in MPP mode (when you start a CAS session, check the number of workers assigned) otherwise you might hit the direct memory-mapping use case.

 

Regards,

Nicolas.

@NicolasRobert Thanks for responding. I've tried "find", "du" & "ls -a" on the directories, with sudo.

 

I can see the 500 GB usage via the instructions here:

 

SAS Help Center: List CAS Disk Cache Information

 

but don't see any files in the directory through shell.

As I said earlier, you cannot see them because they are hidden.

The only way to see them is to use the lsof command on the cas process. But depending on your environment, it might be tricky to run it. Here are some examples:

 

# PID of the user's CAS session process (assumes ${userid} holds the user's login name)
sessionPID=`ps -u ${userid} -o user:12,pid,ppid,stime,cmd | /usr/bin/grep cas | grep -v grep | awk '{print $2}'`
# PID of the CAS controller process (runs under the "cas" account)
masterPID=`ps -u cas -o user:12,pid,ppid,stime,cmd | grep "cas join" | awk '{print $2}'`

echo
echo "*** Session CAS tables: files in CAS_DISK_CACHE ***"
echo
# lsof +L1 lists open files with a link count below 1, i.e. the unlinked (hidden) cache files
lsof -a +L1 -p ${sessionPID} | grep _${sessionPID}_

echo
echo "*** Global CAS tables created in this CAS session: files in CAS_DISK_CACHE ***"
echo
sudo -u cas lsof -a +L1 -p ${masterPID} | grep _${sessionPID}_
