CAS Indexes, First Look (Viya 3.3)

2 Likes

There are quite a few notable data management related enhancements in CAS for Viya 3.3. One such improvement is the new CAS table index functionality. Let's look at it here so we can better understand how and when to utilize it.

.

Creating CAS Indexes

Currently CAS indexes can only be created via CAS Actions. Any action that can create a CAS table (includes a CASOUT option) can also index that table. As an example, the loadTable action is shown below.

Proc cas;
Session mySession;
   table.loadTable /           
      path="prdsale.sas7bdat"
      casOut={name="prdsaleInd" indexVars={"key"} replace=True}; 
run;

Additionally there is a dedicated Index action within the Table action set.

Proc cas;
Session mySession;
  table.index /
    table="prdsale"
    casOut={name="prdsaleInd" indexVars={"key"} replace=True}; 
run;

NB:

In this release, neither CASUTIL nor SPRE (DATA Step, Procedures, etc.) support CAS index creation.

So, client-side sources will have to be loaded and then indexed in a two step process.

.

Using CAS Indexes

In the 3.3 release of Viya, CAS indexes are only used by subsetting WHERE filters. Below shows the same WHERE filter applied to the same 9 million row CAS table with and without indexes. Note that the indexed version is an order of magnitude faster than the un-indexed version.

While not classic B-Tree indexes, CAS indexes are similarly best used on high to medium cardinality fields (fields with lots of unique values). There should always be a measureable reduction in CPU time when using a WHERE filter to subset on such indexed CAS Table column(s). Such CPU time reductions when using WHERE on very large CAS Tables are very important when multiple CAS Sessions are each running resource-intensive CAS Actions. Measures of Elapsed Time reduction depend on one or more the following:

How busy is the SMP machine or CAS MPP nodes running the action using a WHERE filter?
How much of the indexed table is already RAM-resident?
Is there enough RAM for the table to all fit in memory?
How many CAS Workers are assigned to the Session running the CAS Action using a WHERE filter?
What is the distribution of CAS Table Blocks among these MPP workers handling the query?
What is the total time expended on WHERE filtering versus the remaining analytics computations, etc. that the corresponding CAS Action does?

.

NB:....

In Viya 3.3, CAS indexes are not used in join optimization, for BY-Group processing, or by other process algorithms (e.g Transpose, Machine Learning). They are only used in sub-setting operations. So, they should only be applied when sub-setting is expected. For BY-Group optimization, utilize CAS partitioning and ordering instead of CAS indexes.

.

Physical Manifestation

Unlike other implementations, CAS indexes are stored inside their respective tables. There are no separate index files (e.g. no .sashbndx files). While it's not easy to see how CAS tables physically manifest, if you did go looking for the indexes, you would just see more data blocks. Similarly, when saved to SASHDAT, indexes manifest within the SASHDAT table. So again, there will be no separate index files.

.

Size Considerations

Since we expect to see more data blocks, it should be no surprise that CAS indexes increase the overall size of CAS tables. As with any index implementation, CAS indexes improve data retrieval speed at the expense of storage space.

The size of the indexes depends mainly on field size and cardinality (the number of unique values) as well as the number of rows in the table.

For reference, here are the tableDetails output from two different indexing strategies. The first shows a single index on a unique, numeric field on a 9 million row table. The second one shows a non-unique index on a character field on the same table. Indexing the unique field consumes about 117MB while indexing the non-unique field (48 unique values) only takes 49MB. While every situation is different, this should give you an idea of the order of magnitude expected.