SAS Viya 3.4 Case Study: Optimizing CAS Tables for VA

2 Likes

In a previous post, I gave some specific strategies for optimizing Visual Analytics reports by optimizing the underlying CAS tables. Let's take another look at this topic, but this time let's work from a specific example with Viya 3.4.

In that article, I listed a number of techniques for optimizing CAS tables including:

Partitioning / Sorting
Reducing Table Width (record length):
- Using VARCHAR
- Using User Defined Formats
Using Repeated Tables

(Of course now CAS also offers indexes to optimize WHERE operations)

In this post, let's look at two of these techniques -- reducing table width with VARCHAR and table partitioning -- to see the relative impact of each in a real situation.

To test these two techniques, we'll perform the same CAS action against different versions of the same CAS table where we've either applied or not applied these techniques.

The Input Tables

To get a sense of how much each strategy affects performance, we'll start with three extra large versions of the MegaCorp table. Each of our expanded MegaCorp tables has 46 columns and 7,961,290 rows but they each utilize different storage options. The three versions include:

A wide version that uses long CHAR variables
A thin version that uses VARCHAR variables
A thin version partitioned by the Unit variable

The thin table contains all of the information contained in the wide table but with less padding due to the use of VARCHAR data types. Thus the thin table is only 40% the size of the wide table ( (3732011500 + 292734220) / 98210500000 ). The partitioned table is a copy of the thin table but is partitioned by the Unit field. In our test, we'll group by Unit to see how much pre-partitioning helps in this case.

Comparing CAS' performance against these tables will give us an idea of how much shrinking record width and/or partitioning impact performance.

The Test: Simple Summary CAS Action

From bar charts to line graphs, many VA objects utilize the Simple.Summary CAS action. To see this, we can request query diagnostics in VA by pressing cntrl-alt-q. In the example below, we see the Simple.Summary action that produced this VA bar chart.

Bar Chart issuing a Summary CAS Action

To simulate a VA report object and gauge its performance on our input tables, we use the following CAS action. It essentially aggregates the input table by the Unit field producing average values for two measure fields:

proc cas;
simple.summary / table={name=”(((Input Table)))” caslib=”dm” groupby={“unit”}}
                 subSet = {“MEAN”} inputs = {“Profit” “Revenue”};
run;

The Results:

Executing the above CAS action multiple times against the three CAS tables gave the following results (Please note that these results were achieved on under-sized virtual hardware. So it is impossible to derive any overall performance information from them. We are just looking at relative performance here, while using the same hardware) :

Input CAS Table	Average CPU Time (sec)	% Improvement
Wide MegaCorp	98.9
Thin MegaCorp	14.67	85%
Thin, Paritioned MegaCorp	13.02	11.2%

Results Analysis

The first thing that jumps out from the test results is a huge performance enhancements from decreasing the table's record length. In this test, the reduction was accomplished using VARCHAR in place of long CHAR fields but it can also be done using custom formats (substituting short codes for long descriptions) and/or removing unnecessary fields. While the size reduction in this test was quite significant (60%), even modest size reductions can result in improved performance.

While not as dramatic as record-width reduction, pre-partitioning also leads to a fairly significant reduction in CPU time proving this is an effective strategy for reducing CAS CPU burden and delivering results faster.

Further Discussion

While CAS utilizes sophisticated multi-machine big data processing techniques, you can still coax better performance out of it by optimizing the storage options of your CAS tables.

The above results confirm our previous findings that minimizing record length in CAS is critical for optimal performance. So, where appropriate, utilize VARCHAR in place of CHAR, replace long character fields with short codes (and/or utilize custom formats), and remove unnecessary fields.

Likewise, where appropriate, pre-partition your data on common GROUP-BY variables. Good candidate variables will be used often for grouping with a medium cardinality -- not too many values (e.g. Using a unique key would give you as many partitions as rows) but not too few values either(e.g. 4 values for a 10,000,0000 row table would result in too few partitions).