In a previous post, I gave some specific strategies for optimizing Visual Analytics reports by optimizing the underlying CAS tables. Let's take another look at this topic, but this time let's work from a specific example with Viya 3.4.
In that article, I listed a number of techniques for optimizing CAS tables including:
(Of course now CAS also offers indexes to optimize WHERE operations)
In this post, let's look at two of these techniques -- reducing table width with VARCHAR and table partitioning -- to see the relative impact of each in a real situation.
To test these two techniques, we'll perform the same CAS action against different versions of the same CAS table where we've either applied or not applied these techniques.
To get a sense of how much each strategy affects performance, we'll start with three extra large versions of the MegaCorp table. Each of our expanded MegaCorp tables has 46 columns and 7,961,290 rows but they each utilize different storage options. The three versions include:
The thin table contains all of the information contained in the wide table but with less padding due to the use of VARCHAR data types. Thus the thin table is only 40% the size of the wide table ( (3732011500 + 292734220) / 98210500000 ). The partitioned table is a copy of the thin table but is partitioned by the Unit field. In our test, we'll group by Unit to see how much pre-partitioning helps in this case.
Comparing CAS' performance against these tables will give us an idea of how much shrinking record width and/or partitioning impact performance.
From bar charts to line graphs, many VA objects utilize the Simple.Summary CAS action. To see this, we can request query diagnostics in VA by pressing cntrl-alt-q. In the example below, we see the Simple.Summary action that produced this VA bar chart.
Bar Chart issuing a Summary CAS Action |
To simulate a VA report object and gauge its performance on our input tables, we use the following CAS action. It essentially aggregates the input table by the Unit field producing average values for two measure fields:
proc cas;
simple.summary / table={name=”(((Input Table)))” caslib=”dm” groupby={“unit”}}
subSet = {“MEAN”} inputs = {“Profit” “Revenue”};
run;
Executing the above CAS action multiple times against the three CAS tables gave the following results (Please note that these results were achieved on under-sized virtual hardware. So it is impossible to derive any overall performance information from them. We are just looking at relative performance here, while using the same hardware) :
Input CAS Table | Average CPU Time (sec) | % Improvement |
Wide MegaCorp | 98.9 | |
Thin MegaCorp | 14.67 | 85% |
Thin, Paritioned MegaCorp | 13.02 | 11.2% |
The first thing that jumps out from the test results is a huge performance enhancements from decreasing the table's record length. In this test, the reduction was accomplished using VARCHAR in place of long CHAR fields but it can also be done using custom formats (substituting short codes for long descriptions) and/or removing unnecessary fields. While the size reduction in this test was quite significant (60%), even modest size reductions can result in improved performance.
While not as dramatic as record-width reduction, pre-partitioning also leads to a fairly significant reduction in CPU time proving this is an effective strategy for reducing CAS CPU burden and delivering results faster.
While CAS utilizes sophisticated multi-machine big data processing techniques, you can still coax better performance out of it by optimizing the storage options of your CAS tables.
The above results confirm our previous findings that minimizing record length in CAS is critical for optimal performance. So, where appropriate, utilize VARCHAR in place of CHAR, replace long character fields with short codes (and/or utilize custom formats), and remove unnecessary fields.
Likewise, where appropriate, pre-partition your data on common GROUP-BY variables. Good candidate variables will be used often for grouping with a medium cardinality -- not too many values (e.g. Using a unique key would give you as many partitions as rows) but not too few values either(e.g. 4 values for a 10,000,0000 row table would result in too few partitions).
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.