Let's take a look at what makes CAS tables big and what makes them small and see what we can do to limit any unnecessary girth.
1. Memory Alignment (8 Byte Boundaries)
CAS tables are tightly integrated with computer memory. In many respects, CAS tables are stored as memory and, as such, they share many of the attributes of computer memory. One such attribute is data structure alignment. While this topic is quite technical, it basically means that CAS tables will contain extra padding between fields.
Like LASR before it, CAS conforms to memory alignment requirements by aligning all field data along 8 byte boundaries. So, while CAS allows any length of character data (CHAR(1), CHAR(2), CHAR(3), ....) which is useful for domain constraint restrictions, the net effect of each field's length on memory and disk (CAS Disk Cache and/or SASHDAT) is always rounded up the closest multiple of 8 bytes. So, as the diagram below depicts, a char(5) will be padded with 3 bytes so that the next field can be placed on the next 8 byte boundary. Thus, the net effect of the char(5) field is 8 bytes of storage. Similarly, the net effect of a char(12) would be 16 bytes.
Example: 8 Byte Alignment Select any image to see a larger version. Mobile users: To view the images, select the "Full" version at the bottom of the page.
Growth due to memory alignment should be considered during all aspects of the CAS project from system sizing to data design. For example, splitting out an 8 byte description field to a dimension table (which would normally have a positive impact on sizing assuming a 1 byte foreign key) would have a slightly negative impact on sizing in CAS compared to simply putting the field on the fact table since you would need a minimum 8 byte foreign key on both tables to reference it.
A CHAR(3) and CHAR(3) take up as much space as a CHAR(8) and a CHAR(8) due to memory alignment.
Finally, growth due to memory alignment will occur in RAM (CAS data that is residing in active memory) as well as CAS Disk Cache and SASHDAT since these on-disk locations can act as virtual memory and thus must adhere to the server's memory alignment requirements.
2. Transcoding Multiplication Factor
All CAS data is transcoded in UTF-8. UTF-8 is a variable width encoding system requiring 1 byte to store some characters and up to 4 to store others.
When source data is encoded in something other than UTF-8, the data may require more space in CAS than it did in the source system. As a turn-key solution to this potential expansion, CAS offers the NCHARMULTIPLIER parameters (Doc links: CASUTIL, System Option, CVP Libname, CAS Connectors) to define how much to expand character fields in order to avoid truncation.
The NCHARMULTIPLIER parameter can be set anywhere from 1 to 4. So, depending on setting, a CAS character field can be 1 to 4 times as big as the source field. For example, a 100GB (all character) CAS table with NCHARMULTIPLIER=1 would be 100GB (ignoring all other sizing considerations). The same table with NCHARMULTIPLIER=4 would be 400GB.
Effect of NCHARMULTIPLIER=4 on an 8 character field
If the NCHARMULTIPLIER setting is set too low for the character data (e.g. Setting it to 1 for a Double Byte Character Set), then the data could be truncated. So be careful. For a better understanding of SAS and transcoding, see the documentation, and for further understanding, see this paper. There is also a FAQ for Viya and encoding.
While SAS makes an attempt to set the multiplier if you do not specify a value, if you are setting it manually, analyze the encoding and the data and choose the lowest setting the data will allow, or, even better, analyze each field and specifically set its length based on its contents and encoding (making sure to set the NCHARMULTIPLIER to 1 since you're managing the sizing yourself). Going field-by-field will allow for the smallest possible table size.
(If you're really confused about encoding and what it all means, this post should help.)
As with memory alignment, growth due to the transcoding multiplication factor will occur in RAM as well as CAS Disk Cache and SASHDAT.
3. VARCHAR to CHAR Conversion
As discussed previously, CAS contains a variable length character data type (VARCHAR) which, in certain circumstances, can save lots of space when compared to the fixed length character type (CHAR).
When server-side loading, CAS automatically maps incoming variable length character fields (e.g. REDSHIFT VARCHAR) to the CAS VARCHAR data type. However, CAS data can often be unknowingly sent to SAS (by running non-CAS-enabled procs, using non-CAS-enabled DATA Step functions, etc.) where VARCHAR fields will be converted to CHAR. This can lead to the field ultimately being stored in CAS with a CHAR data type (fixed length).
To mitigate unwanted VARCHAR to CHAR conversion, be aware that client-side processing in SAS can lead to data type changes. If non-CAS processing is required or if you're just loading CAS via a SAS client, be sure to set the VARCHARCONVERSION option.
4. Using VARCHAR When You Shouldn't
The way CAS stores its variable length data types impacts how you should apply the data type. Like SAS, behind the scenes, CAS depends mainly on a fixed length data structure to store CAS tables. Variable length fields are stored in a separate data structure. While a 16 byte "pointer" links the fixed and variable length data. Thus every VARCHAR field in CAS brings with it an extra 16 bytes of overhead.
CAS Variable Length Storage Scheme
Given this overhead, you'll only want to utilize VARCHAR on long character fields that are highly variable in length like descriptions. When applied in those scenarios, VARCHAR will save lots of space. Utilize CHAR for variables under 16 bytes and analyze the lengths of the data values to see if VARCHAR makes sense for longer fields.
As with the previous considerations, growth due to sub-optimal VARHCAR usage affects memory as well as the CAS Disk Cache and SASHDAT.
5. Failover (COPIES)
While more about data architecture than CAS table design, the COPIES option will definitely impact the CAS-Disk-Cache representation of the CAS table. The parameter basically specifies how many back-up copies to keep for the CAS table. Since CAS tables are spread over multiple machines, CAS spreads extra copies of the tables' pieces around too so that if one machine fails, the extra copies can be used in place of the pieces that were on the failed machine. Setting the parameter to 1 means that there will be one extra copy. Set the parameter to 2 and there will be 2 extra copies. These copies do not impact memory however. They only impact CAS Disk Cache.
CAS Disk Cache Table Blocks Being Promoted after a Node Failure
Using CAS' COPIES feature for failover is not always required. Often there is no requirement for a specific table to be up 24X7. So COPIES can be set to 0 and, if something happens, the table can simply be reloaded. Additionally other redundancy technology might already be in place such as SASHDAT on highly available HDFS. In cases like these, COPIES should be set to 0 since failover is being handled in a different way.
In general, setting COPIES to 1 doubles the size of the CAS table on disk. Setting it to 2 triples it and so on.... Therefore set it as low as possible to get the smallest CAS Disk Cache representation of your CAS table.
6. Repeated Tables
While COPIES is used for failover, the REPEAT option is used for CAS Action optimization (~Query Optimization). Unlike COPIES which creates inactive copies of the CAS table for failover, REPEAT creates a full active copy of the CAS table on every CAS worker node. As mentioned, this is done for optimization. For example, joining a large table that is distributed across all of the worker nodes to a smaller table that is repeated on every node is much more optimal than joining the two when each is distributed. View more information on both the COPIES and REPEAT options.
REPEATED CAS Table
While the REPEAT option can definitely improve performance of some CAS actions, it will definitely increase the overall size impact of the CAS table since you will get n copies of the table where n is the number of worker nodes. So if you had 10 worker nodes, a repeated table would take up 10 times as much space. This impact would be on disk in the form of CAS Disk Cache as well as memory when the CAS table is in use.
7. Indexes
I mention CAS indexes for completeness because they definitely increase the size of the table. However, the size impact of indexes is quite small and is a minor consideration considering the positive impact they have on performance. In general you can expect a size increase of 5% or less for unique indexes and 1% to 3% for non-unique indexes depending on cardinality. For more on CAS indexes, check out this post.
8. Compression
Compression, of course, is another consideration when designing/sizing CAS tables. The CAS implementation is very similar to the LASR implementation and our initial tests indicate you should expect a similar performance impact. View the latest info on compression in CAS and what to expect if you choose that route.
9. Auto-Partitioning and Auto-Replication
While not technically part of the table sizing process, CAS data architects should be aware of the way CAS creates temporary data structures to optimize CAS actions. Particularly, when performing BY-Group operations, joins, and certain analytical actions, CAS will temporarily replicate the table in one of two ways. Depending on the situation, CAS may "auto-partition" the table or "auto-replicate" it. In auto-partitioning, CAS copies and re-organizes the CAS table according to the BY-Groups required. In auto-replication, CAS distributes full copies of the table to every CAS worker node (like a temporary REPEAT).
Data Movement for CAS Action Optimization
So, while you might save space not replicating (REPEAT-ing) your CAS table up front, it could be needlessly replicated over and over again anyway to satisfy optimization algorithms. Similarly, you should consider partitioning your CAS table upon load to avoid partitioned copies of it being created to satisfy end-user requests from VA and the like.
10. Interactions Among Sizing Considerations
Finally, please note that the above considerations do not exist independent of each other. If a CAS table grows due to memory alignment, that growth will also further impact failover redundancy (COPIES), optimization redundancy (REPEAT), etc.. Conversely, increasing a character field's length from 3 bytes to 6 bytes to account for transcoding will have no net effect on sizing due to 8 byte memory alignment. So make sure to consider the complete picture when sizing.
... View more