If you have read the SAS Viya 3.5 for Linux: Deployment Guide and deployed Viya, you will have come across references to CASDATADIR. Interestingly, it is not mentioned much elsewhere, which may seem odd given its purpose. If you are a technical architect, or are working with customers on performance factors beyond the CAS Disk Cache, networking and optimum data loading techniques, then read on.
Let's start with some content from SAS Viya 3.5 for Linux: Deployment Guide:
By default, product caslibs are written to /opt/sas/viya/config/data/cas/default, which is often hosted on a single hard disk drive with limited storage. To ensure proper performance of your SAS solutions, SAS recommends that the CASDATADIR option be configured to point to a high-performance storage platform. Examples of high-performance storage platforms include SAN, NVMe, and multiple drive disk arrays. ... Changing the CAS data directory is especially useful for solutions that can be resource-intensive, such as SAS Visual Forecasting, SAS Visual Data Mining and Machine Learning, SAS Visual Text Analytics, and SAS Analytics for IoT. Multiple predefined system caslibs and the Public caslib have a default location for persistent storage: /opt/sas/viya/config/data/cas/instance-name/name-of-Public-caslib. You can specify the instance name when you edit the vars.yml. If you anticipate that many users will use browsers to access the user interfaces and to import data from files, additional space for this file system will be required. SAS recommends monitoring disk usage at /opt/sas/viya/config/data/cas (assuming this is the value of CASDATADIR).
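The recommendation to monitor disk usage can be sketched in a few lines of Python. This is a minimal illustration only, not a SAS-supplied tool: the function names and the 80% threshold are my own assumptions, and you would point it at whatever path CASDATADIR actually resolves to in your environment.

```python
import shutil


def disk_usage_percent(path):
    """Return the percentage of the file system holding `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0  # guard against pseudo file systems that report no capacity
    return 100.0 * usage.used / usage.total


def needs_attention(path, threshold_percent=80.0):
    """True when usage meets or exceeds the alert threshold.

    The 80% default is a hypothetical value chosen for illustration.
    """
    return disk_usage_percent(path) >= threshold_percent
```

For example, calling needs_attention("/opt/sas/viya/config/data/cas") from a scheduled job on the CAS Controller would give an early warning before the file system fills up.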
Acknowledgements. Before I go any further, I want to call out that this post reflects conversations and information that my colleagues here at SAS have shared. My appreciation to them for their input.
Useful data points
CASDATADIR - also termed CAS Data Directory
By default the directory for CASDATADIR is /opt/sas/viya/config/data/cas/default
The CASDATADIR is defined in the CAS_CONFIGURATION block of the vars.yml file
When installing and configuring an MPP CAS Server, the CASDATADIR is configured on the CAS Controllers (Primary and Secondary) as well as the CAS Workers
By default, applications such as SAS Visual Forecasting, SAS Visual Data Mining and Machine Learning, and SAS Visual Text Analytics write results tables to the sub-directories within CASDATADIR only on the CAS Controller, i.e. files small or big are written as a single block (they are not distributed as blocks of data across the CAS Workers)
Architecture considerations based on the data points
In customer environments with heavy analytical usage (users + volume of data), the CAS Controller can get very busy
Ensuring sufficient bandwidth/throughput between the CAS Workers and the CAS Controller becomes really important
Writing data to CASDATADIR (and reading it back) as quickly as possible should be a goal, to minimise the impact on the CAS Controller's resources, i.e. I/O for the CAS Controller needs to meet the usage patterns
Knowing what data is written to CASDATADIR, and when, can help inform customer teams on the performance requirements within the architecture design
Knowing methods to prevent unnecessary writing to the CASDATADIR location might prove beneficial, e.g. having large output tables produced through modelling written to HDFS or DNFS instead
For some customers, backing up the CASDATADIR structure is likely to be a useful recommendation
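To get a feel for what is being written to CASDATADIR and when, something like the following sketch could be run on the CAS Controller. It is an illustration under my own assumptions: the function name and the 24-hour window are hypothetical, and it relies only on file modification times, so it will not tell you which application wrote a file.

```python
import os
import time


def recent_files(root, max_age_hours=24.0, now=None):
    """List files under `root` modified within the window, largest first.

    Returns tuples of (path, size_in_bytes, modification_time).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                info = os.stat(full)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if info.st_mtime >= cutoff:
                found.append((full, info.st_size, info.st_mtime))
    # Largest files first, since they matter most for I/O planning
    found.sort(key=lambda item: item[1], reverse=True)
    return found
```

Running it at intervals through the day against the CASDATADIR root gives a rough picture of peak write periods, which can then feed into the I/O requirements for the CAS Controller.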
What's in CASDATADIR?
A quick directory listing of the Viya 3.5 environment we use for our team's testing here at SAS shows the following, with the root directory being /opt/sas/viya/config/data/cas/default. FYI, 'default' refers to the name of the CAS Server. If an environment has multiple CAS Servers, then something other than 'default' is likely to be listed as that directory. Some additional information regarding the predefined caslibs can be found in the SAS Viya 3.5 Administration: Data document. Sub-directories for individual users can be found under 'casuserlibraries' and sub-directories for individual projects can be found under the 'projects' directory.
drwxr-xr-x. 2 cas sas 4096 Dec 4 17:48 appData
drwxr-xr-x. 55 cas sas 4096 Dec 4 17:48 casuserlibraries
drwxrwxrwx. 2 cas sas 64 Feb 10 02:30 formats
drwxrwxr-x. 2 cas sas 6 Jun 26 2018 modelMonitorLibrary
drwxrwxrwx. 2 cas sas 4096 Dec 2 16:34 models
drwxr-xr-x. 2 cas sas 6 Jun 26 2018 modelStore
drwxr-x---. 16 cas sas 4096 Nov 26 14:13 projects
drwxrwxrwx. 2 cas sas 4096 Dec 4 17:48 public
drwxr-xr-x. 2 cas sas 6 May 30 2019 qasMartStore
drwxr-xr-x. 2 cas sas 4096 Dec 4 17:48 referenceData
drwxr-xr-x. 2 cas sas 4096 Jun 26 2018 samples
drwxr-xr-x. 2 cas sas 103 Feb 7 15:54 search
drwxr-xr-x. 2 cas sas 6 Jun 26 2018 sysData
drwxrwxrwx. 2 cas sas 6 Jun 26 2018 vamodels
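The listing above shows the directories but not how much space each one consumes. A rough per-sub-directory total, similar in spirit to du, could be sketched as follows. The function name is my own, and the account running it needs read access to the tree (note that 'projects' is not world-readable in the listing above).

```python
import os


def subdir_sizes(root):
    """Map each immediate sub-directory of `root` to the total bytes of files beneath it."""
    totals = {}
    with os.scandir(root) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue  # only report on real sub-directories
            total = 0
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    try:
                        total += os.lstat(os.path.join(dirpath, name)).st_size
                    except OSError:
                        pass  # unreadable or removed mid-scan; ignore
            totals[entry.name] = total
    return totals
```

Sorting the result by value quickly shows whether 'projects', 'casuserlibraries' or 'public' is the main consumer of space.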
CASUSER libraries - to write or not to write to CASDATADIR
Users of the visual applications can choose to write to the CASUSER caslib. Whether the CASUSER libraries are written to the CASDATADIR directory structure depends on:
the type of interface being used
whether the user is a member of the CASHostAccountRequired custom group.
The table below from the SAS® Viya® 3.5 Administration: Identity Management document explains this clearly. See table 1 below.
User Scenario: User starts CAS sessions from visual interfaces (includes all SAS Viya interfaces except SAS Studio 4 and Base SAS or SPRE sessions), and user is not a member of the CASHostAccountRequired custom group. This is the default behavior.
CASUSER Path Location: /opt/sas/viya/config/data/cas/default/casuserlibraries/username
Session Information: Sessions run under the CAS server user (cas). The directory and all files within it are owned by the cas user.

User Scenario: User starts CAS sessions from visual interfaces (includes all SAS Viya interfaces except SAS Studio 4 and Base SAS or SPRE sessions), and user is a member of the CASHostAccountRequired custom group.
CASUSER Path Location: $HOME/casuser
Session Information: Sessions run under the user’s host account.

User Scenario: User starts CAS sessions from SAS Studio 4, Base SAS, or SPRE, regardless of whether the user is a member of the CASHostAccountRequired custom group.
CASUSER Path Location: $HOME/casuser
Session Information: SAS Studio 4, Base SAS, and SPRE sessions always run under the user’s host account, and use the $HOME/casuser CASUSER path location.
One item worth mentioning here is the case where the user is a member of the CASHostAccountRequired custom group and uses the visual interfaces to store data in the CASUSER caslib. Because the data then lands under $HOME, this may result in out-of-space issues, as some customer IT teams restrict the size of the $HOME directory per user.
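The three scenarios in the table boil down to a small decision rule, sketched below. This is purely my own illustration of the documented behaviour: the function, its parameters and the interface labels are hypothetical, and the default CASDATADIR value assumes a CAS Server named 'default'.

```python
VISUAL = "visual"            # all SAS Viya interfaces except the ones below
PROGRAMMING = "programming"  # SAS Studio 4, Base SAS or SPRE sessions


def casuser_location(interface, in_host_account_group, username, home_dir,
                     casdatadir="/opt/sas/viya/config/data/cas/default"):
    """Return (casuser_path, session_owner) for the scenarios in the table.

    `in_host_account_group` means membership of the CASHostAccountRequired
    custom group.
    """
    if interface not in (VISUAL, PROGRAMMING):
        raise ValueError(f"unknown interface type: {interface!r}")
    if interface == VISUAL and not in_host_account_group:
        # Default behaviour: session runs as the cas user inside CASDATADIR
        return (f"{casdatadir}/casuserlibraries/{username}", "cas")
    # Host-account sessions always use $HOME/casuser
    return (f"{home_dir}/casuser", username)
```

Only the first branch writes into CASDATADIR, which is why it is the default visual-interface users who drive the growth of the 'casuserlibraries' directory.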
Large input tables & CASDATADIR
Users of VDMML (Model Studio interface) may be familiar with the fact that on the first run of the Data node, the source data is copied to the CASDATADIR directory structure, e.g. /opt/sas/viya/config/data/cas/default/projects/datamining-0abc3abf-9ad2-477a-89b5-989f1e4cfe9a. The good news is that a method to prevent this was very recently documented in the latest version of the Model Studio 8.5: SAS® Visual Data Mining and Machine Learning 8.5: Advanced Topics document. Here is the text:
Model Studio copies the data source when the first Data node is run. This can cause performance issues and can cause you to run out of disk space. The amount of space that is required depends on the number of saved projects and on the size of the data source.
To prevent Model Studio from automatically creating copies of your data, ensure that the following conditions are met:
1. A Key variable exists in your data. This can be either a variable named _INDEX_ or a variable that is assigned the role Key.
2. A Partition variable exists in your data. This can be either a variable named _PARTIND_ or a variable that is assigned the role Partition.
3. The data must be persistent on the disk.
For the third condition, here is some additional clarification. The table must be loaded directly from the caslib's own source. You cannot use PROC CASUTIL to load the table into a caslib that has a different source path than the table.
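As a quick way to reason about whether a given table meets these conditions, they can be written down as a simple check. This is a sketch of the documented rules only, not something Model Studio exposes: the function and its inputs are hypothetical, and matching names and roles case-insensitively is my own assumption.

```python
def will_copy_data(columns, persistent_on_disk, loaded_from_caslib_source):
    """Return True if Model Studio would be expected to copy the data source.

    `columns` maps each column name to its assigned role (or None).
    """
    def role_is(role, wanted):
        return (role or "").lower() == wanted

    has_key = any(
        name.upper() == "_INDEX_" or role_is(role, "key")
        for name, role in columns.items()
    )
    has_partition = any(
        name.upper() == "_PARTIND_" or role_is(role, "partition")
        for name, role in columns.items()
    )
    # All conditions must hold for the copy to be avoided
    avoided = (has_key and has_partition
               and persistent_on_disk and loaded_from_caslib_source)
    return not avoided
```

For example, a table with _INDEX_ and _PARTIND_ columns that was loaded into a caslib from a different source path would still be copied, because the last condition fails.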
Large output tables & CASDATADIR
Depending on the type of analytics being done, the original data may need to be written in full to an output table with additional columns appended, e.g. columns containing predicted or forecasted values and the delta between predicted/forecasted values. Consider output from SAS Visual Forecasting, where the input tables can be massive due to the nature of what is being forecasted, e.g. groceries, values of stocks and shares, etc. Since the forecasted values will be used further down the business process, the table will likely need to be converted into another format, e.g. CSV, for other applications to leverage.
Therefore, when working with large output tables, the customer team needs to consider when and where to place the output tables, and when to convert them into another format. If a customer team has a mixture of VA users, data scientists and forecasting specialists all using one CAS Server, it may be preferable to initially write the output table to HDFS or DNFS as a SASHDAT file, then later in the day (outside of normal office hours) write the table out as a CSV file and make it available to the business users outside of the Viya environment. Writing very large output tables to the CASDATADIR location (think 10 GB and upwards) during normal office hours may impact the user experience for multiple user groups (assuming they all share one CAS Server).
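The placement and timing guidance above could be captured as a simple policy, sketched here. Everything in it is an assumption for illustration: the function names, the 08:00 to 18:00 weekday office hours, and treating 10 GB as a hard cut-off are my own choices rather than SAS recommendations.

```python
from datetime import datetime, time

LARGE_TABLE_BYTES = 10 * 1024 ** 3  # "10 GB and upwards"
OFFICE_START = time(8, 0)           # hypothetical office hours
OFFICE_END = time(18, 0)


def in_office_hours(moment):
    """True on weekdays between OFFICE_START and OFFICE_END."""
    return moment.weekday() < 5 and OFFICE_START <= moment.time() < OFFICE_END


def plan_output(size_bytes, moment):
    """Suggest where to write an output table and when to export it as CSV."""
    large = size_bytes >= LARGE_TABLE_BYTES
    return {
        # Keep very large tables off the CAS Controller's CASDATADIR
        "sashdat_target": "HDFS/DNFS" if large else "CASDATADIR",
        # Defer the costly CSV conversion for large tables until off-hours
        "csv_export": "defer" if large and in_office_hours(moment) else "now",
    }
```

For example, plan_output(25 * 1024 ** 3, datetime(2024, 1, 10, 11, 0)) suggests writing the SASHDAT file to HDFS/DNFS and deferring the CSV export.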
Summary
Knowing the purpose of the CASDATADIR directory and its sub-directories will help technical architects, users and administrators alike. It will hopefully limit unwanted end-user experiences and contribute to a performant SAS Viya environment. As always, comments are welcome, especially if you think there is something that could add value or needs clarifying/correcting.
Thanks, Simon