The "CAS - Load Tables from Folders in Filesystem" Custom Step

1 Like

Modern, cloud-native analytical platforms and resources are expected to be more purpose-built and ephemeral.

Analytics practitioners therefore require easy tools to quickly hydrate a given environment with data of their choice.

Join me as I walk through a recent SAS Studio Custom Step: "CAS - Load Tables from Folders in Filesystem". Currently available here within the GitHub repository for SAS Studio Custom Steps, this facilitates easy data load from within a filesystem folder to in-memory Cloud Analytics Services (CAS) tables.

This post focusses on two aspects. "Using the Custom Step" refers to operational aspects of accessing and running the step. "Bigger Picture" deals with broader application areas and discussion on thought process and design. Feel free to skip ahead to "Bigger Picture" in case you are already familiar with SAS Studio Custom Steps and can follow this README.

Using the Custom Step

This Custom Step is part of the SAS Studio Custom Steps GitHub repository. Please note that this repository contains clear instructions to upload a selected custom step to SAS Viya, and I recommend you follow the same.

But, purely for awareness' sake and to expose multiple options available within SAS Studio, I also illustrate another process which uses Git integration within SAS Studio. This is shown in the following animated GIF.

The steps, in brief:

1. Follow these instructions to create a Git profile. Also, you may like to refer this video.

2. From Options -> Manage Git Connections, select Repositories -> Clone a repository.

3. Enter the following repo address :git@github.com:sassoftware/sas-studio-custom-steps.git and clone to a directory within your filesystem.

4. For easier access, move the Custom Step located within the folder "CAS - Load Tables from Folders in Filesystem" to a folder within SAS Content. This helps the custom step show up readily in the Shared pane of the Steps tab in SAS Studio.

As an alternative, you may like to just right-click on the desired custom step, download the same to your laptop, and then upload the same to any folder within SAS Content.

You are now ready to use this custom step! To quickly test the same, I use a set of example data files which I have saved to a folder within my filesystem. This data folder (which contains a mix of csv, excel, and a host of other files) was originally part of a demo repository which I (again) have saved on GitHub so that I can easily access the same from multiple places. Therefore, it becomes a simple matter to repeat the exercise to obtain data from this GitHub repository and save the source files within my filesystem folder in Viya.

Now that you have your data folder available in your filesystem, let's go ahead and run this custom step. Since this step involves loading data to CAS, ensure you have an active CAS connection. There are a variety of ways through which you can do this, a simple option is to run the following code snippet:

cas my_cas_session ; 
/* my_cas_session simply refers to a name I call my CAS session :) */

caslib _all_ assign;
/* The above optional statement ensures predefined caslibs such as Public etc. are made available to the compute session */

The next animated GIF demonstrates use of the custom step. The main input parameter required is the folder where the data resides. In addition, the following other parameters can be provided.

(i) Filter by Extension : Select the extension of the files within the directory, for e.g., load only CSV files or excel files.

(ii) File pattern : You may choose to provide a wildcard pattern to search for characters found in the filename. The wildcard pattern you provide are those provided in the SQL LIKE operator. Refer this page for an explanation.

Finally, you choose the target location (CASLIB) where you would like to load the table. You can either load AND promote those tables, making them global in scope, This option is checked by default. Or, you can load tables with local scope, meaning they can be used within that specific SAS Studio session alone.

Bigger Picture

While its premise may be simple (and seasoned programmers may react with a "Why don't we just code it?"), the creation of this custom step was an opportunity to consider principles regarding applicability, low-code analytics components and cloud-friendly analytics development.

1. Data Hydration & Autoload

This custom step proves useful for cases where designated data files (available within folders) need to be loaded to a fresh environment. A further consideration arises when SAS Viya services are shut down during off-hours and restarted as per a schedule. Instead of writing multiple scripts to load required data to tables, multiple instances of this custom step (as many as required by data folder locations) can be configured within a SAS Studio Flow. This flow can be easily scheduled to run as a job right after Viya services start up, ensuring that data gets loaded to memory right away. Maintenance is also easier, as new data locations can be added (or removed if the project becomes inactive) as per need.

2. Staging Areas & Incremental Data

Analytics practitioners sometimes encounter situations when additional data comes in at intervals, and may need to be loaded to CAS either for profiling, data quality checks, or as part of larger ETL flows. It may not be possible to know the names of additional files every time. This custom step will be useful to load required data from the folder to a designated Caslib (and then 'flushing' the folder, or moving the source files elsewhere).

3. Additional Considerations when Loading Data

The development of this custom step helped us acknowledge that even an innocuous step of loading data from a source file to an in-memory CAS table has implications beyond the obvious. To put it simply, there might be many edge cases which are not accounted for. Wilbram Hazejager was very helpful in raising and helping solve some of these points in a collaborative manner (and collaboration is a key practice that the SAS Studio Custom Steps repo means to foster). Some were accounted for inside the step, some were partially addressed, and some others have been slated under future consideration. Here's an illustrative list:

i) Consider scenarios where supported file extensions increase in future and users may like to load a mixed set of files - e.g. load only CSV and Excel files among a larger set. Currently users have an option to load all file extensions or select a specific extension. Accounting for a mix of extensions will remove the need to run another instance of this custom step and render simplicity.

ii) In the unlikely event of source files with the same name but different extensions, each successive table (which has the same name) overwrites the previous. The reason is that currently the step loads the files in alphabetical order, and therefore the “last one will win”. A future enhancement for this step could be to add the file extension to the table name as a suffix (data_csv, data_sashdat etc.). But should this be done for all files, or only files which have the same name but different extensions? Let us know what you think.

iii) Many users may not be aware of the ramifications of using Promote. When a table is promoted to global scope, this means the table is now modified for all users. If a table did in fact exist earlier of the same name, it will be overwritten (and changes made by other users will be lost). Users of this custom step should carefully consider this before deciding to promote. An alternative policy would be to persist further changes with a different table name (which makes sense since the table has changed in character).

iv) Again, a very simple scenario which is easy to overlook: what if this folder has already been mapped to a caslib? Again, the main purpose of this custom step was to handle situations when any arbitrary folder (within a filesystem such as a NFS mount) may contain data to be loaded to CAS. In this custom step, we have inserted code which ensures that if the given folder path is already mapped to a caslib, the existing caslib will be used. Otherwise a temporary caslib will be used for the same (and dropped after the completion of this step).

The above considerations proved to be important food for thought when designing the step. When you use the custom step and refer the documentation (in the About tab as well as the README) and the source code, users will note comments which speak to these enhancement possibilities. Also, we would also like to point out that the logic followed in this Custom Step is very similar to the construct described in this SAS Communities article. Those interested in the programming aspects of SAS code will find it useful to understand how one can iterate among an array of result items within CASL and perform further processing. Please go through the same for a deeper understanding of the technicalities of obtaining a set of files within a directory and loading them one by one.

Have fun with the SAS Studio Custom Step for loading files to CAS! Feel free to comment or email in case you would like to share further thoughts and suggestions.