Welcome back to my "Simplifying SAS Viya" series. In Part 1, we discussed the differences between the Compute Server and the CAS Server, exploring when to use each and what happens behind the scenes. In this post, we’ll focus on what caslibs are, how to use them, and how they compare to traditional SAS libraries.
Throughout this post, examples are included using a SAS Viya LTS 2024.09 environment. For demonstration purposes, I have created a folder OrdersData containing data about customers and their orders from a fictitious company. This folder includes several file types.
Traditional SAS Libraries
In SAS 9, libraries are created with the LIBNAME statement and connect to data sources such as databases, cloud data, or folder paths. A library reference, or “libref”, acts as an alias for the physical data source. In code, data is referenced as libref.tablename. Libraries function similarly in SAS Viya when working with data on the Compute Server.
Here is an example of creating a libref named ordLib for the SAS tables in the OrdersData folder. The FREQ procedure then uses the orders table in ordLib to display the number of orders placed in each country.
LIBNAME ordLib base "/home/student/Courses/PGVY/OrdersData";
PROC FREQ data=ordLib.orders;
TABLES Country /nocum nopercent;
RUN;
The library ordLib is visible in the Libraries pane, containing three SAS tables.
Partial results:
This code executes on the Compute Server. As discussed in my previous post, running code on the Compute Server works well unless you're working with a large dataset (over 50GB), programs that require multiple reads of the same data, or computationally intensive tasks. In those cases, using the CAS Server is beneficial due to its high-speed, in-memory processing capabilities.
Working with Data in CAS
What changes when working with data on the CAS Server?
One of the benefits of working in CAS is that data is held in-memory, reducing I/O. To process data in-memory, we must first create a CAS session. The session encapsulates or “keeps a record” of our work and connects us to the CAS Server. Use a CAS statement to create a session. For example:
CAS CJsSession;
Now we’re connected to the CAS Server. The session is named CJsSession, and the CAS Server is ready to process in-memory data. After establishing a session, connect data to the CAS Server using caslibs. Caslibs connect to a variety of data sources such as data in the cloud, databases, folder paths, and streaming data.
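As a minimal sketch of managing the session created above, the CAS statement can also list the sessions you are connected to, set session options, and end a session. The timeout value here is only an illustration, not a recommendation.

```sas
/* List every CAS session this SAS session is connected to */
CAS _ALL_ LIST;

/* Set a session option on the existing session; the value of
   1800 seconds is an arbitrary example */
CAS CJsSession SESSOPTS=(TIMEOUT=1800);

/* When all work is finished, end the session. It is left commented
   out here so that later examples still have a session to run in. */
/* CAS CJsSession TERMINATE; */
```

Terminating the session also removes any session-scope caslibs and tables that were created in it.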
Creating a Caslib
In SAS Viya, predefined caslibs are available, and users can create their own caslibs with permission from the SAS administrator.
Use the CASLIB statement to create a caslib:
CASLIB caslibName PATH="/filepath/" LIBREF=libref;
caslibName: Can be up to 256 characters, cannot start with a number, and can contain only letters, numbers, and underscores.
PATH= option: Specifies the location of the physical data files. Unlike traditional SAS libraries, caslibs can contain multiple file types.
LIBREF= option: Maps a libref to the caslib. The libref must follow standard naming conventions. It's common to make the libref and caslib names the same for clarity.
Why Map a Libref to a Caslib?
Mapping a libref to a caslib serves two critical purposes:
The mapped libref is visible in the Libraries pane, showing the caslib and any available in-memory tables.
The libref allows referencing in-memory tables in traditional SAS DATA steps and procedures as libref.tablename.
For example, let's create a caslib for the data source files in the OrdersData folder. Without the LIBREF= option, the caslib ordersCaslib is created successfully but it is not visible in the Libraries pane.
CASLIB ordersCaslib PATH="/home/student/Courses/PGVY/OrdersData";
Adding the LIBREF= option displays the libref connected to the caslib in the Libraries pane. You can recognize a libref that is mapped to a caslib by its cloud icon, which replaces the file cabinet icon. The example below creates the ordCas caslib and assigns the libref ordCas.
CASLIB ordCas PATH="/home/student/Courses/PGVY/OrdersData" LIBREF=ordCas;
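If the first caslib, ordersCaslib, is no longer needed once ordCas exists, it can be removed from the session. The following is a small sketch that assumes ordersCaslib was created in the current CAS session as shown earlier.

```sas
/* Remove the caslib that was created without a libref.
   Dropping a caslib removes the connection and its in-memory tables;
   the files in the OrdersData folder are not deleted. */
CASLIB ordersCaslib DROP;
```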
Mapping a Libref to a Predefined Caslib
Predefined caslibs are often set up by a SAS administrator. To use a predefined caslib that is not visible in the Libraries pane, map a libref to the caslib with the LIBNAME statement.
LIBNAME libref CAS CASLIB=caslibName;
Libref: Must follow SAS naming conventions and should be the same as, or representative of, the caslibName.
Engine: Always CAS for caslibs.
CASLIB= option: Specifies the existing caslib. Caslib names can be longer than librefs, and may contain spaces. If the caslibName contains a space, add quotation marks around it:
LIBNAME libref CAS CASLIB= 'caslib Name With Spaces';
To view all predefined caslibs, run:
CASLIB _all_ list;
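A related form of the same statement assigns librefs in bulk rather than one at a time. This sketch relies on my understanding that only caslibs whose names are already valid librefs (eight characters or fewer, no spaces) receive a libref this way; caslibs with longer names still need a LIBNAME statement.

```sas
/* Assign a libref to every available caslib whose name
   is also a valid SAS libref */
CASLIB _ALL_ ASSIGN;
```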
One predefined caslib in my environment is ModelPerformanceData.
To create a libref named mpd to that caslib, run:
LIBNAME mpd CAS CASLIB=ModelPerformanceData;
The libref mpd, mapped to the caslib ModelPerformanceData, now appears in the Libraries pane:
Caslib Attributes
Caslibs have attributes that describe their data connection and user access. The three main attributes are Local, Active and Personal.
Local
Local=Yes: A local caslib is “session scope”. If the caslib is created in SAS Studio, it is not visible in another application like SAS Visual Analytics. When the CAS session ends, the caslib is deleted. Session scope is useful when working with data that is not shared across sessions.
Traditional SAS comparison: Traditional SAS libraries are deleted at the end of a SAS session. You must run a LIBNAME statement to work with your data again.
Local=No: If a caslib is not local, it is “global scope”. The caslib is available to anyone with permission to access it, and the caslib is visible across applications. When the CAS session ends, the caslib is not deleted. Global scope is useful when sharing data across sessions, with other users and when working with large data that you do not want to load and unload from memory often.
Traditional SAS comparison: Traditional SAS has predefined libraries such as SASHELP that persist across SAS sessions.
Personal
Personal=Yes: The caslib is only available to you.
Personal=No: The caslib is available to other users.
Active
In traditional SAS, the Work library is the default when a library is not specified in our code. The CAS version of the Work library is the “active” caslib. The active caslib can change. It is typically casuser by default; however, if you have just created a caslib, that new caslib becomes the active caslib. It is best practice to specify the caslib you are working with.
Active=Yes: The caslib is the current default caslib if no other caslib is specified.
Active=No: The caslib is available to use but the caslib name must be specified to use it.
To view the casuser caslib attributes, use the CASLIB statement.
CASLIB casuser list;
Casuser is a global scope caslib. It is currently the active (default) caslib when no other caslib is specified, and it is personal, meaning it is available only to me.
Creating the new caslib ordCas changes the active caslib to ordCas. The new caslib is session scope, and it is not a personal caslib.
CASLIB ordCas PATH="/home/student/Courses/PGVY/OrdersData" LIBREF=ordCas;
CASLIB ordCas list;
Run the following again. Notice that Active = No.
CASLIB casuser list;
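If you would rather have casuser remain the default after creating ordCas, one option is to reset the active caslib through the session options. This is a sketch that assumes the session name CJsSession used earlier in this post.

```sas
/* Make casuser the active caslib again for this session */
CAS CJsSession SESSOPTS=(CASLIB=casuser);

/* List the attributes again to check which caslib is active */
CASLIB casuser LIST;
```

Even with the active caslib reset, naming the caslib explicitly in each step is still the safer habit.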
Files vs. Tables in Caslibs
So far, we have learned that when working with our data in CAS, we must start a CAS session. Then, create a caslib and assign a library reference so it is visible in the Libraries pane and can be referenced in a program as libref.tablename. Finally, we explored the Local, Personal and Active attributes of caslibs.
Let’s talk more about data connections in a caslib. Caslibs can connect to a variety of data sources such as data in the cloud, databases, folder paths and streaming data.
Traditional SAS libraries have one main component, the connection to the data source file. Caslibs have three main components:
The connection to the data source files.
The in-memory portion for data that has been loaded into memory.
Access controls to define permissions to a specific caslib.
Use the CASUTIL procedure to view data source files and in-memory tables in a caslib.
PROC CASUTIL <INCASLIB="caslib-name">;
LIST FILES | TABLES;
QUIT;
INCASLIB= is optional; however, if a caslib is not specified, the active caslib is assumed. This is one instance where I recommend being explicit with the caslib name, since the active caslib can change.
Let’s look at an example. Remember, we mapped a libref to the caslib ordCas. The folder OrdersData has six files, and a variety of file types.
CASLIB ordCas PATH="/home/student/Courses/PGVY/OrdersData" LIBREF=ordCas;
When looking at the ordCas libref in the Libraries pane, the library appears empty:
This is because only in-memory tables are visible in a library mapped to a caslib in the Libraries pane. The following step displays the data source files available in the ordCas caslib:
PROC CASUTIL INCASLIB="ordCas";
LIST files;
QUIT;
The caslib attributes and the six data source files in the caslib are listed.
To see the in-memory tables in the caslib, run the following:
PROC CASUTIL INCASLIB="ordCas";
LIST tables;
QUIT;
No tables have been loaded into memory yet, so only the caslib attributes are visible. The log contains the following message:
The next step is to load data source files into memory. There are multiple methods for loading data into memory. In this post I am going to keep my explanation simple and use the familiar CASUTIL procedure to load a data source file into memory:
PROC CASUTIL;
LOAD CASDATA="orders.csv"
INCASLIB="ordCas"
CASOUT="Orders"
OUTCASLIB="ordCas";
QUIT;
CASDATA="orders.csv" defines the data source file to load into memory.
INCASLIB="ordCas" defines the caslib the data source file is currently in.
CASOUT="Orders" defines the output CAS table name.
OUTCASLIB="ordCas" defines the caslib the in-memory table will be in.
After running the step, the log displays the following messages:
In the Libraries pane, the in-memory table Orders in the ordCas caslib is visible. Notice in-memory tables are marked with a lightning bolt:
The following step lists tables for the ordCas caslib. Notice Orders is listed:
PROC CASUTIL INCASLIB="ordCas";
LIST tables;
QUIT;
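The same procedure can also reload a file over an existing in-memory table and remove a table from memory. The sketch below uses a hypothetical table name, Orders_copy, so that the Orders table loaded above is left alone.

```sas
PROC CASUTIL;
    /* Load orders.csv into a hypothetical table named Orders_copy.
       REPLACE overwrites an in-memory table of the same name. */
    LOAD CASDATA="orders.csv" INCASLIB="ordCas"
         CASOUT="Orders_copy" OUTCASLIB="ordCas" REPLACE;

    /* Remove the copy from memory. The source file orders.csv is
       not affected. QUIET suppresses the error that would otherwise
       be reported if the table is not loaded. */
    DROPTABLE CASDATA="Orders_copy" INCASLIB="ordCas" QUIET;
QUIT;
```

Running LIST tables afterward should show Orders but not Orders_copy.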
I can use libref.tablename in my code to work with this in-memory table.
DATA ordCas.ordersAustralia;
SET ordCas.orders;
WHERE Country="Australia";
RUN;
This DATA step created an in-memory table named ordersAustralia in the ordCas caslib, using the in-memory table Orders as input and keeping only orders placed by customers in Australia. The DATA step ran in CAS and returned 60,320 observations. It ran in CAS because both tables were in a caslib and everything in the DATA step was “CAS enabled” (certain syntax is not allowed in CAS, but this DATA step contained only valid syntax).
When the program executed, the Compute Server recognized the syntax as valid for CAS and sent the step to the CAS Server. The data was distributed across multiple worker nodes for processing. As the workers finished, the data was returned to the controller node, where the table was reassembled and then presented back in SAS Studio for viewing.
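To be explicit about where a DATA step like this runs, the SESSREF= option on the DATA statement names the CAS session to use. This is a sketch based on my understanding of the option: the step is sent to the named session, and it should stop with an error instead of running on the Compute Server if it contains syntax that CAS does not support.

```sas
/* Request that the step run in the CJsSession CAS session */
DATA ordCas.ordersAustralia / SESSREF=CJsSession;
    SET ordCas.orders;
    WHERE Country="Australia";
RUN;
```

After the step runs, the log is the place to confirm where it executed; look for a note stating that the DATA step ran in Cloud Analytic Services.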
In Summary
To work with data on the CAS Server:
Start a CAS session.
Create caslibs and map librefs to view them in the Libraries pane and reference them in code.
Use the CASUTIL procedure to list files or tables in a caslib and load data into memory.
Stay tuned for the next post in this series as we continue to simplify SAS Viya together!
Simplifying SAS Viya Part 1: Choose Your Server