The Transient Scope in CAS and why it matters when using Parquet files

You're already wondering: what is he talking about? You know what table scope is in CAS, right?

We have Global-Scope CAS tables, which are available across all users and all sessions (subject to authorizations), especially used in the visual interfaces.

We have Session-Scope CAS tables, which are available to the CAS session owner only, especially used in programming environments.

And we have Transient-Scope CAS tables.

What are Transient-Scope CAS tables?

Basically, transient CAS tables are tables that are only available for the duration of a CAS action. They are loaded and dropped automatically, on demand. This is not a new capability. It's been there for a while.

How do you create a transient-scope CAS table?

By referencing the source file name in a CAS action call. For instance, this code will automatically load prdsale.sashdat into CAS for the duration of the CAS action:

proc cas ;
   simple.summary result=r status=s /
      inputs={"actual","predict"},
      subSet={"SUM"},
      table={
         caslib="public",
         name="prdsale.sashdat",
         groupBy={"country","product","prodtype"}
      },
      casout={caslib="public",name="prdsale_summary",replace=True,replication=0} ;
quit ;

At the end of the CAS action, the table loaded from prdsale.sashdat is dropped from memory.

Note that if the same file has been explicitly loaded before, as a session table or global table, in the same CASLIB in a table named PRDSALE, this code does not make use of it because it refers to the source file name.

However, when working with database tables, the source database table and the target CAS table may have the same name (no extension), as in this example (BigQuery):

proc cas ;
   simple.summary result=r status=s /
      inputs={"actual","predict"},
      subSet={"SUM"},
      table={
         caslib="casbq",
         name="prdsale",
         groupBy={"country","product","prodtype"}
      },
      casout={caslib="casbq", name="prdsale_summary", replace=True, replication=0} ;
quit ;

In this case, if a table named PRDSALE is already loaded in the CASBQ CASLIB, this code will use it (no on-demand load). If not, this code will load the BigQuery table as a transient CAS table, run the CAS action and drop the CAS table.

Finally, keep in mind that transient tables can only be created through explicit action programming, for example with PROC CAS or a SWAT client. Using a CAS-enabled DATA step or procedures relying on the CAS LIBNAME engine does not involve transient-scope CAS tables. FedSQL also has some load-on-demand capabilities when it deals with database tables.

How can I check this mechanism?

You can monitor the execution of CAS actions by:

Specifying the metrics=true option when you start a CAS session:
```
cas mysession sessopts=(metrics=true) ;
```
Listing the history of executed CAS actions:
```
proc cas ;
   history ;
quit ;
```
Or combining both options

When running a CAS action directly on a source file name, you should see an implicit loadTable CAS action running prior to the coded one:

86   proc cas ;
87      simple.summary result=r status=s /
88         inputs={"actual","predict"},
89         subSet={"SUM"},
90         table={
91            caslib="public",
92            name="prdsale.parquet",
93            groupBy={"country","product","prodtype"}
94         },
95         casout={caslib="public", name="prdsale_summary", replace=True, replication=0} ;
96   quit ;
NOTE: Active Session now MYSESSION.
NOTE: Executing action 'table.loadTable'.
NOTE: Action 'table.loadTable' used (Total process time):
NOTE:       real time               0.364339 seconds
NOTE:       cpu time                0.413444 seconds (113.48%)
NOTE:       total nodes             4 (16 cores)
NOTE:       total memory            58.11G
NOTE:       memory                  2.59M (0.00%)
NOTE: Executing action 'simple.summary'.
NOTE: Action 'simple.summary' used (Total process time):
NOTE:       real time               0.799439 seconds
NOTE:       cpu time                4.634896 seconds (579.77%)
NOTE:       total nodes             4 (16 cores)
NOTE:       total memory            58.11G
NOTE:       memory                  31.31M (0.05%)
NOTE: PROCEDURE CAS used (Total process time):
      real time           0.81 seconds
      cpu time            0.05 seconds

Why could it be useful?

Transient CAS tables can be useful in one-time use scenarios, when you need to run tasks on CAS tables only a few times. They can also make your code shorter by removing explicit load and cleanup steps. Of course, if you run 50 CAS actions on the same file, repeating the on-demand load each time might not be efficient in terms of run time. Also, if memory is tight on your CAS server, transient tables can help your process run with as little memory as possible.

What's the point with Parquet files?

There is a special case when working with Parquet files.

But first, let's review how normal loading works.

When you explicitly load (so not talking about transient scope here) an external file or database table in CAS, creating a session or global CAS table, it is converted internally in CAS to the SASHDAT in-memory format. This is the case for CSV, SAS7BDAT, Parquet, ORC, database tables, etc.

If you load a SASHDAT file, this conversion process does not have to happen since the file is already in the right format.

There are other subtleties (memory-mapping) that I will deliberately not discuss here.

Then subsequent CAS actions use that in-memory SASHDAT structure in CAS for processing.

When you run a CAS action directly on a source file, creating a transient-scope CAS table, the process is similar... except for Parquet files.

The Parquet format has been deeply integrated in CAS. CAS actions can be run directly on Parquet data without it first being converted into CAS SASHDAT in-memory format. This can be very beneficial in some cases.

Let's take an example.

I have a 119MB Parquet file (13 columns, 4,320,000 rows). Its corresponding SASHDAT version (same table) is a 17GB file. This represents the potential size of this table in CAS once it is loaded, with no failover copy.

A traditional process would be to explicitly load the Parquet file as a session-scope CAS table and then run a summary CAS action on top of it:

82   cas mysession sessopts=(metrics=true) ;
NOTE: The session MYSESSION connected successfully to Cloud Analytic Services demo-cas01.demo.sas.com using port 5570. The UUID is 
      28504173-ac64-514f-bcb6-80e251f651f3. The user is demo and the active caslib is CASUSERHDFS(demo).
NOTE: The SAS option SESSREF was updated with the value MYSESSION.
NOTE: The SAS macro _SESSREF_ was updated with the value MYSESSION.
NOTE: The session is using 3 workers.
NOTE: Action 'sessionProp.setSessOpt' used (Total process time):
NOTE:       real time               0.010730 seconds
NOTE:       cpu time                0.016915 seconds (157.64%)
NOTE:       total nodes             4 (16 cores)
NOTE:       total memory            58.11G
NOTE:       memory                  1.07M (0.00%)
NOTE: The CAS statement request to update one or more session options for session MYSESSION completed.
83   
84   proc casutil incaslib="public" outcaslib="public" ;
NOTE: The UUID '28504173-ac64-514f-bcb6-80e251f651f3' is connected using session MYSESSION.
85      load casdata="prdsale.parquet" casout="prdsale" copies=0 replace ;
NOTE: Executing action 'table.loadTable'.
NOTE: Cloud Analytic Services made the file prdsale.parquet available as table PRDSALE in caslib public.
NOTE: Action 'table.loadTable' used (Total process time):
NOTE:       real time               24.176294 seconds
NOTE:       cpu time                62.068091 seconds (256.73%)
NOTE:       total nodes             4 (16 cores)
NOTE:       total memory            58.11G
NOTE:       memory                  1.71G (2.95%)
NOTE: The Cloud Analytic Services server processed the request in 24.176294 seconds.
86   quit ;
NOTE: PROCEDURE CASUTIL used (Total process time):
      real time           24.18 seconds
      cpu time            0.03 seconds
      
87   
88   proc cas ;
89      simple.summary result=r status=s /
90         inputs={"actual","predict"},
91         subSet={"SUM"},
92         table={
93          caslib="public",
94            name="prdsale",
95            groupBy={"country","product","prodtype"}
96         },
97         casout={caslib="public", name="prdsale_summary", replace=True, replication=0} ;
98   quit ;
NOTE: Active Session now MYSESSION.
NOTE: Executing action 'simple.summary'.
NOTE: Action 'simple.summary' used (Total process time):
NOTE:       real time               0.397878 seconds
NOTE:       cpu time                3.731595 seconds (937.87%)
NOTE:       total nodes             4 (16 cores)
NOTE:       total memory            58.11G
NOTE:       memory                  6.82M (0.01%)
NOTE: PROCEDURE CAS used (Total process time):
      real time           0.41 seconds
      cpu time            0.02 seconds

We can say that:

The Parquet file was loaded into CAS and thus converted into SASHDAT format (~17GB) in 24.17 seconds
The simple.summary CAS action ran in 0.39 seconds
Total of 24.56 seconds

A variant process would be to run the summary CAS action directly on the Parquet data file, which involves a transient-scope table:

82   cas mysession sessopts=(metrics=true) ;
NOTE: The session MYSESSION connected successfully to Cloud Analytic Services demo-cas01.demo.sas.com using port 5570. The UUID is 
      25bf32f8-1e02-2343-8625-225513315485. The user is demo and the active caslib is CASUSERHDFS(demo).
NOTE: The SAS option SESSREF was updated with the value MYSESSION.
NOTE: The SAS macro _SESSREF_ was updated with the value MYSESSION.
NOTE: The session is using 3 workers.
NOTE: Action 'sessionProp.setSessOpt' used (Total process time):
NOTE:       real time               0.009630 seconds
NOTE:       cpu time                0.015265 seconds (158.52%)
NOTE:       total nodes             4 (16 cores)
NOTE:       total memory            58.11G
NOTE:       memory                  1.07M (0.00%)
NOTE: The CAS statement request to update one or more session options for session MYSESSION completed.
83   
84   proc cas ;
85      simple.summary result=r status=s /
86         inputs={"actual","predict"},
87         subSet={"SUM"},
88         table={
89          caslib="public",
90            name="prdsale.parquet",
91            groupBy={"country","product","prodtype"}
92         },
93         casout={caslib="public", name="prdsale_summary", replace=True, replication=0} ;
94   quit ;
NOTE: Active Session now MYSESSION.
NOTE: Executing action 'table.loadTable'.
NOTE: Action 'table.loadTable' used (Total process time):
NOTE:       real time               0.333423 seconds
NOTE:       cpu time                0.393961 seconds (118.16%)
NOTE:       total nodes             4 (16 cores)
NOTE:       total memory            58.11G
NOTE:       memory                  2.59M (0.00%)
NOTE: Executing action 'simple.summary'.
NOTE: Action 'simple.summary' used (Total process time):
NOTE:       real time               0.729967 seconds
NOTE:       cpu time                4.622126 seconds (633.20%)
NOTE:       total nodes             4 (16 cores)
NOTE:       total memory            58.11G
NOTE:       memory                  31.31M (0.05%)
NOTE: PROCEDURE CAS used (Total process time):
      real time           0.74 seconds
      cpu time            0.03 seconds

Here we can say that:

An implicit loadTable action has been run
The Parquet file was loaded into CAS without SASHDAT conversion in 0.33 seconds
The simple.summary CAS action ran on the in-memory Parquet data in approximately 0.39 seconds (the 0.33-second loadTable run time is included in the 0.72-second summary run time because the summary action triggered the load)
Total of 0.72 seconds
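
As a quick check of these figures, the implicit load time can be subtracted from the real time reported for the summary action. This is plain Python arithmetic on the two real-time values copied from the log above, not SAS code, and the variable names are only illustrative.

```python
# Real-time values copied from the log above, in seconds.
load_table_real = 0.333423  # implicit table.loadTable
summary_real = 0.729967     # simple.summary, which includes the implicit load

# Time spent in the summary itself once the implicit load is subtracted.
summary_only = summary_real - load_table_real

print(f"summary without the implicit load: {summary_only:.3f} s")
```

The result is just under 0.40 seconds, in line with the approximate 0.39 seconds quoted above and with the 0.39 seconds measured on the explicitly loaded table.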

Which one do you prefer?

Here is a summary of the timings observed with a PATH CASLIB on a 4-node CAS environment. I have included additional measures with the 17GB SASHDAT file. You can see that loading a big SASHDAT file from a PATH CASLIB, even without a SASHDAT conversion to perform, can take much more time than loading the equivalent Parquet file. Of course, there are also ways to improve the loading of SASHDAT files (HDFS, DNFS, S3).

| Scenario | Load time (s) | Summary time (s) | Total time (s) |
| --- | --- | --- | --- |
| Parquet in a session table | 24.17 | 0.39 | 24.56 |
| Parquet in a transient table | 0.33 | 0.39 | 0.72 |
| SASHDAT in a session table | 153.23 | 1.07 | 154.30 |
| SASHDAT in a transient table | 178.76 | 0.33 | 179.09 |

Depending on your use case, running CAS actions directly on transient-scope Parquet data can be very useful. In this particular case, explicitly loading the Parquet file and then running the summary CAS action (24.56 sec.) takes more than 30 times longer than running the same CAS action directly on the Parquet file (0.72 sec.), not to mention the RAM usage (much more RAM is used with the SASHDAT conversion). Put differently, you could have run 30+ CAS actions against the transient Parquet data for the cost of loading the data into SASHDAT format.
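
The trade-off can be sketched with a small cost model. This is a minimal illustration in plain Python, not SAS code, and it rests on two assumptions that are not measured in this article: every action costs the same as simple.summary, and each action on the Parquet file name repeats its own implicit load with no reuse between actions. The function names are hypothetical.

```python
# Timings in seconds, taken from the summary table above.
explicit_load = 24.17      # one-off table.loadTable of the Parquet file
summary_on_loaded = 0.39   # simple.summary on the session-scope table
summary_transient = 0.72   # simple.summary on prdsale.parquet, implicit load included


def explicit_total(n_actions: int) -> float:
    """Load once, then run n_actions summaries on the loaded table."""
    return explicit_load + n_actions * summary_on_loaded


def transient_total(n_actions: int) -> float:
    """Run n_actions summaries, each one paying for its own implicit load."""
    return n_actions * summary_transient


# Ratio between the two approaches for a single action.
single_run_ratio = explicit_total(1) / transient_total(1)

# How many transient actions fit in the time of the explicit load alone.
actions_for_load_cost = explicit_load / summary_transient

# Smallest number of actions for which loading once becomes cheaper overall.
break_even = 1
while transient_total(break_even) <= explicit_total(break_even):
    break_even += 1

print(f"single-action ratio: {single_run_ratio:.1f}x")
print(f"transient actions per explicit load: {actions_for_load_cost:.1f}")
print(f"loading once is cheaper from {break_even} actions on")
```

Under these assumptions the explicit load only pays off after several dozen actions on the same data. Heavier actions, or a different file, would move that point, so the numbers are only indicative.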

Note that not all CAS actions are as simple and fast as simple.summary. You will probably use different CAS actions that may change the equation. But this might still be something you want to evaluate. It could be an easy and quick win. Start by comparing the size of the Parquet file with the size of the corresponding SASHDAT file. If there's a huge difference, then you might see an improvement from using transient Parquet tables.
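
For the example file used in this article, that size comparison can be worked out as follows. This is plain Python arithmetic on the sizes and row count quoted earlier. Treating 1 GB as 1000 MB is an assumption, since the article does not say which unit convention the file sizes use.

```python
# Figures quoted earlier in the article.
parquet_mb = 119           # size of prdsale.parquet
sashdat_mb = 17 * 1000     # 17GB SASHDAT file, assuming 1 GB = 1000 MB
rows = 4_320_000

# How much larger the SASHDAT version is than the Parquet file.
size_ratio = sashdat_mb / parquet_mb

# Average storage per row in each format, in bytes (1 MB taken as 1,000,000 bytes).
parquet_bytes_per_row = parquet_mb * 1_000_000 / rows
sashdat_bytes_per_row = sashdat_mb * 1_000_000 / rows

print(f"SASHDAT / Parquet size ratio: {size_ratio:.0f}x")
print(f"Parquet: {parquet_bytes_per_row:.1f} bytes per row")
print(f"SASHDAT: {sashdat_bytes_per_row:.0f} bytes per row")
```

A ratio this large (well over 100x here) is the kind of difference that suggests transient Parquet tables are worth testing.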

View the table scope documentation.

Thanks to Brian Bowman for his feedback.

Thanks for reading.