BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
fda2
Obsidian | Level 7

Hi,

I have an executive report on SAS VA and have to load necessary tables into CAS from a SQL server source every hour.  I basically do this:

 

proc sql;
	create table CASUSER.table1 as	
	   select ... /* new columns are generated */			
        from sqlsource1
	left join sqlsource2
        on ...
;
quit;

proc datasets library =PUBLIC memtype = (data view) nolist nowarn;
   delete table1;
quit;

proc casutil incaslib= 'mycasuser' outcaslib= 'public';
   promote casdata= 'table1';
run;

 

 

This operation happens every hour at 30 past. When a user attempts to open the report at 30 past, since the reloading table is not finished yet, the newly created columns can not be found in the report at that time. This yields to the report being in an erroneous state. How to avoid this situation? 

Thanks for reading.

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

@fda2 Good that you found a working approach. 

 

You are loading into a CAS session scope table. Report consumers are only impacted during the time when you drop the global scope table and promote the session scope table to global scope. This process should be rather quick and you can't improve it further.

 

I believe though there are some opportunities to improve on end-to-end performance and robustness. 

 

Current state

  1. end-to-end performance
    1. You can load a CAS table client side (push from SAS compute) or server-side (pull from CAS). 
    2. Client-side loads serial 
    3. Pro SQL create table <caslib>.<table> is client-side loading
  2. robustness
    1. If CAS goes down then all in-memory tables will get wiped out
    2. You have to re-run the load process to re-create the CAS table

Improvement ideas

  1. end-to-end performance
      1. Use Proc FedSQL for server side loading that runs in parallel (likely multi-node)
        1. You can use your existing SQL code for this if your join conditions don't contains expressions that SAS can't push to the database (else use explicit SQL)
          Using FedSQL to Pre-process Database Tables In-Database before Loading to CAS
      2. Load directly into caslib public
        1. The loaded table will be session scope.
        2. Under a caslib there can only be one global scope table with a specific name but you can per session also have a same named session scope table.
          1. These are separate objects and your session scope table won't impact any report consumers that use the same named global scope table.
      3. Drop the global scope table (use proc casutils or proc cas with a table action and not proc datasets).
      4. Promote the session scope table to global scope (this makes the table again available to your report consumers)
  2. Robustness
    1. Save the global scope CAS table back to it's physical permanent location
      1. As part of load process right after promotion to global scope 
      2. Permanent location as defined in the caslib definition for public (Proc casutils save or proc Cas table action)
      3. To allow for reload of data after a server restart without the need to re-run your load process
    2. After a CAS outage reload data directly from the permanent location into CAS
      1. No need to re-run the load process with the SQL query
      2. I would go for just-in-time loading Strategies for Reloading CAS Tables
        1. Just-in-time loading won't impact time for server start-up but still ensures data is available to report consumers already before your next scheduled load process runs.
        2. Server restarts during business hours should be rare so time penalty for the first report consumer after restart likely acceptable.
        3. You're using caslib Public which likely stores the permanent data as memory mapped sashdat files. If volumes aren't massive then load time likely acceptable.

Further considerations for report performance and CAS memory consumption

  • If you know that the reports often create aggregations by one or multiple classification variables then use the partition option to load data that belongs to the same values of classification variables onto the same CAS worker node. This will reduce data movements between CAS workers and though improve query performance.
    Example: Let's say you've got financial transaction data that includes a customer and account column. Each account has only one primary account holder. In such a case you could partition by customer as this will keep all transactions belonging to the same customer on the same CAS worker node. 
  • Consider how many replications you need (how many time to replicate data with copies on multiple nodes)
  • Consider the CAS data types - especially Varchar vs Char.
    • Only use Varchar if the length is >16 characters and if the strings are really varying.
    • If you have fixed length strings always use Char (Varchar consumes extra memory for storing the varying length information).

 

 

View solution in original post

7 REPLIES 7
Patrick
Opal | Level 21

since the reloading table is not finished yet

From how I understand your code you're loading the table first and the outage for your report consumers should be very short (drop/promote). Not sure if doing this within a single Proc Casutil as discussed here would make a significant difference. 

 

Another possibility (can't test. Don't have Viya access anymore😥) could be just-in-time loading. Instead of refreshing the in-memory table you just refresh it's physical leg and then drop the table. The next user accessing the report will then trigger the load. No idea though what the impact is for report consumers that were already working with your report before the refresh.

 

And as another idea:

You could have a load_dttm column in your CAS table and your report filters rows for max(load_dttm). You then append data to the existing CAS table as step 1 and then as step 2 delete the old rows.

 

And depending on your data you could also consider if there is a way that you can do a insert/update/delete instead of a full refresh ...but this would likely cause "wrong" data for the time where the delete already executed but the insert didn't finish yet.

fda2
Obsidian | Level 7

Hi @Patrick, so much thanks for the details. Using single proc casutil actually didn't affect too much in my case. I realized that the create table statement takes a lot of time. So I separated the extraction phase and loading into CAS phase. Since uploading to CAS is fast, downtime is reduced so much that the user does not notice it.

Updating rows instead of full refresh might be considered. But in my case, the table structure doesn't allow me to perform this.

I will keep in mind the just-in-time loading for further studies!

Thank you.

 

Patrick
Opal | Level 21

@fda2 Good that you found a working approach. 

 

You are loading into a CAS session scope table. Report consumers are only impacted during the time when you drop the global scope table and promote the session scope table to global scope. This process should be rather quick and you can't improve it further.

 

I believe though there are some opportunities to improve on end-to-end performance and robustness. 

 

Current state

  1. end-to-end performance
    1. You can load a CAS table client side (push from SAS compute) or server-side (pull from CAS). 
    2. Client-side loads serial 
    3. Pro SQL create table <caslib>.<table> is client-side loading
  2. robustness
    1. If CAS goes down then all in-memory tables will get wiped out
    2. You have to re-run the load process to re-create the CAS table

Improvement ideas

  1. end-to-end performance
      1. Use Proc FedSQL for server side loading that runs in parallel (likely multi-node)
        1. You can use your existing SQL code for this if your join conditions don't contains expressions that SAS can't push to the database (else use explicit SQL)
          Using FedSQL to Pre-process Database Tables In-Database before Loading to CAS
      2. Load directly into caslib public
        1. The loaded table will be session scope.
        2. Under a caslib there can only be one global scope table with a specific name but you can per session also have a same named session scope table.
          1. These are separate objects and your session scope table won't impact any report consumers that use the same named global scope table.
      3. Drop the global scope table (use proc casutils or proc cas with a table action and not proc datasets).
      4. Promote the session scope table to global scope (this makes the table again available to your report consumers)
  2. Robustness
    1. Save the global scope CAS table back to it's physical permanent location
      1. As part of load process right after promotion to global scope 
      2. Permanent location as defined in the caslib definition for public (Proc casutils save or proc Cas table action)
      3. To allow for reload of data after a server restart without the need to re-run your load process
    2. After a CAS outage reload data directly from the permanent location into CAS
      1. No need to re-run the load process with the SQL query
      2. I would go for just-in-time loading Strategies for Reloading CAS Tables
        1. Just-in-time loading won't impact time for server start-up but still ensures data is available to report consumers already before your next scheduled load process runs.
        2. Server restarts during business hours should be rare so time penalty for the first report consumer after restart likely acceptable.
        3. You're using caslib Public which likely stores the permanent data as memory mapped sashdat files. If volumes aren't massive then load time likely acceptable.

Further considerations for report performance and CAS memory consumption

  • If you know that the reports often create aggregations by one or multiple classification variables then use the partition option to load data that belongs to the same values of classification variables onto the same CAS worker node. This will reduce data movements between CAS workers and though improve query performance.
    Example: Let's say you've got financial transaction data that includes a customer and account column. Each account has only one primary account holder. In such a case you could partition by customer as this will keep all transactions belonging to the same customer on the same CAS worker node. 
  • Consider how many replications you need (how many time to replicate data with copies on multiple nodes)
  • Consider the CAS data types - especially Varchar vs Char.
    • Only use Varchar if the length is >16 characters and if the strings are really varying.
    • If you have fixed length strings always use Char (Varchar consumes extra memory for storing the varying length information).

 

 

fda2
Obsidian | Level 7

Thanks again @Patrick. I want to clarify some steps:

You offer to use fedsql(which I don't know now) and loading into PUBLIC instead of CASUSER caslib, right? 

proc sql;
	create table CASUSER.table1 as	
	   select ... /* new columns are generated */			
        from sqlsource1
	left join sqlsource2
        on ...
;
quit;

And for robustness, you suggest to save the global scope CAS table to it's physical permanent location. I couldn't figure out this step exactly. Assuming I applied your suggestions and my table will be on global scope in PUBLIC caslib eventually. Does promote action loads to table to the memory and the save operation saves the table to the disk?

Thanks in advance.

fda2
Obsidian | Level 7

Hi @Patrick , I am confused at another point. I tried to wrote the code according to your guideline, just please ignore the fedsql part. 

proc sql;
	CREATE TABLE PUBLIC.table1 AS
	SELECT
	       ...
	FROM
	       from sqlsource1
	       ...
quit;

proc casutil INCASLIB = 'public';
   DROPTABLE CASDATA = 'table1 ';
quit;

proc casutil INCASLIB= 'public' OUTCASLIB = 'public';
   PROMOTE CASDATA = 'table1 ';
run;

proc casutil;
    SAVE INCASLIB="public" CASDATA = "table1 " REPLACE;
quit;

In this way, I create and drop the table. Then in promote step CAS couldn't found a table named table1, no wonder. Which part is wrong?

Patrick
Opal | Level 21

@fda2 As already said I don't have access to Viya anymore and though everything below untested and based on my understanding how things work. 

 

CAS tables are volatile in-memory tables. If the CAS servers restart they get wiped out.

CAS Libs always define two "legs": The in-memory space and a permanent space (path= or datasrc=). 

 

If you run below statement then you should get info in the log how it's defined and where the permanent space is.

caslib public list;

I believe for the out-of-the-box caslib public the permanent data is path= and a sashdat file. sashdat can only get created when writing a CAS in-memory table to disk.

For other formats like a database table your process could also be to first update the physical table as defined in the caslib, load this table into CAS and promote it.

 

When you want to automate the load into memory after a server startup (like just-in-time loading) then the physical source data must be in the permanent location as defined in the caslib. That's what in below code step 4 does and what's required to support such functionality.

 

Below how I believe your process could look like. I've changed my mind and switched back to using casuser for the local scope table and public for the global scope table because I felt it's easier to understand (and though maintain).

cas mysession;

/* caslibs casuser and public must exist under mysession */

/* 1. create and load CAS in-memory table casuser.table1 */
proc fedsql sessref=mysession;
 create table casuser.table1{options replace=true} as
   select ...
   ;
quit;

/* 2. drop global scope table public.table1 */
proc casutil sessref=mysession; 
	droptable casdata="table1" incaslib="public" quiet; 
quit;

/* 3. promote local scope table casuser.table1 to global scope table public.table1 */
proc casutil sessref=mysession; 
	promote casdata="table1" incaslib="casuser" outcaslib="public" casout="table1"; 
quit;

/* 4. save data of in-memory global scope table public.table1 to its permanent location (=keep permanent data in synch) */
proc casutil sessref=mysession;
	save casdata="table1" incaslib="public" replace;
quit;

/* 5. drop local scope table casuser.table1 */
proc casutil sessref=mysession; 
	droptable casdata="table1" incaslib="casuser" quiet; 
quit;

I felt for development and debugging purposes it's better to use separate Procs per statement but below syntax option should also work.

proc casutil sessref=mysession; 
	/* 2. drop global scope table public.table1 */
	droptable casdata="table1" incaslib="public" quiet; 

	/* 3. promote local scope table casuser.table1 to global scope table public.table1 */
	promote casdata="table1" incaslib="casuser" outcaslib="public" casout="table1"; 

	/* 4. save data of in-memory global scope table public.table1 to its permanent location (=keep permanent data in synch) */
	save casdata="table1" incaslib="public" replace;

	/* 5. drop local scope table casuser.table1 */
	droptable casdata="table1" incaslib="casuser" quiet; 
quit;

And an additional thought:

You're currently using the ootb caslib public which is accessible to all SAS users. If you envisage that now or in the future not all but only a group of SAS users should have such access to the data then I'd recommend to already now define and use a different caslib (defined via Environment Manager). You then can apply permissions to this caslib to control who gets access to the data (according to your security model).

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon Kicks Off on June 11!

Watch the live Hackathon Kickoff to get all the essential information about the SAS Hackathon—including how to join, how to participate, and expert tips for success.

YouTube LinkedIn

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 4451 views
  • 4 likes
  • 2 in conversation