Solved: SAS Viya Report state is erroneous when loading CAS tables

fda2 · Posted 12-10-2024 06:42 AM

Hi,

I have an executive report on SAS VA and have to load necessary tables into CAS from a SQL server source every hour. I basically do this:

proc sql;
	create table CASUSER.table1 as	
	   select ... /* new columns are generated */			
        from sqlsource1
	left join sqlsource2
        on ...
;
quit;

proc datasets library =PUBLIC memtype = (data view) nolist nowarn;
   delete table1;
quit;

proc casutil incaslib= 'mycasuser' outcaslib= 'public';
   promote casdata= 'table1';
run;

This operation happens every hour at 30 past. When a user attempts to open the report at 30 past, since the reloading table is not finished yet, the newly created columns can not be found in the report at that time. This yields to the report being in an erroneous state. How to avoid this situation?

Thanks for reading.

Patrick · Posted 12-11-2024 07:55 PM

@fda2 Good that you found a working approach.

You are loading into a CAS session scope table. Report consumers are only impacted during the time when you drop the global scope table and promote the session scope table to global scope. This process should be rather quick and you can't improve it further.

I believe though there are some opportunities to improve on end-to-end performance and robustness.

Current state

end-to-end performance
1. You can load a CAS table client side (push from SAS compute) or server-side (pull from CAS).
2. Client-side loads serial
3. Pro SQL create table <caslib>.<table> is client-side loading
robustness
1. If CAS goes down then all in-memory tables will get wiped out
2. You have to re-run the load process to re-create the CAS table

Improvement ideas

end-to-end performance
Robustness
1. Save the global scope CAS table back to it's physical permanent location
  1. As part of load process right after promotion to global scope
  2. Permanent location as defined in the caslib definition for public (Proc casutils save or proc Cas table action)
  3. To allow for reload of data after a server restart without the need to re-run your load process
2. After a CAS outage reload data directly from the permanent location into CAS
  1. No need to re-run the load process with the SQL query
  2. I would go for just-in-time loading Strategies for Reloading CAS Tables
    1. Just-in-time loading won't impact time for server start-up but still ensures data is available to report consumers already before your next scheduled load process runs.
    2. Server restarts during business hours should be rare so time penalty for the first report consumer after restart likely acceptable.
    3. You're using caslib Public which likely stores the permanent data as memory mapped sashdat files. If volumes aren't massive then load time likely acceptable.

Further considerations for report performance and CAS memory consumption

If you know that the reports often create aggregations by one or multiple classification variables then use the partition option to load data that belongs to the same values of classification variables onto the same CAS worker node. This will reduce data movements between CAS workers and though improve query performance.
Example: Let's say you've got financial transaction data that includes a customer and account column. Each account has only one primary account holder. In such a case you could partition by customer as this will keep all transactions belonging to the same customer on the same CAS worker node.
Consider how many replications you need (how many time to replicate data with copies on multiple nodes)
Consider the CAS data types - especially Varchar vs Char.
- Only use Varchar if the length is >16 characters and if the strings are really varying.
- If you have fixed length strings always use Char (Varchar consumes extra memory for storing the varying length information).

View solution in original post

Patrick · Posted 12-10-2024 08:37 AM

since the reloading table is not finished yet

From how I understand your code you're loading the table first and the outage for your report consumers should be very short (drop/promote). Not sure if doing this within a single Proc Casutil as discussed here would make a significant difference.

Another possibility (can't test. Don't have Viya access anymore😥) could be just-in-time loading. Instead of refreshing the in-memory table you just refresh it's physical leg and then drop the table. The next user accessing the report will then trigger the load. No idea though what the impact is for report consumers that were already working with your report before the refresh.

And as another idea:

You could have a load_dttm column in your CAS table and your report filters rows for max(load_dttm). You then append data to the existing CAS table as step 1 and then as step 2 delete the old rows.

And depending on your data you could also consider if there is a way that you can do a insert/update/delete instead of a full refresh ...but this would likely cause "wrong" data for the time where the delete already executed but the insert didn't finish yet.

fda2 · Posted 12-11-2024 10:13 AM

Hi @Patrick, so much thanks for the details. Using single proc casutil actually didn't affect too much in my case. I realized that the create table statement takes a lot of time. So I separated the extraction phase and loading into CAS phase. Since uploading to CAS is fast, downtime is reduced so much that the user does not notice it.

Updating rows instead of full refresh might be considered. But in my case, the table structure doesn't allow me to perform this.

I will keep in mind the just-in-time loading for further studies!

Thank you.

Patrick · Posted 12-11-2024 07:55 PM

@fda2 Good that you found a working approach.

You are loading into a CAS session scope table. Report consumers are only impacted during the time when you drop the global scope table and promote the session scope table to global scope. This process should be rather quick and you can't improve it further.

I believe though there are some opportunities to improve on end-to-end performance and robustness.

Current state

end-to-end performance
1. You can load a CAS table client side (push from SAS compute) or server-side (pull from CAS).
2. Client-side loads serial
3. Pro SQL create table <caslib>.<table> is client-side loading
robustness
1. If CAS goes down then all in-memory tables will get wiped out
2. You have to re-run the load process to re-create the CAS table

Improvement ideas

end-to-end performance
Robustness
1. Save the global scope CAS table back to it's physical permanent location
  1. As part of load process right after promotion to global scope
  2. Permanent location as defined in the caslib definition for public (Proc casutils save or proc Cas table action)
  3. To allow for reload of data after a server restart without the need to re-run your load process
2. After a CAS outage reload data directly from the permanent location into CAS
  1. No need to re-run the load process with the SQL query
  2. I would go for just-in-time loading Strategies for Reloading CAS Tables
    1. Just-in-time loading won't impact time for server start-up but still ensures data is available to report consumers already before your next scheduled load process runs.
    2. Server restarts during business hours should be rare so time penalty for the first report consumer after restart likely acceptable.
    3. You're using caslib Public which likely stores the permanent data as memory mapped sashdat files. If volumes aren't massive then load time likely acceptable.

Further considerations for report performance and CAS memory consumption

If you know that the reports often create aggregations by one or multiple classification variables then use the partition option to load data that belongs to the same values of classification variables onto the same CAS worker node. This will reduce data movements between CAS workers and though improve query performance.
Example: Let's say you've got financial transaction data that includes a customer and account column. Each account has only one primary account holder. In such a case you could partition by customer as this will keep all transactions belonging to the same customer on the same CAS worker node.
Consider how many replications you need (how many time to replicate data with copies on multiple nodes)
Consider the CAS data types - especially Varchar vs Char.
- Only use Varchar if the length is >16 characters and if the strings are really varying.
- If you have fixed length strings always use Char (Varchar consumes extra memory for storing the varying length information).

Patrick · Posted 12-12-2024 12:54 AM

@fda2 To add to below here two more articles worth reading.

Seriously Serial or Perfectly Parallel Data Transfer with CAS

Seriously Serial or Perfectly Parallel Data Transfer with SAS® Viya®

fda2 · Posted 12-12-2024 04:26 AM

Thanks again @Patrick. I want to clarify some steps:

You offer to use fedsql(which I don't know now) and loading into PUBLIC instead of CASUSER caslib, right?

proc sql;
	create table CASUSER.table1 as	
	   select ... /* new columns are generated */			
        from sqlsource1
	left join sqlsource2
        on ...
;
quit;

And for robustness, you suggest to save the global scope CAS table to it's physical permanent location. I couldn't figure out this step exactly. Assuming I applied your suggestions and my table will be on global scope in PUBLIC caslib eventually. Does promote action loads to table to the memory and the save operation saves the table to the disk?

Thanks in advance.

fda2 · Posted 12-12-2024 10:01 AM

Hi @Patrick , I am confused at another point. I tried to wrote the code according to your guideline, just please ignore the fedsql part.

proc sql;
	CREATE TABLE PUBLIC.table1 AS
	SELECT
	       ...
	FROM
	       from sqlsource1
	       ...
quit;

proc casutil INCASLIB = 'public';
   DROPTABLE CASDATA = 'table1 ';
quit;

proc casutil INCASLIB= 'public' OUTCASLIB = 'public';
   PROMOTE CASDATA = 'table1 ';
run;

proc casutil;
    SAVE INCASLIB="public" CASDATA = "table1 " REPLACE;
quit;

In this way, I create and drop the table. Then in promote step CAS couldn't found a table named table1, no wonder. Which part is wrong?

Patrick · Posted 12-12-2024 06:40 PM

@fda2 As already said I don't have access to Viya anymore and though everything below untested and based on my understanding how things work.

CAS tables are volatile in-memory tables. If the CAS servers restart they get wiped out.

CAS Libs always define two "legs": The in-memory space and a permanent space (path= or datasrc=).

If you run below statement then you should get info in the log how it's defined and where the permanent space is.

caslib public list;

I believe for the out-of-the-box caslib public the permanent data is path= and a sashdat file. sashdat can only get created when writing a CAS in-memory table to disk.

For other formats like a database table your process could also be to first update the physical table as defined in the caslib, load this table into CAS and promote it.

When you want to automate the load into memory after a server startup (like just-in-time loading) then the physical source data must be in the permanent location as defined in the caslib. That's what in below code step 4 does and what's required to support such functionality.

Below how I believe your process could look like. I've changed my mind and switched back to using casuser for the local scope table and public for the global scope table because I felt it's easier to understand (and though maintain).

cas mysession;

/* caslibs casuser and public must exist under mysession */

/* 1. create and load CAS in-memory table casuser.table1 */
proc fedsql sessref=mysession;
 create table casuser.table1{options replace=true} as
   select ...
   ;
quit;

/* 2. drop global scope table public.table1 */
proc casutil sessref=mysession; 
	droptable casdata="table1" incaslib="public" quiet; 
quit;

/* 3. promote local scope table casuser.table1 to global scope table public.table1 */
proc casutil sessref=mysession; 
	promote casdata="table1" incaslib="casuser" outcaslib="public" casout="table1"; 
quit;

/* 4. save data of in-memory global scope table public.table1 to its permanent location (=keep permanent data in synch) */
proc casutil sessref=mysession;
	save casdata="table1" incaslib="public" replace;
quit;

/* 5. drop local scope table casuser.table1 */
proc casutil sessref=mysession; 
	droptable casdata="table1" incaslib="casuser" quiet; 
quit;

I felt for development and debugging purposes it's better to use separate Procs per statement but below syntax option should also work.

proc casutil sessref=mysession; 
	/* 2. drop global scope table public.table1 */
	droptable casdata="table1" incaslib="public" quiet; 

	/* 3. promote local scope table casuser.table1 to global scope table public.table1 */
	promote casdata="table1" incaslib="casuser" outcaslib="public" casout="table1"; 

	/* 4. save data of in-memory global scope table public.table1 to its permanent location (=keep permanent data in synch) */
	save casdata="table1" incaslib="public" replace;

	/* 5. drop local scope table casuser.table1 */
	droptable casdata="table1" incaslib="casuser" quiet; 
quit;

And an additional thought:

You're currently using the ootb caslib public which is accessible to all SAS users. If you envisage that now or in the future not all but only a group of SAS users should have such access to the data then I'd recommend to already now define and use a different caslib (defined via Environment Manager). You then can apply permissions to this caslib to control who gets access to the data (according to your security model).

SAS Viya Report state is erroneous when loading CAS tables

Re: SAS Viya Report state is erroneous when loading CAS tables

Re: SAS Viya Report state is erroneous when loading CAS tables

Re: SAS Viya Report state is erroneous when loading CAS tables

Re: SAS Viya Report state is erroneous when loading CAS tables

Re: SAS Viya Report state is erroneous when loading CAS tables

Re: SAS Viya Report state is erroneous when loading CAS tables

Re: SAS Viya Report state is erroneous when loading CAS tables

Re: SAS Viya Report state is erroneous when loading CAS tables

Registration is open

SAS Training: Just a Click Away