How does the CAS controller distribute data to the other CAS worker nodes?

sa_spartan · Posted 01-27-2025 09:45 PM

Hi Communities,

Do you have any idea how CAS controller distributes data to other CAS workers?
For example, if I have 1 controller and 5 workers and I load 100GB of data, will the data size be automatically split to 5 workers like 20GB of data each?

gwootton · Posted 02-03-2025 09:03 AM

I think your understanding is correct: the in-memory table is balanced across the nodes when it is loaded. Tables can become unbalanced, though, if nodes are added or removed, unless CAS table rebalancing is enabled.

CAS Server Topology Changes and CAS Table Balancing

https://communities.sas.com/t5/SAS-Communities-Library/CAS-Server-Topology-Changes-And-CAS-Table-Bal...

CAS Table Rebalancing

https://go.documentation.sas.com/doc/en/sasadmincdc/v_060/calserverscas/n05000viyaservers000000admin...

--
Greg Wootton | Principal Systems Technical Support Engineer

Patrick · Posted 02-03-2025 12:42 PM

"Do you have any idea how CAS controller distributes data to other CAS workers?"

I believe that, depending on the loading approach (client-side or server-side) and the data source, the data can also be loaded directly to the workers without passing through the controller.

"will the data size be automatically split to 5 workers like 20GB of data each?"

By default there is data replication for resilience, so that when one worker goes offline you still have access to all the data and can continue working. I'm no longer sure whether the replication default is 2 or 3. If it's 3, that would mean 300GB in total, so you'd get 60GB per worker (100GB x 3 copies / 5 workers).
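To make that arithmetic easy to rerun with your own numbers, here is a minimal Python sketch. It is plain arithmetic, not a call into CAS: the function name `per_worker_gb` and the parameter `total_copies` are made up for illustration, and it assumes the blocks (including any replicas) are spread perfectly evenly across the workers. It does not model where CAS actually keeps the replicated blocks or how much the data expands in memory.

```python
def per_worker_gb(table_gb: float, workers: int, total_copies: int = 1) -> float:
    """Approximate data volume per worker when blocks are spread evenly.

    total_copies counts every copy of the data, including the primary one.
    Hypothetical helper for back-of-the-envelope sizing only.
    """
    if workers < 1 or total_copies < 1:
        raise ValueError("workers and total_copies must be at least 1")
    return table_gb * total_copies / workers


if __name__ == "__main__":
    # 100GB table on 5 workers, with 1, 2 or 3 copies in total
    for copies in (1, 2, 3):
        print(f"{copies} copies -> {per_worker_gb(100, 5, copies)} GB per worker")
```

With one copy this gives the 20GB per worker from the original question, and with three copies the 60GB mentioned above.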

...and then: data is expanded in memory, so what is compressed on disk might take up quite a bit more memory in CAS (there is some sort of compression available that, depending on your data, can be quite effective). Also, if you map an on-disk CHAR to a CAS Varchar, be aware that a CAS Varchar uses 16 bytes to start with (a bit counterintuitive, but that's what it is).

If you choose to define your CAS table as partitioned, then depending on the variable(s) used for partitioning, your data can be skewed, with different volumes on the different nodes.

...yep, CAS doesn't make things easier. You need to know your data and how you want to use it so you can plan ahead how to load it. For example, for a smaller table that serves as a lookup table, it might be worth replicating the table onto all nodes, because that will reduce data movement between nodes during lookup (and thus increase performance).
Or if you know that you'll often aggregate by some variable(s), for example customer, then it might be worth using these variable(s) as the partition key so that rows with the same customer get loaded onto the same worker (which again will reduce data movement between workers when aggregating).
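
To illustrate both effects of a partition key (rows with the same key end up together, but volumes can become skewed), here is a small Python simulation. It is only a sketch under assumptions: CAS's real placement logic is not shown here, the CRC32-modulo assignment merely stands in for "same key, same worker", and the function names and customer data are invented for the example.

```python
import zlib
from collections import Counter


def assign_worker(key: str, workers: int) -> int:
    """Map a partition key to a worker index with a stable hash.

    Illustrative only: this is NOT how CAS assigns partitions.
    """
    return zlib.crc32(key.encode("utf-8")) % workers


def rows_per_worker(keys, workers: int) -> list:
    """Count how many rows land on each worker for the given key values."""
    counts = Counter(assign_worker(k, workers) for k in keys)
    return [counts.get(w, 0) for w in range(workers)]


if __name__ == "__main__":
    # Hypothetical data: one very large customer plus 100 small ones
    keys = ["big_customer"] * 900 + [f"cust_{i}" for i in range(100)]
    print(rows_per_worker(keys, 5))
```

All 900 rows of the large customer land on a single worker, which is good for aggregating by customer but leaves that worker holding most of the table.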
