The documentation at https://go.documentation.sas.com/doc/en/calcdc/3.5/calserverscas/n08000viyaservers000000admin.htm describes this option:
env.CAS_HEARTBEAT_LOST_TIMEOUT='interval'
If a worker node does not respond within the timeout interval, the controller treats it as a lost worker. We’d like to know what “treat it as lost” means: does the controller switch to the redundant data blocks, or does it kill the CAS process on that worker? Or is it possible that a zombie process remains on that worker and the controller keeps trying to get data from it?
Here is our scenario:
20:19 A hardware error is logged in /var/log/messages
20:20 The controller logs this message: 'A connection to peer node wk01.com was lost due to socket communication error, with status 104 ()'
The next day, however, we found that some of the global tables (with COPY=1) were broken.
Do we need to write a script to detect this kind of situation and kill the CAS process on that worker? Would that help trigger the redundant data blocks to become active?
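To make the question concrete, the detection part of such a script might look like the minimal Python sketch below. It only matches the one controller message quoted above; the function name and the way log lines are fed in are our own assumptions, and it deliberately does not kill any process.

```python
import re

# Pattern built from the single controller message quoted above.
# Other wordings of the message are not covered (assumption).
LOST_PEER = re.compile(
    r"A connection to peer node (?P<host>\S+) was lost due to socket "
    r"communication error, with status (?P<status>\d+)"
)

def find_lost_workers(lines):
    """Return (host, status) pairs for every log line that matches."""
    hits = []
    for line in lines:
        match = LOST_PEER.search(line)
        if match:
            hits.append((match.group("host"), int(match.group("status"))))
    return hits

if __name__ == "__main__":
    # Sample line copied from our controller log.
    sample = [
        "A connection to peer node wk01.com was lost due to socket "
        "communication error, with status 104 ()",
    ]
    for host, status in find_lost_workers(sample):
        print(f"lost worker: {host} (status {status})")
```

What the script should do once it finds a lost worker is exactly the part we are unsure about.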
Many thanks; any suggestions would be helpful.
Best,
Stacey