Solved: What happens if a grid node fails

Sajid01 · Posted 05-03-2021 12:44 PM

SAS Grid has been around for quite some time and is known for high availability and failover protection.

I am trying to understand the following hypothetical scenarios. They may not have happened to anybody, but they could.

I would appreciate responses from users who have faced these scenarios, or from SAS Grid experts.
1. Scenario 1: A grid compute node fails and is not accessible because of a network, hardware, or power failure. The rest of the grid installation, including the shared GFS2 file system, is available. What happens to the jobs running on that node? Are they lost forever, or can the grid controller resume them on the other nodes?
Assume every node has a temp workspace local to it.

2. Scenario 2: Assume the failure happens to the Grid Control Server. Again, it is a physical hardware, network, or power failure, as in the earlier case. The rest of the installation, including the shared file system, is up and running.

The common answer is that the other nodes would take over. (That would mean the Grid Control Server and the grid nodes are identical, with an instance of components such as the Web Infrastructure Data Server and SAS Environment Manager running on each machine. I don't think this is the case.)

Assume the operating system is Linux.

JuanS_OCS · Posted 05-04-2021 06:35 AM

Hello @Sajid01,

Good questions. I will give you my view on them, and I suggest you hold on for more, and surely better, voices such as @EdoardoRiva, @RobCollum or @doug_sas.

To me, it all depends on your SAS and infrastructure configuration and component versions. SAS Grid is flexible and allows different configurations. Each configuration has its pros and cons, and the environment behaves differently under each.

For Scenario 1, as a good example, you have the option to configure SASWORK not only on a shared file system but also for failover and auto-resume, provided the SAS job/code has been configured for that as well. The pro is that the grid can potentially resume the job on another Grid Worker, even exactly where it left off (again, potentially). The con is the overhead: more expensive storage, network requirements, and extra SAS coding. If this is not in place, the worst case is that the job is simply cancelled and all new workloads are load-balanced to another Grid Worker node, and only if failover/HA is fully configured as you indicate. Between these two outcomes there are several shades: depending on the grid configuration, the jobs that were running on the failing node could also be relaunched on the failover Grid Worker node. However, this is not always a good idea, as you may corrupt your data records.

Scenario 2 describes the failover of a Grid Controller node. Again, it is all about configuration and design. More HA/failover capability requires more design overhead and resources. SAS Grid by itself allows full HA/failover of a Grid Controller node, with a backup controller fully taking over the management workload exactly as the failed controller left it, but you need to help it do that. And yes, indeed, you need to consider failover/HA for all services, not only the Grid components.

doug_sas · Posted 05-04-2021 07:40 AM

When a grid node fails (regardless of whether it is the grid controller or just a server), the grid controller assumes that all of the jobs on that node have failed too. If the job is restartable and is in a queue that allows restarting, the job is requeued to be run again. If not, the job terminates with a HOST_FAILED error.
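To make the rule above concrete, here is a minimal Python sketch of the decision as described in this post. It is only an illustration: the `Job` class, its fields, and `handle_host_failure` are hypothetical names, not part of SAS Grid Manager or any grid provider's API, and a real controller tracks far more state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    """Hypothetical stand-in for a grid job; not a SAS or provider object."""
    name: str
    host: Optional[str]
    restartable: bool           # the job itself was submitted as restartable
    queue_allows_restart: bool  # the queue it sits in permits requeueing
    status: str = "RUNNING"


def handle_host_failure(jobs: List[Job], failed_host: str) -> List[Job]:
    """Treat every running job on the failed host as failed, then either
    requeue it or end it with HOST_FAILED, following the rule in the post."""
    for job in jobs:
        if job.host != failed_host or job.status != "RUNNING":
            continue  # jobs on healthy hosts are left alone
        if job.restartable and job.queue_allows_restart:
            job.status = "REQUEUED"
            job.host = None  # a new host is chosen when the job is dispatched again
        else:
            job.status = "HOST_FAILED"
    return jobs
```

Both conditions have to hold for a requeue: a restartable job in a queue that does not allow restarting still ends with HOST_FAILED.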

When the node that fails is the grid controller, a grid controller candidate takes over and manages the jobs that are currently running on all the other nodes. How a grid determines which grid controller candidate takes over is dependent on the grid provider.

A requeued job will usually restart from the beginning unless it has done something to help it pick up where it left off. SAS has data step checkpoint restart and label checkpoint restart which can help in these cases. You can submit a SAS program to the grid using the SAS Grid Manager Client Utility (SASGSUB) with an option to take advantage of data step checkpoint restart or label checkpoint restart. It should be noted that when a checkpoint restart option is used, the SASWORK will be changed to the grid shared file system to allow for the checkpoint capability.
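The general idea behind checkpoint restart can be sketched in a few lines of Python. This is a conceptual illustration only, not how SAS implements it: `run_with_checkpoints`, the JSON checkpoint file, and the `fail_at` parameter (used to simulate a node failure) are all assumptions made for the example. The point it shows is why the checkpoint data has to live on storage that survives the node, which is the reason SASWORK moves to the shared file system.

```python
import json
import os


def run_with_checkpoints(steps, checkpoint_path, fail_at=None):
    """Run (name, callable) steps in order, skipping steps already recorded
    as complete in checkpoint_path. Returns the names executed in this run."""
    done = []
    if os.path.exists(checkpoint_path):
        # A previous attempt got part of the way; pick up its record.
        with open(checkpoint_path) as fh:
            done = json.load(fh)

    executed = []
    for name, func in steps:
        if name in done:
            continue  # finished before the failure, do not rerun
        if name == fail_at:
            raise RuntimeError(f"simulated node failure during {name}")
        func()
        executed.append(name)
        done.append(name)
        # Record progress after every step so a restart on another node
        # can see it. The file must be on storage all nodes can reach.
        with open(checkpoint_path, "w") as fh:
            json.dump(done, fh)
    return executed
```

If the checkpoint file were on a node-local disk, the requeued job on another node would find nothing and start from the beginning, which is the default behaviour described above.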

JuanS_OCS · Posted 05-04-2021 07:45 AM

Thanks a lot for chiming in, @doug_sas. There is always something useful to read and learn from you.

Sajid01 · Posted 05-04-2021 10:23 AM

Thanks @JuanS_OCS and @doug_sas for helping me understand.
I wish I could mark both replies as the solution.
