Sajid01
Meteorite | Level 14

SAS Grid has been around for quite some time and is known for high availability and failover protection.

I am trying to understand the following hypothetical scenarios. They may not have happened to anybody, but they could.

I would appreciate responses from users who have faced these scenarios or from SAS Grid experts.
1. Scenario 1: A grid compute node fails and is not accessible (the rest of the grid installation, including the shared GFS2 file system, is available). Assume a network, hardware, or power failure. What happens to the jobs running on that node? Are they lost forever, or does the grid controller have the capability to resume them on the other nodes?
Assume every node has a temp workspace local to it.

2. Scenario 2: Assume the failure happens to the Grid Control Server. Again, it is a physical hardware, network, or power failure, as in the earlier case. The rest of the installation, including the shared file system, is up and running.

The common answer is that the other nodes would take over. (Well, that would mean the GC Server and the grid nodes are identical, with multiple instances of different components, such as Web Infrastructure Data Server or SAS Environment Manager, running one on each machine. I don't think this is the case.)


Consider the operating system as Linux.

 

 

1 ACCEPTED SOLUTION

doug_sas
SAS Employee

When a grid node fails (regardless of whether it is the grid controller or just a server), the grid controller assumes that all of the jobs on that node have failed too. If the job is restartable and is in a queue that allows restarting, the job is requeued to be run again. If not, the job terminates with a HOST_FAILED error.
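To make the rule above concrete, here is a minimal Python sketch of the decision as described. It is only an illustration, not SAS or grid-provider code: the `Job` class, its field names, and `handle_host_failure` are hypothetical, and the status strings other than HOST_FAILED are made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    name: str
    host: Optional[str]          # node the job is running on (None once requeued)
    restartable: bool            # the job itself was submitted as restartable
    queue_allows_restart: bool   # the queue it sits in permits restarting
    status: str = "RUNNING"


def handle_host_failure(jobs: List[Job], failed_host: str) -> List[Job]:
    """Treat every running job on the failed host as failed, then requeue or terminate it."""
    for job in jobs:
        if job.host != failed_host or job.status != "RUNNING":
            continue  # jobs on healthy nodes are left alone
        if job.restartable and job.queue_allows_restart:
            job.status = "REQUEUED"   # goes back to the queue to be run again
            job.host = None
        else:
            job.status = "HOST_FAILED"
    return jobs
```

Note that both conditions have to hold: a restartable job in a queue that does not allow restarting still ends with HOST_FAILED.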

 

When the node that fails is the grid controller, a grid controller candidate takes over and manages the jobs that are currently running on all the other nodes. How a grid determines which grid controller candidate takes over is dependent on the grid provider.
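Since the selection policy depends on the grid provider, the following is only one possible policy, sketched in Python under the assumption that the candidates are kept in a configured priority order and the first one that responds takes over. The function name and the `is_alive` callback are hypothetical.

```python
from typing import Callable, List, Optional


def elect_controller(candidates: List[str],
                     is_alive: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate, in configured order, that responds.

    Assumption: an ordered candidate list is just one plausible policy;
    a real grid provider may choose differently.
    """
    for host in candidates:
        if is_alive(host):
            return host
    return None  # no candidate is reachable, so nothing can take over
```

For example, with candidates `["gc1", "gc2", "gc3"]` and `gc1` down, this sketch picks `gc2`, which then manages the jobs still running on the other nodes.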

 

A requeued job will usually restart from the beginning unless it has done something to help it pick up where it left off. SAS has DATA step checkpoint restart and label checkpoint restart, which can help in these cases. You can submit a SAS program to the grid using the SAS Grid Manager Client Utility (SASGSUB) with an option to take advantage of DATA step checkpoint restart or label checkpoint restart. Note that when a checkpoint restart option is used, SASWORK is changed to the grid shared file system to allow for the checkpoint capability.
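The general idea behind checkpoint restart can be sketched in a few lines of Python. This is a conceptual illustration only and not how SAS implements it: the function, the JSON checkpoint file, and the `fail_at` hook are all hypothetical. The point it shows is why the checkpoint data has to live on storage that the other nodes can reach.

```python
import json
import os
from typing import Callable, List, Optional, Tuple


def run_with_checkpoint(steps: List[Tuple[str, Callable[[], None]]],
                        checkpoint_path: str,
                        fail_at: Optional[str] = None) -> List[str]:
    """Run named steps in order, skipping any already recorded in the checkpoint file.

    fail_at is a hypothetical hook that simulates the host dying just before that step.
    Returns the names of the steps executed in this run.
    """
    done: List[str] = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)       # steps completed by an earlier run

    executed: List[str] = []
    for name, action in steps:
        if name in done:
            continue                  # already completed before the failure
        if name == fail_at:
            raise RuntimeError(f"simulated host failure before step {name}")
        action()
        done.append(name)
        executed.append(name)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)        # record progress after every step
    return executed
```

If the first run dies before the second of three steps, a rerun pointed at the same checkpoint file executes only the second and third steps. If the checkpoint file sat on the failed node's local disk, the rerun on another node would find nothing and start from the beginning, which is the same reason the work location moves to the shared file system.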


4 REPLIES
JuanS_OCS
Amethyst | Level 16

Hello @Sajid01,

Good questions. I will give you my view on them, and I suggest you hold on for more, and surely better, voices, such as @EdoardoRiva, @RobCollum or @doug_sas.

 

To me, it all depends on your SAS and infrastructure configuration and component versions. SAS Grid is flexible and allows different configurations. Each configuration has its pros and cons, and the environment behaves differently under each.

 

For Scenario 1, as a good example, you have the option to configure SASWORK not only on a shared file system but also for failover and auto-resume, provided the SAS job/code has been configured for all of that as well. The pro is that the grid could potentially resume the job on another Grid Worker node, even exactly where it left off (again, potentially). The con is the overhead: more expensive storage, network requirements, and SAS coding. If this is not in place, the worst-case scenario is that the job is simply cancelled and all new workloads are load-balanced to another Grid Worker node, and only if failover/HA is fully configured as you indicate. Between these two sub-scenarios and outcomes there are several shades: depending on the grid configuration, the jobs that were running on the failing node could also be relaunched on the failover Grid Worker node. However, this is not always a good idea, as you may corrupt your data records.

 

Scenario 2 describes the failover of a Grid Controller node. Again, it is all about configuration and design. More HA/failover capability requires more design overhead and resources. SAS Grid by itself allows full HA/failover of a Grid Controller node, letting a backup controller fully take over the management workload exactly as the failed controller left it, but you need to help it do that. And yes, indeed, you need to consider failover/HA for all services, not only the Grid components.

 

 

JuanS_OCS
Amethyst | Level 16

Thanks a lot for chiming in, @doug_sas. There is always something useful to read and learn from you.

Sajid01
Meteorite | Level 14

Thanks, @JuanS_OCS and @doug_sas, for helping me understand.
I wish I could mark both replies as the solution.
