gwootton
SAS Super FREQ
When a job is submitted to the grid, the grid puts the job in a pending state while it determines where to send the job for execution. If new jobs are not being executed anywhere, that suggests there are no eligible hosts for those jobs. This could be because the nodes have all exceeded their configured thresholds (i.e. the nodes are all busy), have no available job slots (i.e. the nodes are full), or have been administratively closed. It's also possible to limit the number of jobs a single user can submit to a queue or host, so it could be a user-level limitation.
Stopping the spawner would cause the spawner to terminate any jobs it submitted to the grid, which could clear up that type of issue. You would need to examine the grid hosts listing to determine why a given host is not accepting new jobs.
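To illustrate the eligibility checks described above, here is a minimal Python sketch. All names and the data model are hypothetical, invented for illustration; this is not SAS or LSF code, just the logic of "open, has a free slot, below thresholds":

```python
# Hypothetical sketch of the eligibility checks a grid scheduler applies
# before dispatching a pending job. Names are illustrative, not SAS APIs.
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    is_open: bool       # administratively open?
    used_slots: int
    max_slots: int
    load_ok: bool       # below configured load thresholds?

def eligible(host: Host) -> bool:
    """A host can accept a job only if it is open, has a free job slot,
    and has not exceeded its load thresholds."""
    return host.is_open and host.used_slots < host.max_slots and host.load_ok

hosts = [
    Host("node1", is_open=True,  used_slots=8, max_slots=8, load_ok=True),   # full
    Host("node2", is_open=False, used_slots=2, max_slots=8, load_ok=True),   # closed
    Host("node3", is_open=True,  used_slots=3, max_slots=8, load_ok=False),  # busy
]

# With no eligible hosts, the job stays pending.
print([h.name for h in hosts if eligible(h)])  # []
```

If every host fails one of these checks, nothing can be dispatched and new jobs simply accumulate in the pending state, which matches the symptom described.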
--
Greg Wootton | Principal Systems Technical Support Engineer
sandeeppajni2
Obsidian | Level 7
Hi @gwootton

I would assume just the opposite: as soon as we restart the Spawner service of that particular node's App server, the grid is able to work freely and start distributing jobs again. In our case the other nodes have slots available, but the grid is somehow stuck sending the load to that one particular node where usage is 100%, and it does not automatically transfer the load to the other nodes for new session requests. Is this normal behavior for the system?
gwootton
SAS Super FREQ
The object spawner is not responsible for starting grid jobs on a given node. The grid daemon processes do this. In the case of an LSF grid this is the sbatchd process, with a SAS Workload Orchestrator grid it's the sgmg process. So a node could have grid jobs running on it and no object spawner process running on it. The object spawner submits jobs to the grid, and it is the grid that decides where to run that job.

The grid decides which host will run the job before the job starts. That host selection is made based on which hosts are eligible to run the job (the host has available job slots or is otherwise in an open/ready state, and has any resources/tags requested by the job), and then the grid decides which eligible host to send the job to based on a comparison of load metrics, for example which host has the lowest CPU run-queue average taken over 1 minute (r1m).
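The two-step selection described above (filter eligible hosts, then compare a load metric) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual SWO or LSF scheduler; the tuple layout and function name are invented:

```python
# Illustrative sketch (not SAS/LSF code): among hosts with free slots,
# dispatch to the one with the lowest 1-minute run-queue average (r1m).
def pick_host(hosts):
    """hosts: list of (name, free_slots, r1m) tuples."""
    eligible = [h for h in hosts if h[1] > 0]   # step 1: eligibility filter
    if not eligible:
        return None                             # no eligible host: job stays pending
    return min(eligible, key=lambda h: h[2])[0] # step 2: lowest load wins

hosts = [
    ("compute1", 0, 0.5),  # no free slots, so its low load does not matter
    ("compute2", 4, 2.1),
    ("compute3", 2, 0.9),
]
print(pick_host(hosts))  # compute3
```

Note that compute1's low load is irrelevant because it fails the eligibility filter first; the load comparison only happens among hosts that could run the job at all.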
--
Greg Wootton | Principal Systems Technical Support Engineer
sandeeppajni2
Obsidian | Level 7
Hi @gwootton

Thanks for explaining the functioning of the Grid and Spawner processes and the order in which they operate. So it is not the grid that has the issue, as the Spawner itself is not allowing the jobs to go to the grid for distribution (we use SWO/sgmg).
However, I am failing to understand why, after we just restart that particular App server's Spawner service, everything becomes normal (we have 7 SAS App servers). Why are all the new jobs going to that particular faulty Spawner rather than to the other compute nodes' Spawners? There must be an algorithm or logic in SAS that is choosing that Spawner to make the new user connections, and that is what I am trying to understand. Hope I am making sense.
gwootton
SAS Super FREQ
Which Object Spawner to connect to is a decision made by the client rather than by the Object Spawner. When a client like Enterprise Guide or SAS Studio needs a Workspace Server, it will try to talk to the first Object Spawner it finds in Metadata. That Object Spawner would then either launch the Workspace Server itself or redirect to another Object Spawner for launching. In a grid configuration, the launch process is to submit the Workspace Server command as a grid job.

You may want to engage with SAS Technical Support so we can get a better idea of the behavior you are seeing and review the logs (Object Spawner and Grid) during a failure event.
--
Greg Wootton | Principal Systems Technical Support Engineer
sandeeppajni2
Obsidian | Level 7
Hi @gwootton

I think you are trying to say that which App server to connect to is decided by the end user, while which compute node's Spawner the request goes to is decided in the background. Am I correct?
So, on the basis of what algorithm or logic does SAS decide to send the request to a specific Object Spawner? That is what I am trying to understand, as somehow SAS keeps sending the load to the faulty Spawner or grid node. I have scheduled some sessions with the SAS SMEs and will try to understand the logic behind job distribution and how it can be optimized in our case.
gwootton
SAS Super FREQ
In Enterprise Guide, for example, when you expand the "SASApp" context, the client (Enterprise Guide) looks in Metadata for a list of host/port combinations for the Workspace Server in that context (i.e. compute1.example.com:8591, compute2.example.com:8591, computen.example.com:8591). It will then attempt to connect to the first one it finds (compute1.example.com:8591) to request a Workspace Server. The process it is connecting to is the Object Spawner. That Object Spawner would then decide whether to handle the request itself or redirect to another spawner, but it has no control over whether Enterprise Guide initiated the connection to compute1.
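The client-side flow described above can be sketched as follows. This is a rough illustration, not the real IOM connection logic: the function names, the fake connector, and the failure simulation are all invented for the example. The key point it shows is that the client walks the Metadata host list in order and sticks with the first spawner that answers:

```python
# Rough sketch of the client-side connection flow (illustrative, not SAS IOM code).
def request_workspace(spawner_hosts, connect):
    """Try spawners in Metadata order; return the first successful connection.
    The first reachable spawner may still redirect the launch elsewhere."""
    for host in spawner_hosts:
        try:
            return connect(host)   # e.g. compute1.example.com:8591 is tried first
        except ConnectionError:
            continue               # move on only if the spawner is unreachable
    raise RuntimeError("no Object Spawner reachable")

hosts = ["compute1.example.com:8591", "compute2.example.com:8591"]

def fake_connect(host):
    # Simulate compute1 being down, so the client falls through to compute2.
    if host.startswith("compute1"):
        raise ConnectionError
    return host

print(request_workspace(hosts, fake_connect))  # compute2.example.com:8591
```

This also illustrates the failure mode discussed in this thread: a spawner that is unhealthy but still accepting connections keeps receiving every request, because the client only moves to the next host when the connection itself fails.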
--
Greg Wootton | Principal Systems Technical Support Engineer
AhmedAl_Attar
Rhodochrosite | Level 12

@sandeeppajni2 

Here is a somewhat old SAS paper, https://support.sas.com/resources/papers/proceedings/pdfs/sgf2008/391-2008.pdf, that touches on what @gwootton was trying to explain to you.

This paper discusses the features of SAS 9.2. I'm not sure which version/release of SAS your organization is using, but you can always check the SAS online documentation for your release to find the latest updates, enhancements, and features added since 9.2.

 

Hope this provides additional clarity on how things flow when dealing with SAS Grid Manager.

AhmedAl_Attar
Rhodochrosite | Level 12

Hi @sandeeppajni2 

You can control this behavior at multiple levels on the SAS server side:

  1. At the SASApp context level:
    • Add -ERRORABEND (along with any other site-specific settings) to the [SAS Config]/Lev1/SASApp/sasv9_usermods.cfg file
  2. At the individual server level (BatchServer, WorkspaceServer, etc.):
    • Add -ERRORABEND to that server's usermods file, e.g. [SAS Config]/Lev1/SASApp/WorkspaceServer/sasv9_usermods.cfg
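For example, the server-level change in step 2 is a one-line addition to the usermods file (path as shown in the post; adjust the Lev and context names for your deployment):

```
/* [SAS Config]/Lev1/SASApp/WorkspaceServer/sasv9_usermods.cfg */
-ERRORABEND
```

Restart the relevant server/spawner after editing so the option takes effect for new sessions.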

Hope this helps

sandeeppajni2
Obsidian | Level 7
Hi @AhmedAl_Attar

That is something we would not want: having end users' sessions killed. Even when the SAS session is killed, the background jobs keep running, eventually consuming all the resources on that particular compute node. So we end up with the second problem that I described.
AhmedAl_Attar
Rhodochrosite | Level 12
Can you please explain how these background jobs are kept alive if the main SAS server session that was running them is terminated?
The second question that comes to mind: are you talking about background SAS jobs or Linux bash jobs?
sandeeppajni2
Obsidian | Level 7
Hi @AhmedAl_Attar

It's the user's process ID that we see still running in the OS, even when the program containing errors (run through the command line or through EG) is shown as closed or terminated.
AhmedAl_Attar
Rhodochrosite | Level 12
Once you add the -ERRORABEND option, you should no longer see these processes lingering after a failure.
The whole purpose of this option is to terminate the SAS session/process as soon as an error is encountered.

You can experiment with this option (-ERRORABEND) as a command-line argument to the SAS invocation command, using a program that has a deliberate error, such as:
$ sas -sysin someProgramWithErrors.sas -errorabend -nodms -noterminal ; rc=$?

