We have been using SAS 9.4 M6 with a cluster of 3 Meta nodes, 6 Compute nodes and 2 Web nodes. All of these servers run Linux.
We have been facing the two issues below, which seem to be related, and we are looking for a solution to them.
1. Whenever a SAS program has an error in it, the programmers close EG and start a new session, or SAS EG closes by itself, but the processes belonging to the old session keep running in the background and become orphans. The same happens when the faulty program is run from the command line: it reports an error, but the orphaned process keeps running in the background. These orphan processes keep consuming resources and push CPU usage to 100% or more. How can we resolve this?
2. Because of the high resource usage, that particular compute node stops taking any new load, which is understandable. However, the whole Grid also stops accepting new load or user sessions on the remaining compute nodes for that App server. Load distribution across the Grid hangs, and no new sessions can connect until we restart the Spawner on the node where the process has reached 100% CPU. We do not understand why SAS Grid does not distribute the load to the other compute nodes, which are working fine.
Can someone please help us find a solution to these problems? Sometimes the Grid node whose user processes have reached 100% CPU usage also runs the WIP services, and in that case SAS Studio access as a whole is affected as well.
When you close EG normally it should clear any SAS sessions associated with it. Even if there have been errors, all SAS sessions should close when you close EG normally (File-Close etc.). The only time I've seen orphan SAS sessions remain is if EG hangs or "wheel spins" and you have to kill it with Task Manager.
If this is not the behaviour you are seeing then I suggest opening a track with SAS Tech Support, as the causes of this abnormal behaviour will require further investigation.
It is reasonably common practice to schedule the stopping and restarting of SAS server services on a regular basis, say daily or weekly and this will drop any orphan processes. You could perhaps use OS commands (Unix kill command) instead to remove any SAS sessions that run longer than a particular limit, like say 24 hours.
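As a rough illustration of the age-based cleanup idea, here is a minimal Python sketch that only reports candidate processes and does not kill anything. It assumes a procps-style `ps` that supports the `etimes` (elapsed seconds) column, and it treats any command line containing "sas" as a SAS session. Both are assumptions to adjust for your own deployment, and the function names are hypothetical.

```python
import subprocess

MAX_AGE_SECONDS = 24 * 60 * 60  # the 24-hour example limit mentioned above


def find_long_running(ps_output, max_age=MAX_AGE_SECONDS, name_hint="sas"):
    """Return (pid, user, elapsed_seconds, command) for matching old processes.

    ps_output is text in the layout produced by `ps -eo pid,user,etimes,args`.
    name_hint is a crude substring filter (an assumption, not a SAS convention).
    """
    stale = []
    for line in ps_output.strip().splitlines():
        parts = line.split(None, 3)
        # Skip the header line and anything that does not parse cleanly.
        if len(parts) < 4 or not parts[0].isdigit() or not parts[2].isdigit():
            continue
        pid, user, etimes, command = parts
        if name_hint in command.lower() and int(etimes) > max_age:
            stale.append((int(pid), user, int(etimes), command))
    return stale


def list_processes():
    """Capture the current process table (needs a ps that supports etimes)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,user,etimes,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `find_long_running(list_processes())` gives a list to review by hand. The substring filter will also match long-lived server processes such as the spawner, so the list should be checked before anything is passed to `kill`.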
Activate server-side logging for the workspace server to see what kind of code runs when the issue happens. Maybe you can identify a common event that causes the abnormal CPU usage.
@sandeeppajni2 - Again I'd suggest engaging SAS Tech Support to help with further diagnosing. Perhaps you have a SAS setting or configuration issue that is contributing to this. If you are maxing out on CPU due to orphan SAS processes within a 24-hour period then I'd say something is drastically going wrong somewhere.
It would also be useful to run a series of tests with EG to identify what EG behaviours result in orphan processes. For example, closing EG normally versus abnormally, killing while programs are still running. Is EG the only source of the problem?
@sandeeppajni2 - This is definitely not normal SAS behaviour. Is it possible you have the SAS system option ERRORABEND set on one grid node but not on another? I have no experience with SAS Grid, but I would imagine if you don't align SAS system options across all nodes that may cause problems. Again Tech Support is your best bet to diagnose what is causing this.
Start by educating your users to not simply shut down EG, but issue a "Cancel" for the submitted code.
If a program started from the command line goes into excessive CPU usage, it must contain faulty code that causes an infinite loop; this has to be corrected in the code. Any ERROR would simply cause the SAS session to go into syntax check mode and terminate rather quickly.
If you find a condition that causes an infinite loop without the SAS code being faulty, get in contact with SAS Technical Support, as this would be a bug in the SAS software itself.
I have introduced a measure against excessive usage by users. In the WorkspaceServer_usermods.sh shell script, I added code that checks for the number of processes already active for a given user. If this exceeds a threshold, the script exits right there. This means that users with crashed sessions have to contact the SAS admin who can take care of the orphaned processes.
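The check itself lives in a shell script, so the following is not that code, only a Python sketch of the same logic: count a user's existing SAS processes and refuse a new session once a limit is reached. The limit of 5, the "sas" substring filter and the function names are all hypothetical; the input is assumed to be the output of `ps -eo user=,args=`.

```python
def count_user_sessions(ps_output, user, name_hint="sas"):
    """Count processes owned by `user` whose command line contains name_hint.

    ps_output is text in the layout produced by `ps -eo user=,args=`
    (one "owner command..." pair per line, no header).
    """
    count = 0
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # blank or malformed line
        owner, command = parts
        if owner == user and name_hint in command.lower():
            count += 1
    return count


def may_start_session(ps_output, user, limit=5):
    """Return False once the user already has `limit` or more sessions."""
    return count_user_sessions(ps_output, user) < limit
```

In a startup wrapper, a False result would correspond to exiting before the workspace server is launched, which is what forces the user to contact the admin.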
As a SAS admin, it is part of your job to keep an eye on the state of your server(s). You might want to have a script running on your servers which detects overload conditions and sends you an email or other notification.
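A monitoring script of that kind could look roughly like the Python sketch below. It uses the 1-minute load average per CPU as a stand-in for "overload", which is related to but not the same as CPU usage, and it assumes a mail relay listening on `localhost`. The threshold, addresses and function names are placeholders.

```python
import os
import smtplib
import socket
from email.message import EmailMessage


def is_overloaded(load_1min, cpu_count, threshold=1.0):
    """True when the 1-minute load average per CPU exceeds the threshold."""
    return load_1min / max(cpu_count, 1) > threshold


def build_alert(host, load_1min, cpu_count, sender, recipient):
    """Build the notification email without sending it."""
    msg = EmailMessage()
    msg["Subject"] = f"Overload on {host}: load {load_1min:.1f} on {cpu_count} CPUs"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("Check for orphaned or runaway SAS sessions on this node.")
    return msg


def check_and_notify(sender, recipient, smtp_host="localhost"):
    """Check this host and email an alert if it looks overloaded."""
    load_1min = os.getloadavg()[0]
    cpus = os.cpu_count() or 1
    if not is_overloaded(load_1min, cpus):
        return False
    msg = build_alert(socket.gethostname(), load_1min, cpus, sender, recipient)
    # Assumes an SMTP relay is reachable at smtp_host without authentication.
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
    return True
```

Run from cron every few minutes on each compute node, `check_and_notify(...)` would give an early warning before a node stops accepting sessions.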