Re: SAS creating orphan jobs/processes when a SAS program has some error, taking server resources

sandeeppajni2 · Posted 03-11-2022 02:13 AM

Hi,

We are running SAS 9.4 M6 on a cluster of 3 Meta nodes, 6 Compute nodes and 2 Web nodes. All of these servers run Linux.

We are facing the two issues below, which seem to be related, and we are looking for a solution to them.

1. Whenever a SAS program has an error in it, the programmers close EG and start a new session, or SAS EG closes itself, but the processes belonging to the old session keep running in the background and become orphans. The same happens when the faulty program is run from the command line: it gives an error, but the orphan process keeps running in the background. These orphan processes keep consuming resources and drive CPU usage to 100% or more. How can we resolve this?

2. Because of the high resource usage, the affected compute node stops taking any new load, which is understandable. However, the whole Grid also stops accepting new load or user sessions on the remaining compute nodes for that App server. The Grid stays hung in distributing load, and no new sessions can connect until we restart the Spawner on the node where the processes have reached 100% CPU. We do not understand why SAS Grid is not distributing the load to the other compute nodes, which are working fine.

Can someone please help with a solution to these problems? Sometimes the Grid node where user processes reach 100% CPU also hosts the WIP services, and that affects SAS Studio access as a whole.

Thanks

SASKiwi · Posted 03-11-2022 02:31 AM

When you close EG normally it should clear any SAS sessions associated with it. Even if there have been errors, as long as you close EG normally (File-Close etc.) all SAS sessions should close. The only time I've seen orphan SAS sessions remain is when EG hangs or "wheel spins" and you have to kill it with Task Manager.

If this is not the behaviour you are seeing then I suggest opening a track with SAS Tech Support, as the causes of this abnormal behaviour will require further investigation.

It is reasonably common practice to schedule the stopping and restarting of SAS server services on a regular basis, say daily or weekly and this will drop any orphan processes. You could perhaps use OS commands (Unix kill command) instead to remove any SAS sessions that run longer than a particular limit, like say 24 hours.
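
The kill-by-age idea could be sketched roughly as follows. This is a minimal sketch, not a tested admin tool: it assumes a Linux `ps` from procps-ng (for the `etimes` column, elapsed seconds) and that SAS sessions show up with the command name `sas`. The function name `old_sas_pids` is made up for illustration. It only lists candidate PIDs so they can be reviewed before anything is killed.

```shell
#!/bin/bash
# Print PIDs of SAS processes that have been running longer than a limit.
# Reads "pid user elapsed_seconds command" lines on stdin.
old_sas_pids() {
    local limit_seconds=$1
    awk -v limit="$limit_seconds" '$4 == "sas" && $3 > limit { print $1 }'
}

# Dry run: list SAS processes older than 24 hours (86400 seconds).
ps -eo pid=,user=,etimes=,comm= | old_sas_pids 86400
```

Once the list looks right, the output could be piped to `xargs -r kill` from a scheduled job.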

sandeeppajni2 · Posted 03-11-2022 02:46 AM

Hi @SASKiwi

Thanks for your quick reply. Everything you suggested for managing server resources is already in place: a weekly reboot, and a script scheduled every 12 hours that kills any job running for more than 24 hours. However, as soon as such orphan processes are created, they immediately start consuming server resources and push CPU usage to 100% or more, and SAS then hangs. We are in discussions with the infra team about custom scripts that can identify user processes taking 100% CPU, but it looks difficult for them to identify each such process. They have monitoring for the server as a whole, but not for individual user processes. Any suggestions for monitoring such processes would help.

Kurt_Bremser · Posted 03-11-2022 07:08 AM

Activate server-side logging for the workspace server to see what kind of code runs when the issue happens. Maybe you can identify a common event that causes the abnormal CPU usage.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

SASKiwi · Posted 03-11-2022 07:52 PM

@sandeeppajni2 - Again I'd suggest engaging SAS Tech Support to help with further diagnosing. Perhaps you have a SAS setting or configuration issue that is contributing to this. If you are maxing out on CPU due to orphan SAS processes within a 24-hour period then I'd say something is drastically going wrong somewhere.

It would also be useful to run a series of tests with EG to identify what EG behaviours result in orphan processes. For example, closing EG normally versus abnormally, killing while programs are still running. Is EG the only source of the problem?

sandeeppajni2 · Posted 03-12-2022 02:50 PM

Hi @SASKiwi

We have observed that SAS behaves the same when run from the command line: if there is an error in the SAS program, orphan processes are still created in the background and consume server resources.

SASKiwi · Posted 03-12-2022 05:44 PM

@sandeeppajni2 - This is definitely not normal SAS behaviour. Is it possible you have the SAS system option ERRORABEND set on one grid node but not on another? I have no experience with SAS Grid, but I would imagine if you don't align SAS system options across all nodes that may cause problems. Again Tech Support is your best bet to diagnose what is causing this.

Kurt_Bremser · Posted 03-13-2022 12:57 AM

Please show us the error messages you find.

sandeeppajni2 · Posted 03-14-2022 01:11 PM

Hi @Kurt_Bremser

It's not an error that we see. We just see some sas sessions owned by end users taking 100% CPU or more on a particular compute node. After the Grid stops taking new load, we identify the faulty node by validating each node, and wherever we see connection problems we restart the Spawner services of that node's App server. This fixes the issue.

Kurt_Bremser · Posted 03-11-2022 02:36 AM

Start by educating your users to not simply shut down EG, but issue a "Cancel" for the submitted code.

If a program started from the command line goes into excessive CPU usage, it must have some faulty code in it that causes an infinite loop; this has to be corrected in the code. Any ERROR would simply cause the SAS session to go into syntax check mode and terminate rather quickly.

If you find a condition that causes an infinite loop without the SAS code being faulty, get in contact with SAS Technical Support, as this would be a bug in the SAS software itself.

I have introduced a measure against excessive usage by users. In the WorkspaceServer_usermods.sh shell script, I added code that checks for the number of processes already active for a given user. If this exceeds a threshold, the script exits right there. This means that users with crashed sessions have to contact the SAS admin who can take care of the orphaned processes.
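
A check along these lines might look like the sketch below. It is not the actual code from my script, only an illustration under assumptions: the limit `MAX_SAS_PROCS`, the helper name `too_many_sessions`, and the message text are all hypothetical, and it assumes `pgrep` from procps-ng is available and that SAS sessions run with the command name `sas`. Here a new session is refused once the user already has the maximum number of sessions running.

```shell
#!/bin/bash
# Fragment intended for WorkspaceServer_usermods.sh (sketch only).
MAX_SAS_PROCS=5   # hypothetical per-user limit; tune for your site

# Succeeds (returns 0) when the count ($1) has reached the limit ($2).
too_many_sessions() {
    [ "$1" -ge "$2" ]
}

# Count this user's processes whose command name is exactly "sas".
current=$(pgrep -c -u "$(id -un)" -x sas 2>/dev/null)

if too_many_sessions "${current:-0}" "$MAX_SAS_PROCS"; then
    echo "Too many SAS processes for $(id -un); contact the SAS admin." >&2
    exit 1
fi
```

Keeping the comparison in its own small function makes it easy to try the threshold logic with made-up numbers before wiring it into the server start script.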

As a SAS admin, it is part of your job to keep an eye on the state of your server(s). You might want to have a script running on your servers which detects overload conditions and sends you an email or other notification.
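
As a starting point, a per-process check could be sketched like this. It is a rough sketch under assumptions: the function name `hot_sas_procs` and the 90% threshold are invented for illustration, and SAS sessions are assumed to run with the command name `sas`. Note that the `pcpu` value reported by `ps` is CPU time divided by elapsed time over the life of the process, so it highlights processes that have been busy for a long while rather than momentary spikes.

```shell
#!/bin/bash
# Report SAS processes at or above a CPU threshold, one line per process.
# Reads "pid user pcpu command" lines on stdin.
hot_sas_procs() {
    local cpu_limit=$1
    awk -v limit="$cpu_limit" \
        '$4 == "sas" && $3 + 0 >= limit + 0 { printf "%s pid=%s cpu=%s%%\n", $2, $1, $3 }'
}

report=$(ps -eo pid=,user=,pcpu=,comm= | hot_sas_procs 90)
if [ -n "$report" ]; then
    # Replace these echo lines with your site's notification method.
    echo "High-CPU SAS processes on $(hostname):"
    echo "$report"
fi
```

Run from cron on each compute node, this prints one line per offending process with its owner, which is the per-user view that whole-server monitoring does not give.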

Jlochoa · Posted 03-14-2022 11:12 AM

I created a bat file that uses PuTTY to log the SAS user into the Linux server and run a bash script that kills zombie SAS sessions belonging to that username, then runs the SAS cleanwork utility.

sandeeppajni2 · Posted 03-14-2022 01:08 PM

Hi @Jlochoa

Could you give a basic idea of how to scan a node for such zombie SAS sessions by username?

Jlochoa · Posted 03-24-2022 01:34 PM

There are two parts:
1) Create Batch File. User must have Putty app on PC. Batch File example below
2) Create Bash File on Linux Server running SAS WorkSpace Server

Batch Program
::FreeSoftwareServers
::Automated Opening of SSH Tunnel & Execute CMD on Remote Host
::https://superuser.com/questions/1278434/create-a-batch-file-or-shortcut-to-putty-ssh-that-opens-a-se...

@echo off
set puttydir="C:\Program Files\PuTTY"
set exe=plink.exe
::Profile must exist in PuTTY
set remotehost=Enter LINUX server running SAS Workspace Server
set remotecmd="/path/sas_zombie_killer_v2.sh"

cd /d %puttydir%

%exe% %remotehost% %remotecmd%
@echo on
::Test First Manually in CMD Prompt
::Note Remote Host does NOT have access to BashRC Alias's
::start "C:\Program Files\PuTTY\" plink.exe -ssh FileServer touch /tmp/testfile
::start "C:\Program Files\PuTTY\" plink.exe -ssh FileServer ~/script.sh

Bash SAS Zombie Killer Program saved in Linux server that users have access/permission to run.
#!/bin/bash
##############################################################################
# $Id:$
#
#
#
# Name : sas_zombie_killer_v1.sh
#
# Purpose : Kill zombie SAS sessions and delete orphan files using SAS Cleanwork utility
#
# Author : Jose Ochoa
#
#
# History :
#                                                            Change
# Date      Userid Comment                                   Code
# --------- ------ ----------------------------------------- ---------
# 21Jun2021 jochoa Initial script                            version 1
# 21Jun2021 jochoa Initial Added echo messages               version 2
###############################################################################
echo " "
echo " "
echo " "
echo " "
echo "SAS Zombies Killer Program Running"

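# Collect PIDs of this user's processes whose ps line mentions "sas"
pids=( $( ps -ef | grep "$USER" | grep sas | grep -v grep | awk '{print $2}') )
for pid in "${pids[@]}"; do
    # Skip this script's own PID (its file name also contains "sas")
    if [[ $pid != $$ ]]; then
        kill "$pid" &> /dev/null
    fi
done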

/SAS/SASHome/SASFoundation/9.4/utilities/bin/cleanwork /SAS/SASWORK -v &> /dev/null

echo " "
echo " "
echo "SAS Zombies killed and Orphan Files removed"
echo " "
echo " "
sleep 3
echo "Have A Nice Day!"
sleep 3

gwootton · Posted 03-14-2022 10:17 AM

Under normal operation when a Workspace Server has no clients the Object Spawner will terminate it. It sounds like these may be cases where users have submitted code triggering an infinite loop or something similar. I would agree that regularly restarting your environment can clear these out. You can also configure your grid with resource limits to prevent jobs from running longer than a given amount of time actively or idle. In that case the grid will terminate the job after it exceeds that value.

--
Greg Wootton | Principal Systems Technical Support Engineer

sandeeppajni2 · Posted 03-14-2022 01:07 PM

Hi @gwootton,

We are already monitoring and have OS-level scripts in place to automatically kill long-running jobs. However, our concern is why SAS Grid fails to reliably route new requests to another compute node. Our analysis shows that once orphan processes are created, they take most of that server's resources, and if new SAS connection requests keep coming, the SAS Grid algorithm fails and does not distribute jobs among the other compute nodes until we restart the Spawner on that particular node. As a result, end users report connection issues with the Workspace servers, and we then validate each node to identify the faulty one and restart its Spawner.

SAS creating orphan jobs/processes when a SAS program has some error, taking server resources
