nhvdwalt
Barite | Level 11

CLOSE_WAIT is a normal state and it's OK to have some of them. If you have thousands of them, though, it becomes a problem.

 

You could maybe script the CLOSE_WAIT monitoring and see how many you have when you next run into this issue. Unfortunately you'll have to wait and see what the system state is at the next occurrence.
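
Something quick and dirty along these lines would do it (the log path and 60-second interval below are just examples, adjust to taste):

# Count CLOSE_WAIT sockets at a fixed interval and append to a log (example path/interval)
while true; do
    echo "$(date -u)  CLOSE_WAIT: $(netstat -an | grep -c CLOSE_WAIT)" >> /tmp/close_wait.log
    sleep 60
done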

MariaD
Barite | Level 11

Thanks @nhvdwalt. Today the server went down again; the two nodes became busy and the server did not accept new connections. After reviewing all the workspace logs around the time the server went down, I found the following message:

 

2018-11-07T14:23:38,535 INFO [00000012] 4:user - NOTE: Data file LIB.TABLE.DATA is in a format that is native to another host, or the file encoding does not match
2018-11-07T14:23:38,535 INFO [00000012] 4:user - the session encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might
2018-11-07T14:23:38,535 INFO [00000012] 4:user - reduce performance.

 

This message was written 5 minutes before the nodes became busy. Do you think this could be the cause of the problem?

 

Regards,

nhvdwalt
Barite | Level 11

No, Cross Environment Data Access (CEDA) is something else. We can discuss it in a different post since it's not related to this issue.

 

I once had the same issue with an Object Spawner on Grid. Let me just go through my old notes... There was a hotfix that had to be applied. Let me check...

MariaD
Barite | Level 11

Thanks all! We are still working on it. As mentioned before, we have opened a Tech Support track, but so far we have not found a solution. So I'm sharing an update here in the hope that someone has gone through a similar situation.

 

A brief description of the problem: we have a grid environment on Linux with 2 compute servers. For the past 3 weeks, with no pattern identified, the nodes become busy and SAS no longer accepts new workspace connections. The only solution is to restart the services.

 

Yesterday we tried to kill (using the kill -9 command) some SAS processes to release CPU, but after killing them the CPU was not released. Again, the only solution was to restart the SAS services. Any idea or suggestion on how to identify the root cause?

 

 

nhvdwalt
Barite | Level 11

Hi @MariaD

 

Please PM me the Track #

JuanS_OCS
Azurite | Level 17

Dear @MariaD, @nhvdwalt,

 

Let me chime in, and my apologies for doing it this late in the discussion. @MariaD, I see you are in great hands with @nhvdwalt 🙂

 

In other GRID environments I have had similar kinds of issues, with the same way to "solve" them, so I will share my experiences in case they might help.

 

First, I had to understand exactly what was going on. So I got extra help by onboarding a SAS process monitoring tool, Enterprise Session Monitor (ESM) from Boemska ( @boemskats I summon you). This tool helped me understand what was happening.

 

In 100% of the cases, it was something related to the management of pooled sessions: Stored Process servers, Pooled Workspace servers ...

 

So the GRID or non-GRID configuration has some limits, based on:

  • the settings on the SAS Servers (STP, PWKS), such as the number of sessions, the duration of the sessions, etc.
  • how those sessions are actually used: heavy and long-lasting, using more memory, etc.
  • the Object Spawner's own memory settings
  • and the hardware limits (hopefully these are not reached)

90% of the time the problem is solved just by:

- knowing how those SAS Server sessions are actually used, with historical or real-time data

- getting delta numbers to ensure the limits are not reached: the ones your environment can handle by configuration, and the ones your environment actually handles by usage... because if the limits are reached, the Object Spawner will break, and then you need to restart the GRID clustered Object Spawners, which "solves" the problem... until the limit is reached again

- adjusting those numbers

 

Normally, by extending the pools and the session/memory settings, you would get your problem magically solved.

 

Beware, though: by doing so, your SAS sessions will consume more physical RAM, virtual (disk) memory, and CPU, so please monitor and try to forecast whether you might need additional RAM, CPUs, or GRID nodes in your GRID environment for the coming period.
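
If you do not have a monitoring tool in place yet, even a simple snapshot loop gives you something to trend against (the log path and 5-minute interval below are just examples):

# Append a memory/CPU snapshot every 5 minutes for capacity trending (example path/interval)
while true; do
    echo "=== $(date -u) ==="
    free -m                  # physical and swap memory usage
    vmstat 1 2 | tail -1     # current CPU/memory counters (second sample, after the boot-time averages)
    sleep 300
done >> /tmp/capacity_trend.log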

 

Some links that might help (they did help me a great deal):

- Understanding the Client-side Pooling Connection Process

- Understanding the Server-side Pooling Connection Process

- Choices in Workspace Server Pooling

- Configure Client-side Pooling Properties for Each Server (to know what to change and where)

- Boemska's Enterprise Session Monitor ( to analyse your SAS processes )

 

Hope it helps,

 

Best regards,

Juan

 

 

 

 

nhvdwalt
Barite | Level 11

Thanks @JuanS_OCS, very valuable pointers indeed.

 

Hi @MariaD

 

As the user that runs the Object Spawner (typically sas), please run the below on your server and provide the output:

 

ulimit -a
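
If you are logged in as root or another account, something like this should give the same result (assuming the spawner really does run under the sas account; substitute yours if it differs):

su - sas -c "ulimit -a"    # 'sas' is an assumption - use the actual Object Spawner account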

 

 

MariaD
Barite | Level 11

Hi @nhvdwalt ,

 

I understand that I need to execute the ulimit command on my two compute servers. Is that correct?

 

Regards,

nhvdwalt
Barite | Level 11

Hi @MariaD

 

Yes, ulimit -a
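
If it helps, something like this collects the output from both compute nodes in one go (the node names are placeholders, and it assumes the spawner account is sas and reachable over ssh):

for node in compute01 compute02; do     # placeholder hostnames - use your real node names
    echo "== $node =="
    ssh sas@"$node" 'ulimit -a'         # 'sas' assumed to be the Object Spawner account
done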

MariaD
Barite | Level 11

Hi @JuanS_OCS,

 

I'll review all the links you sent me. As an extra comment, we enabled the fullstimer option. We reviewed all the processes that were running when SAS stopped, and none used more memory than was available (no swap usage) or had particularly high CPU usage.

 

Regards,

boemskats
Lapis Lazuli | Level 10

Thanks for the summon @JuanS_OCS 🙂

 

ESM is useful for this kind of thing, but running out of allocated open file pointers can be a tough one to figure out - especially as the kernel doesn't really keep track of it in any useful way.

 

I've got a very minimal script I use sometimes to diagnose this stuff. You can do this:

 

1. Save the following script as a file on your server - maybe under your sas user's home directory. Call it something like handleMonitor.sh

 

#!/bin/bash
# Usage: handleMonitor.sh <logfile> <interval-seconds> [extra lsof arguments]
# $1 = target logfile, $2 = logging interval in seconds, $3 = optional extra arguments passed to lsof

while true; do
    # Timestamp each snapshot
    echo "$(date -u) =========================" >> "$1"
    # Open file handles per process: handle count, command, pid and user, sorted descending
    lsof $3 | tr -s ' ' | cut -d " " -f1-3 | sort | uniq -c | sort -rn >> "$1"
    sleep "$2"
done

 

2. Make it executable:

 

[nik@edge ~]$ chmod +x handleMonitor.sh

 

3. Let it run throughout the day, like this:

 

[nik@edge ~]$ nohup ./handleMonitor.sh myOpenFileLog.log 300 &

 

In this example, myOpenFileLog.log is your target logfile (make sure it's in a location that your executing user is able to write to; if you're in your user's home dir, that should be fine) and 300 is your logging interval (every 300 secs / 5 mins). An optional third argument is passed straight through to lsof if you want to narrow it down.

 

 

4. Have a look at myOpenFileLog.log; you'll get output like this, repeating at the interval set above:

 

 

Mon Nov 12 13:14:14 UTC 2018 =========================
41 postgres 22012 esmuser
40 postgres 22018 esmuser
40 postgres 22017 esmuser
40 postgres 22015 esmuser
39 postgres 22016 esmuser
36 postgres 22019 esmuser
36 postgres 22013 esmuser
24 tmux 5718 esmuser
19 tmux 13376 esmuser
17 lsof 17544 esmuser
14 bash 5740 nik
14 bash 16866 esmuser
14 bash 13255 nik
14 bash 13230 esmuser
13 ta 13370 esmuser
13 handleMon 17522 esmuser
12 lsof 17550 esmuser
11 sort 17549 esmuser
11 sort 17547 esmuser
10 cut 17546 esmuser

 

 

This will tell you the count of file handles open per process in the first column, and then give you the process's command, pid and user in the remaining columns. If there's a rogue pid somewhere eating all your file handles, it should be obvious here as a standalone pid with a high number in the first column. Otherwise, if there are simply too many concurrent processes running under the same user and hitting their limit, that should also show up as multiple processes with a moderate count. It's an extension of what @nhvdwalt suggests - I think he may be on the right track.
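
If a particular pid does stand out, you can also compare its current handle count against the limit it was started with (replace <pid> with the actual process id; you'll need to be root or the owning user):

ls /proc/<pid>/fd | wc -l                    # file handles currently open by that process
grep "Max open files" /proc/<pid>/limits     # the soft/hard limit that process inherited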

 

I hope this helps you with your immediate problem. It does feel like you are running into challenges around managing the user generated workload on your SAS environment, especially if the number of EG users on your environment is growing. This is where ESM, the product Juan describes in his post, can help a lot. If you're interested in seeing how, please drop me a line and we can arrange a demo.

 

 

Nik

JuanS_OCS
Azurite | Level 17

Hi @MariaD,

 

thanks for that.

 

I was not referring, per se, to just using too much RAM or CPU; my apologies for my poor explanation. Please let me explain further.


The SAS Stored Process Server and SAS Pooled Workspace Server work with pools of connections, right? With a limited size. The pool can also handle more sessions than its size, because the extra requests will be put in a queue. But this queue has limits too.

If the limit is reached, you will basically see that no more sessions of that kind of SAS server can be launched. But it could also be that the Object Spawner cannot launch more sessions of any kind, affecting other SAS server sessions as well.

 

And this is what could happen, for example, if you have long STP or PWKS sessions: you have a pool of, say, 4 or 8 simultaneous SAS jobs. If more jobs keep coming in while the running ones have not finished, and you push this queue hard enough, the spawner will run into trouble.

 

Also, the ulimits are worth checking, indeed. But since you are telling me only about pooled sessions, I would guess ulimits won't be the only thing to fine-tune; extending the pool sizes/settings will matter as well.

 

A final note: if you change the pool settings, you will need to restart the Object Spawners to ensure the new settings from metadata take effect.
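
On Linux that is usually just a matter of running the spawner script from the configuration directory on each node; the path below is the typical Lev1 layout, so adjust it to your own deployment:

# <SAS-config-dir> and Lev1 are placeholders - use your actual configuration directory and level
<SAS-config-dir>/Lev1/ObjectSpawner/ObjectSpawner.sh restart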
