nhvdwalt
Barite | Level 11

CLOSE_WAIT is a normal state and it's OK to have some of them. If you have thousands of them, though, it becomes a problem.

 

You could maybe script the CLOSE_WAIT monitoring and see how many you have when you next run into this issue. Unfortunately you'll have to wait and see what the system state is at the next occurrence.
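
Something quick and dirty along these lines would do it (the log path and 60-second interval below are just examples, adjust to taste):

# Count CLOSE_WAIT sockets at a fixed interval and append to a log (example path/interval)
while true; do
    echo "$(date -u)  CLOSE_WAIT: $(netstat -an | grep -c CLOSE_WAIT)" >> /tmp/close_wait.log
    sleep 60
done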

MariaD
Barite | Level 11

Thanks @nhvdwalt. Today the server went down again; the two nodes became busy and the server did not accept new connections. After reviewing all the workspace logs around the time the server went down, I found the following message:

 

2018-11-07T14:23:38,535 INFO [00000012] 4:user - NOTE: Data file LIB.TABLE.DATA is in a format that is native to another host, or the file encoding does not match
2018-11-07T14:23:38,535 INFO [00000012] 4:user - the session encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might
2018-11-07T14:23:38,535 INFO [00000012] 4:user - reduce performance.

 

This message was written 5 minutes before the nodes became busy. Do you think this could be the cause of the problem?

 

Regards,

nhvdwalt
Barite | Level 11

No, Cross Environment Data Access (CEDA) is something else. We can discuss it in a different post since it's not related to this issue.

 

I once had the same issue with an Object Spawner on Grid. Let me just go through my old notes... There was a hotfix that had to be applied. Let me check...

MariaD
Barite | Level 11

Thanks all! We are still working on it. As mentioned before, we have opened a Tech Support track, but so far we have not found a solution. So I'm sharing an update here in the hope that someone has gone through a similar situation.

 

A brief description of the problem: we have a grid environment on Linux with 2 compute servers. For the past 3 weeks, with no pattern identified, the nodes become busy and SAS no longer accepts new workspace connections. The only solution is to restart the services.

 

Yesterday we tried to kill (using the kill -9 command) some SAS processes to release CPU, but after killing them the CPU was not released. Again, the only solution was to restart the SAS services. Any idea or suggestion on how to identify the root cause?

 

 

nhvdwalt
Barite | Level 11

Hi @MariaD

 

Please PM me the Track #

JuanS_OCS
Azurite | Level 17

Dear @MariaD, @nhvdwalt,

 

Let me chime in, and my apologies for doing it this late in the discussion. @MariaD, I see you are in great hands with @nhvdwalt 🙂

 

In other GRID environments I have had similar kinds of issues, with the same way to "solve" them, so I will share my experiences in case they might help.

 

First, I had to understand exactly what was going on. So I got extra help by onboarding a SAS process monitoring tool, Enterprise Session Monitor (ESM) from Boemska ( @boemskats I summon you). This tool helped me understand what was happening.

 

In 100% of the cases, it was something related to the management of pooled sessions: Stored Process servers, Pooled Workspace servers ...

 

So the GRID or non-GRID configuration has some limits, based on:

  • the settings on the SAS Servers (STP, PWKS), such as the number of sessions, the duration of the sessions, etc.
  • how those sessions are actually used: heavy and long-lasting, using more memory, etc.
  • the Object Spawner's own memory settings
  • and the hardware limits (hopefully these are not reached)

90% of the time the problem is solved just by:

- knowing how those SAS Server sessions are actually used, with historical or real-time data

- getting delta numbers to ensure the limits are not reached: the ones your environment can handle by configuration, and the ones your environment actually handles by usage... because if the limits are reached, the Object Spawner will break, and then you need to restart the GRID clustered Object Spawners, which "solves" the problem... until the limit is reached again

- adjusting those numbers

 

Normally, by extending the pools and the session/memory settings, you would get your problem magically solved.

 

Beware, though: by doing so, your SAS sessions will consume more physical RAM, virtual (disk) memory, and CPU, so please monitor and try to forecast whether you might need additional RAM, CPUs, or GRID nodes in your GRID environment for the coming period.
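
If you do not have a monitoring tool in place yet, even a simple snapshot loop gives you something to trend against (the log path and 5-minute interval below are just examples):

# Append a memory/CPU snapshot every 5 minutes for capacity trending (example path/interval)
while true; do
    echo "=== $(date -u) ==="
    free -m                  # physical and swap memory usage
    vmstat 1 2 | tail -1     # current CPU/memory counters (second sample, after the boot-time averages)
    sleep 300
done >> /tmp/capacity_trend.log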

 

Some links that might help (they did help me a great deal):

- Understanding the Client-side Pooling Connection Process

- Understanding the Server-side Pooling Connection Process

- Choices in Workspace Server Pooling

- Configure Client-side Pooling Properties for Each Server (to know what to change and where)

- Boemska's Enterprise Session Monitor ( to analyse your SAS processes )

 

Hope it helps,

 

Best regards,

Juan

 

 

 

 

nhvdwalt
Barite | Level 11

Thanks @JuanS_OCS, very valuable pointers indeed.

 

Hi @MariaD

 

As the user that runs the Object Spawner (typically sas), please run the below on your server and provide the output:

 

ulimit -a
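
If you are logged in as root or another account, something like this should give the same result (assuming the spawner really does run under the sas account; substitute yours if it differs):

su - sas -c "ulimit -a"    # 'sas' is an assumption - use the actual Object Spawner account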

 

 

MariaD
Barite | Level 11

Hi @nhvdwalt ,

 

I understand that I need to execute the ulimit command on my two compute servers. Is that correct?

 

Regards,

nhvdwalt
Barite | Level 11

Hi @MariaD

 

Yes, ulimit -a
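
If it helps, something like this collects the output from both compute nodes in one go (the node names are placeholders, and it assumes the spawner account is sas and reachable over ssh):

for node in compute01 compute02; do     # placeholder hostnames - use your real node names
    echo "== $node =="
    ssh sas@"$node" 'ulimit -a'         # 'sas' assumed to be the Object Spawner account
done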

MariaD
Barite | Level 11

Hi @JuanS_OCS,

 

I'll review all the links you sent me. As an extra comment, we enabled the fullstimer option. We reviewed all the processes that were running when SAS stopped, and none used more memory than was available (no swap usage) or had particularly high CPU usage.

 

Regards,

boemskats
Lapis Lazuli | Level 10

Thanks for the summon @JuanS_OCS 🙂

 

ESM is useful for this kind of thing, but running out of allocated open file pointers can be a tough one to figure out - especially as the kernel doesn't really keep track of it in any useful way.

 

I've got a very minimal script I use sometimes to diagnose this stuff. You can do this:

 

1. Save the following script as a file on your server - maybe under your sas user's home directory. Call it something like handleMonitor.sh

 

#!/bin/bash
# Usage: handleMonitor.sh <logfile> <interval-seconds> [extra lsof arguments]
# $1 = target logfile, $2 = logging interval in seconds, $3 = optional extra arguments passed to lsof

while true; do
    # Timestamp each snapshot
    echo "$(date -u) =========================" >> "$1"
    # Open file handles per process: handle count, command, pid and user, sorted descending
    lsof $3 | tr -s ' ' | cut -d " " -f1-3 | sort | uniq -c | sort -rn >> "$1"
    sleep "$2"
done

 

2. Make it executable:

 

[nik@edge ~]$ chmod +x handleMonitor.sh

 

3. Let it run throughout the day, like this:

 

[nik@edge ~]$ nohup ./handleMonitor.sh myOpenFileLog.log 300 &

 

In this example, myOpenFileLog.log is your target logfile (make sure it's in a location that your executing user is able to write to; if you're in your user's home dir, that should be fine) and 300 is your logging interval (every 300 secs / 5 mins). An optional third argument is passed straight through to lsof if you want to narrow it down.

 

 

4. Have a look at myOpenFileLog.log; you'll get output like this, repeating at the interval set above:

 

 

Mon Nov 12 13:14:14 UTC 2018 =========================
41 postgres 22012 esmuser
40 postgres 22018 esmuser
40 postgres 22017 esmuser
40 postgres 22015 esmuser
39 postgres 22016 esmuser
36 postgres 22019 esmuser
36 postgres 22013 esmuser
24 tmux 5718 esmuser
19 tmux 13376 esmuser
17 lsof 17544 esmuser
14 bash 5740 nik
14 bash 16866 esmuser
14 bash 13255 nik
14 bash 13230 esmuser
13 ta 13370 esmuser
13 handleMon 17522 esmuser
12 lsof 17550 esmuser
11 sort 17549 esmuser
11 sort 17547 esmuser
10 cut 17546 esmuser

 

 

This will tell you the count of file handles open per process in the first column, and then give you the process's command, pid and user in the remaining columns. If there's a rogue pid somewhere eating all your file handles, it should be obvious here as a standalone pid with a high number in the first column. Otherwise, if there are simply too many concurrent processes running under the same user and hitting their limit, that should also show up as multiple processes with a moderate count. It's an extension of what @nhvdwalt suggests - I think he may be on the right track.
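
If a particular pid does stand out, you can also compare its current handle count against the limit it was started with (replace <pid> with the actual process id; you'll need to be root or the owning user):

ls /proc/<pid>/fd | wc -l                    # file handles currently open by that process
grep "Max open files" /proc/<pid>/limits     # the soft/hard limit that process inherited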

 

I hope this helps you with your immediate problem. It does feel like you are running into challenges around managing the user generated workload on your SAS environment, especially if the number of EG users on your environment is growing. This is where ESM, the product Juan describes in his post, can help a lot. If you're interested in seeing how, please drop me a line and we can arrange a demo.

 

 

Nik

JuanS_OCS
Azurite | Level 17

Hi @MariaD,

 

thanks for that.

 

I was not referring, per se, to just using too much RAM or CPU; my apologies for my poor explanation. Please let me explain further.


The SAS Stored Process Server and SAS Pooled Workspace Server work with pools of connections, right? With a limited size. The pool can also handle more sessions than its size, because the extra requests will be put in a queue. But this queue has limits too.

If the limit is reached, you will basically see that no more sessions of that kind of SAS server can be launched. But it could also be that the Object Spawner cannot launch more sessions of any kind, affecting other SAS server sessions as well.

 

And this is what could happen, for example, if you have long STP or PWKS sessions: you have a pool of, say, 4 or 8 simultaneous SAS jobs. If more jobs keep coming in while the running ones have not finished, and you push this queue hard enough, the spawner will run into trouble.

 

Also, the ulimits are worth checking, indeed. But since you are telling me only about pooled sessions, I would guess ulimits won't be the only thing to fine-tune; extending the pool sizes/settings will matter as well.

 

A final note: if you change the pool settings, you will need to restart the Object Spawners to ensure the new settings from metadata take effect.
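
On Linux that is usually just a matter of running the spawner script from the configuration directory on each node; the path below is the typical Lev1 layout, so adjust it to your own deployment:

# <SAS-config-dir> and Lev1 are placeholders - use your actual configuration directory and level
<SAS-config-dir>/Lev1/ObjectSpawner/ObjectSpawner.sh restart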
