Re: SAS Workspace: stop accepting client connections

MariaD · Posted 10-28-2018 02:45 PM

Hi folks,

We have SAS 9.4 TSM3 installed on a grid (Linux server). Our users uses SAS EG as primary client and some of our users uses SAS EM too. Last week, some strange behaviour on the server started. The server start normally, but after some time (in some case 2h, in other 6h, we don't find any pattern over the time), the server stop to accept new connections and freeze the existing ones. The only solutions is restart the services. As mentioned, in some case the server works fine for 3h, in other for 6h, etc.

In ObjectSpawner log, I find the following message:

2016-08-26T23:38:08,500 INFO [00000067] :sas - Attempt to connect to peer iom://mor-sasprd04.server:11013;;noredirect,classfactory=6A829631-BE4A-4CFF-82F6-B378705DF34C,user=,encr=SASProprietary,encrlvl=credentials/ for cluster Guide_CRM - Logical Pooled Workspace Server (A5RC0EOG.AW000023) failed with exception <?xml version="1.0" ?><Exceptions><Exception><SASMessage severity="Error">A connection refused usually indicates the target server is not active. Or, if active, it may be listening on a different port or service.</SASMessage></Exception><Exception><SASMessage severity="Error">The Bridge Protocol Engine Socket Access Method was unable to connect to host mor-sasprd04.server, port 11013.</SASMessage></Exception><Exception><SASMessage severity="Error">The TCP/IP tcpSockConnect() support routine failed with error 111 (The connection was refused.).</SASMessage></Exception></Exceptions>.

Searching over SAS Notes, I find the following notes:

When you encounter this problem, you must stop and restart the SAS Object Spawner to release the connections. However, over time as new SAS Pooled Workspace Servers are started and stopped many times, the sockets become slowly exhausted again.

You might be able to work around this problem by setting the Inactive process shutdown timeout to 0 for the SAS Pooled Workspace Server. Then, the SAS Pooled Workspace Servers will be started and not shut down so that only a few extra sockets are used.

Anyone has encounter similar problem? Anyone knows if the message found in SAS Notes apply to this problem?

Regards,

PaulHomes · Posted 10-29-2018 01:56 AM

If you have a look at the Load Balancing Error Messages section of the SAS® 9.4 Intelligence Platform: Application Server Administration Guide you will see that error message:

Attempt to connect to peer failed. Attempt to connect to peer (name) for cluster name (ID) failed with exception message.

Explanation: Load balancing was configured to use several machines to balance the client load, but was unable to connect to one of the other peers as defined in the cluster. Load Balancing will not be able to use that peer at this time. A peer can be another object spawner.

Resolution: This message might be seen during initialization due to the fact the other peers or object spawners have not initialized yet. Once they initialize, load balancing will be able to communicate with them. Ensure that the other peers have initialized.

I don't know exactly how you have your pooled workspace servers configured but I imagine you have multiple object spawners (where there is a peering relationship between them for sharing load balancing information). It sounds like this communications is broken. I would look for things like spawners not running, firewalls blocking spawners communicating with each other etc.

Is the server named mor-sasprd04.server in the error message up and running and the object spawner service running on it?

MariaD · Posted 10-29-2018 09:32 AM

Hi @PaulHomes,

Yes, the server is up and running, and the object spawner is running too (verified using "sas.servers status" script). As you mentioned, we have several ObjectSpawner created.

For example, in this exactly moment, nobody can't access SASApp from SAS EG. SAS EG open and validate the user in Metatadata without any problem, but when user tried to open SASApp connection, the application stay processing and nothing happen.

Any suggestion what is going on?

SASKiwi · Posted 10-29-2018 03:09 PM

If you are having problems diagnosing what is going wrong yourselves you would be better off raising a Tech Support track.

MariaD · Posted 10-29-2018 04:55 PM

Thanks. Today SAS environments works fine until we started the scheduling process. We use LSF. After 2h, appear the following message:

IntiPaths(): Cannot access job <xxxxx> spool directory /sasbin/Platform_Suite/Lsf/spool. Permission denied.

I verify the permission for /sasbin/Platform_Suite/Lsf/spool and is set to "rwxrwxr-x".

Any suggestion?

Regards,

SASKiwi · Posted 10-29-2018 07:35 PM

@MariaD - You are now running into Unix specific problems, and we run exclusively on Windows, so I'm not able to check anything for you. Again, I'd suggest using SAS Tech Support.

MariaD · Posted 10-30-2018 12:20 PM

Hi Folks,

I opened a tech support about this issue, but until now we don't have success to restore the normal operation.

Updating on what is going on. For sometimes the server working fine and it process normally. After a time, the server stop to accept new connection and now nobody can login using SASEG or SAS MC. The connection with metadata, to validate the user permissions, returns timeout. Although we were able to connect to metadata using SAS code running directly over the server.

After that, we restarted again SAS services and now the environment is working normal again. But we staying in some kind of looping, normal for a couple of hours and then everything again.

Any suggestion?

SASKiwi · Posted 10-30-2018 03:40 PM

As a general rule I would check the SAS server logs for errors and provide these back to SAS. If client connections are being refused, the SAS metadata server logs (On Windows, but should be similar on Unix: \SAS\Config\Lev1\SASMeta\MetadataServer\Logs) should be recording these. The errors may help guide you as to what is going wrong.

MariaD · Posted 10-30-2018 06:26 PM

Thanks, @SASKiwi. As I mentioned, we are working with SAS support but until now we don't be able to identify what is the problem. I'm share here to validate if somebody has encountered this kind of problem before.

On Metadata log there is no relevant messages or details.

I have a question, last week our country supposes to change to daylight saving time but the date was changed for two weeks laters. I understand SAS use the time from OS but what about the Java used for SAS (JRE)? There is any change this problem will related with time zone defined on Java?

Regards,

SASKiwi · Posted 10-30-2018 08:01 PM

@MariaD - Have you worked through the troubleshooting section of the Grid Computing doc?

https://documentation.sas.com/?docsetId=gridref&docsetTarget=n0e3evlxrsddxyn1azmin3rhmvnb.htm&docset...

We have never had any issues with switching to daylight saving or back, but then again we don't run the same OS as you. If the computer system clock is not correct then this is not a SAS problem and you should talk to your IT administrator.

MariaD · Posted 11-06-2018 08:31 AM

Hi @SASKiwi,

A new update on this topic. After we reviewed a lot of workspace logs, we find the following messages:

2018-11-01T09:39:20,294 DEBUG [00000008] 4:user - IOM LOGIC DOSMGetClassImage(): image-name=sasoptsv
2018-11-01T09:39:20,300 DEBUG [00000008] 4:user - IOM RETURN W_OK=DOSMGetClassImage(): componentLD=00002AE1EFB272E0
2018-11-01T09:39:20,415 DEBUG [00000008] 4:user - IOM LOGIC DOSMGetClassImage(): image-name=sasfsv
2018-11-01T09:39:20,418 DEBUG [00000008] 4:user - IOM RETURN W_OK=DOSMGetClassImage(): componentLD=00002AE1EFB27620
2018-11-01T11:39:20,934 INFO [00003239] 4:user - A connection reset indicates the peer has closed their connection to this session. This session will remove the peer from its set of active peers.
2018-11-01T11:39:20,934 INFO [00003239] 4:user - The Bridge Protocol Engine Socket Access Method was unable to read peer data.

...

2018-11-01T11:39:20,940 INFO [00000009] :user - NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
2018-11-01T11:39:20,940 INFO [00000009] :user - NOTE: The SAS System used (Total process time):
2018-11-01T11:39:20,940 INFO [00000009] :user - real time 2:00:02.12
2018-11-01T11:39:20,940 INFO [00000009] :user - cpu time 1.12 seconds

If you see, in this first case we have a gap of 2h without any process. Any idea why this happen?

In this other workspace server log, we have a similar situation:

2018-11-01T14:20:58,254 DEBUG [00000004] 4:user - IOM LOGIC VOIMapWrcodeToWoiStatus() Mapping error severity wrcode: 0x80403808
2018-11-01T14:20:58,255 DEBUG [00000004] 4:user - IOM LOGIC VOIMapWrcodeToWoiStatus() Mapping error severity wrcode: 0x80403808
2018-11-01T14:20:58,256 DEBUG [00000004] 4:user - IOM LOGIC VOIMapWrcodeToWoiStatus() Mapping error severity wrcode: 0x80403808
2018-11-01T14:20:58,258 DEBUG [00000004] 4:user - IOM LOGIC VOIMapWrcodeToWoiStatus() Mapping error severity wrcode: 0x80403808
2018-11-01T14:20:58,263 DEBUG [00000004] 4:user - IOM LOGIC VOIMapWrcodeToWoiStatus() Mapping error severity wrcode: 0x80403808
2018-11-01T14:20:58,286 DEBUG [00000004] 4:user - IOM LOGIC VOIMapWrcodeToWoiStatus() Mapping error severity wrcode: 0x80403808
2018-11-01T14:20:58,294 DEBUG [00000004] 4:user - IOM LOGIC VOIMapWrcodeToWoiStatus() Mapping error severity wrcode: 0x80403808
2018-11-01T14:20:58,296 DEBUG [00000004] 4:user - IOM LOGIC VOIMapWrcodeToWoiStatus() Mapping error severity wrcode: 0x80403808
2018-11-01T14:20:58,320 DEBUG [00000004] 4:user - IOM LOGIC VOIMapWrcodeToWoiStatus() Mapping error severity wrcode: 0x80403808
2018-11-01T14:20:58,332 DEBUG [00000004] 4:user - IOM LOGIC VOIMapWrcodeToWoiStatus() Mapping error severity wrcode: 0x80403808
2018-11-01T14:20:58,494 DEBUG [00000012] 4:user - IOM LOGIC DOSMGetClassImage(): image-name=sasbins
2018-11-01T14:20:58,495 DEBUG [00000012] 4:user - IOM RETURN W_OK=DOSMGetClassImage(): componentLD=00002B0027F56840
2018-11-01T14:31:29,136 INFO [00000009] :user - 20 The SAS System 14:18 Thursday, November 1, 2018
2018-11-01T14:31:29,136 INFO [00000009] :user -
2018-11-01T14:31:29,136 INFO [00000009] :user - NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
2018-11-01T14:31:29,136 INFO [00000009] :user - NOTE: The SAS System used (Total process time):
2018-11-01T14:31:29,136 INFO [00000009] :user - real time 12:37.75
2018-11-01T14:31:29,136 INFO [00000009] :user - cpu time 1:21.73

In this second case, we have a gap of 12 minutos and a message about "Mapping error severity wrcode". Any idea?

Regards,

SASKiwi · Posted 11-06-2018 02:17 PM

I suggest you track this to SAS Tech Support if you haven't yet done so. They are in the best position to help.

MariaD · Posted 11-06-2018 02:43 PM

Thanks @SASKiwi, we have an open ticket but until now we don’t have any conclusions...

nhvdwalt · Posted 11-07-2018 12:43 AM

See if you have a large number of TCP/IP connections sitting in a CLOSE_WAIT state

MariaD · Posted 11-07-2018 10:58 AM

Hi @nhvdwalt,

Thanks! In this moment we have 132 SAS's session executing in our two nodes. The server is stable and we have 62 TCP/IP with CLOSE_WAIT status in one of our node and 77 in other. There is any way to force the closing?

Regards,