Solved: Re: Gemfire timeout problems - Distributed Visual Analytics

miki7 · Posted 06-29-2017 02:23 AM

Dear SAS Communities,

as a SAS partner, we have a client with a distributed Visual Analytics solution installed.

In this solution we have these servers:

1 compute node

2 middle-tiers

1 metadata

In the last few weeks, we have started having problems with the GemFire locator failing, which brings the web applications down completely; the login screen then shows an error message ending with "Please contact your administrator for assistance".

We believe the problem is in the communication between middle-tier 1 and middle-tier 2, where all web applications, including VA and SAS Studio, run.

My main question is:

What do you think about increasing the timeout value for GemFire communication between the servers? Could 5 seconds be too short? Could increasing it to 1 minute or more cause problems?

Here's a short sample from gemfire.log on one of the mid-tiers (this one is from mid_tier02, but mid_tier01 has nearly the same log, just with the server names swapped):

[info 2017/06/28 21:12:10.310 CEST <VERIFY_SUSPECT.TimerThread> tid=0x55] No suspect verification response received from %MID_TIER01%(54776)<v3>:37678 in 5989 milliseconds: I believe it is gone.

[info 2017/06/28 21:12:11.312 CEST <UDP ucast receiver> tid=0x1e] Member %MID_TIER01%(50790)<v1>:50358 is no longer suspect

[info 2017/06/28 21:12:11.312 CEST <UDP ucast receiver> tid=0x1e] failure detection received notification that %MID_TIER01%(50790)<v1>:50358 is no longer suspect

[info 2017/06/28 21:12:12.315 CEST <UDP ucast receiver> tid=0x1e] Member %MID_TIER01%(50790)<v1>:50358 is no longer suspect

[info 2017/06/28 21:12:12.315 CEST <UDP ucast receiver> tid=0x1e] failure detection received notification that %MID_TIER01%(50790)<v1>:50358 is no longer suspect

[info 2017/06/28 21:12:13.317 CEST <UDP ucast receiver> tid=0x1e] Member %MID_TIER01%(54776)<v3>:37678 is no longer suspect

[info 2017/06/28 21:12:13.317 CEST <UDP ucast receiver> tid=0x1e] failure detection received notification that %MID_TIER01%(54776)<v3>:37678 is no longer suspect

[info 2017/06/28 21:12:14.319 CEST <UDP ucast receiver> tid=0x1e] Member %MID_TIER01%(50790)<v1>:50358 is no longer suspect

[info 2017/06/28 21:12:14.319 CEST <UDP ucast receiver> tid=0x1e] failure detection received notification that %MID_TIER01%(50790)<v1>:50358 is no longer suspect

[info 2017/06/28 21:12:15.031 CEST <FD_SOCK Ping thread> tid=0x24] suspecting member %MID_TIER02%(63477)<v6>:40788

[info 2017/06/28 21:12:15.031 CEST <UDP Incoming Message Handler> tid=0x1d] Received Suspect notification for member(s) [%MID_TIER02%(60274)<v5>:21007] from %MID_TIER02%(59925)<v4>:7935.

[info 2017/06/28 21:12:15.180 CEST <ViewHandler> tid=0x44] Membership: sending new view [[%MID_TIER02%(59925)<v4>:7935|19] [%MID_TIER02%(59925)<v4>:7935/56543, %MID_TIER02%(63477)<v6>:40788/56113, %MID_TIER02%(64709)<v7>:55906/49673]] (3 mbrs)

[info 2017/06/28 21:12:15.321 CEST <UDP ucast receiver> tid=0x1e] Member %MID_TIER01%(50790)<v1>:50358 is no longer suspect

[info 2017/06/28 21:12:15.322 CEST <UDP ucast receiver> tid=0x1e] failure detection received notification that %MID_TIER01%(50790)<v1>:50358 is no longer suspect

[info 2017/06/28 21:12:15.322 CEST <UDP ucast receiver> tid=0x1e] Member %MID_TIER01%(54776)<v3>:37678 is no longer suspect

[info 2017/06/28 21:12:15.323 CEST <UDP ucast receiver> tid=0x1e] failure detection received notification that %MID_TIER01%(54776)<v3>:37678 is no longer suspect

[info 2017/06/28 21:12:15.323 CEST <UDP Incoming Message Handler> tid=0x1d] Received Suspect notification for member(s) [%MID_TIER02%(60274)<v5>:21007, %MID_TIER02%(63477)<v6>:40788] from %MID_TIER02%(59925)<v4>:7935.

[info 2017/06/28 21:12:15.324 CEST <UDP Incoming Message Handler> tid=0x1d] Membership: received new view [%MID_TIER02%(59925)<v4>:7935|19] [%MID_TIER02%(59925)<v4>:7935/56543, %MID_TIER02%(63477)<v6>:40788/56113, %MID_TIER02%(64709)<v7>:55906/49673]

Thank you very much and have a nice day!

Michal

alexal · Posted 06-29-2017 03:17 AM

@miki7,

Yes, you can increase the timeout value. Stop all the mid-tier services and the SAS Cache Locator, then edit /<SASConfig>/Lev<X>/Web/gemfire/instances/ins_41415/gemfire-start-locator-sas.sh and add the following JVM parameters to the JAVA_ARGS="" value:

-Dgemfire.conserve-sockets=false -Dgemfire.member-timeout=30000
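
For illustration, the edited line in gemfire-start-locator-sas.sh might end up looking roughly like the sketch below. This assumes JAVA_ARGS was empty beforehand; if your script already has options inside the quotes, keep them and append the two new ones. The value is in milliseconds, so 30000 is 30 seconds. The sanity check at the end is not part of the SAS script, it is only a quick way to confirm what was set.

```shell
# Sketch only: assumes JAVA_ARGS="" was empty before the edit.
# Keep any options your script already has inside the quotes.
JAVA_ARGS="-Dgemfire.conserve-sockets=false -Dgemfire.member-timeout=30000"

# Optional sanity check (not part of the SAS script): pull the timeout
# out of JAVA_ARGS and print it in seconds.
timeout_ms=$(printf '%s\n' "$JAVA_ARGS" |
  sed -n 's/.*-Dgemfire\.member-timeout=\([0-9][0-9]*\).*/\1/p')
echo "member-timeout: $((timeout_ms / 1000)) s"
```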

Also, it's worth adjusting -Dgemfire.member-timeout in /<SASConfig>/Lev<X>/Web/WebAppServer/SASServer1_1/bin/setenv.sh (JVM_OPTS):

-Dgemfire.member-timeout=30000
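
If you would rather not risk adding the option twice when setenv.sh is edited again later, a small guard along these lines could be used. This is only a sketch: the helper add_jvm_opt is a hypothetical name, and the starting value -Xmx4096m is a placeholder, not what your setenv.sh actually contains.

```shell
# Hypothetical helper: print the option string with OPT appended,
# unless OPT is already present as a separate word.
add_jvm_opt() {
  current="$1"
  opt="$2"
  case " $current " in
    *" $opt "*) printf '%s\n' "$current" ;;               # already there
    *)          printf '%s\n' "${current:+$current }$opt" ;;
  esac
}

JVM_OPTS="-Xmx4096m"   # placeholder for whatever setenv.sh already sets
JVM_OPTS="$(add_jvm_opt "$JVM_OPTS" "-Dgemfire.member-timeout=30000")"
# Calling it a second time leaves the value unchanged.
JVM_OPTS="$(add_jvm_opt "$JVM_OPTS" "-Dgemfire.member-timeout=30000")"
echo "$JVM_OPTS"
```

In most cases simply appending the option by hand to the existing JVM_OPTS value is enough; the guard only helps if the file is touched repeatedly or by a script.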

Start all the midtier services after these changes.

