How do I limit the amount of memory used by GemFire on SASServer1_1, or clear cache data on the SASServer1_1 JVM, so I don't have to allocate 24GB of memory for SASServer1_1?
Here is what we have tested and tried:
To keep our SAS 9.4 running on Windows 2012 without restarting, we had to set the SASServer1_1 JVM -Xmx to 24GB. We were able to verify that the issue we are having is the same as the one described in the links below:
=> 4. Memory settings on the SASServer1_1 (WIP, SASLogon, all the central and shared apps), therefore it may become a bottleneck...And same document and recommendations as above. I have seen sometimes required to extend Xms and/or Xmx to around 20 GB temporary, but this should happen only under SAS' approval.
=> ...it is recommended by SAS Technical Support to increase a bit the -Xmx value, from 4 GB to even 20 GB in some cases.
We have cleared out the SAS_Audit logs as described in Problem Note 58589 and have managed our WIP database size. In addition, we have managed the logs as recommended. However, we are still seeing the SASServer1_1 JVM using up most of the 24GB of memory allocated.
We finally decided to take a SASServer1_1 JVM heap dump to identify the source of the SASServer1_1 JVM memory usage. We followed the steps from Problem Note 61370. We used the SAS Heap Dump Tool to analyze the heap dump file taken when the SASServer1_1 JVM was out of memory. It identified that 70% of the JVM heap was used by Cache Locator / GemFire. I have also verified this using other JVM tools like JMap and Eclipse MAT. We have done this a few times, and every time GemFire is what is using up the SASServer1_1 JVM heap.
The exact class identified as using 60% (or 14GB) of the SASServer1_1 JVM heap is "com/gemstone/gemfire/internal/cache/..."
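For anyone who wants to cross-check this kind of finding without a full heap dump, here is a minimal Python sketch that totals the output of `jmap -histo <pid>` by package prefix. The function names are my own, and the prefix you pass in is whatever package you want to single out. Keep in mind that the histogram reports shallow sizes per class, so byte arrays and other objects held by the cache are counted under their own class names; a retained-size view such as the one in Eclipse MAT is still the better measure.

```python
import re
from collections import defaultdict

# One data row of `jmap -histo` output looks like:
#    1:        1000        2048000  some.package.ClassName
ROW = re.compile(r"^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S+)")


def bytes_by_prefix(histo_text, prefixes):
    """Sum shallow bytes per package prefix; unmatched classes go to 'other'."""
    totals = defaultdict(int)
    for line in histo_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, separator, "Total" footer, or blank line
        size, cls = int(m.group(2)), m.group(3)
        key = next((p for p in prefixes if cls.startswith(p)), "other")
        totals[key] += size
    return dict(totals)


def share(totals, prefix):
    """Fraction of all counted bytes that belongs to the given prefix."""
    total = sum(totals.values())
    return totals.get(prefix, 0) / total if total else 0.0
```

Save the histogram to a text file, read it in, and call `bytes_by_prefix(text, ["com.gemstone.gemfire."])`. Running it on histograms taken a few minutes apart shows whether that package's share is growing.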
I also noticed that the JVM on SASServer1_1 does actively try to perform GC. After a major GC, the heap that was released gets filled back up within minutes, which I also traced to GemFire filling up the SASServer1_1 JVM heap.
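To put a number on "filled back up within minutes", a rough Python sketch like the one below can estimate the refill rate between major collections. It assumes you already have heap-used samples as (seconds, MB) pairs, for example exported from JConsole or derived from `jstat` output. It treats any large drop between consecutive samples as a major GC, which is a heuristic, and the 1024 MB default threshold is an arbitrary choice.

```python
def refill_rates(samples, min_drop_mb=1024):
    """Estimate heap growth in MB/min between large drops in heap usage.

    samples: list of (seconds, used_mb) tuples in time order.
    A fall of at least min_drop_mb between two consecutive samples is
    treated as a major GC (a heuristic, not read from the GC log).
    """
    rates = []
    trough = None  # first sample after the most recent drop
    for prev, cur in zip(samples, samples[1:]):
        if prev[1] - cur[1] >= min_drop_mb:
            # prev is the peak just before this GC; measure growth since the last trough
            if trough is not None and prev[0] > trough[0]:
                minutes = (prev[0] - trough[0]) / 60
                rates.append((prev[1] - trough[1]) / minutes)
            trough = cur
    return rates
```

If the rate stays about the same with and without users connected, that supports the idea that the growth comes from background activity rather than user load.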
The SASServer1_1 JVM appears to use up all of the heap when there are more mid-tier web servers with cache data to sync between members.
Yes, I have enabled SAS Environment Manager to trace this issue as well. SAS Environment Manager reports the Cache Locator JVM status, which is fine at port 41415. What I am trying to find out is how GemFire uses the JVM heap allocated to SASServer1_1.
I have also been reading about Pivotal GemFire, which Cache Locator is based upon. There is a way to evict cache data, but this configuration is not available on SASServer1_1. On SASServer1_1, I have gone through the GemFire log and know that it creates storage for the web apps running under SASServer1_1.
I also know I can try to install the server again as described in https://communities.sas.com/t5/Administration-and-Deployment/Server-performance-slow-down/td-p/19884...
Please don't answer "contact support". I did, and they keep asking me to provide log file after log file without actually giving me a clear answer. In addition, I have been asked to reproduce the error so support can check and verify! (This is a production system and I don't want to take it down. I have been keeping the server up by adding more memory, limiting user connections, and restarting SASServer1_1.)
Thanks for your response. I am within the session count of SAS sizing. What I want to do is either
1) Manage and monitor the source of the GemFire cache data in SASServer1_1 JVM memory
2) Stop using GemFire altogether by going from scenario #9 to #5 as described in SAS0415-2017.pdf
If someone could give me instructions, I would greatly appreciate it. The system is already in production, so all I can do is change configurations, not rebuild them. (Nor do I have a spare 8-core, 64GB Windows 2012 server with SSD.)
Please take a look at my JVM chart to see if this is normal.
Restarting SASServer1_1 at 3PM
1. SASServer1_1 Log during shutdown at 3PM - Please see file SASServer1_1-Logs.pdf
=> From the log files, Tomcat suspects there may be a memory leak in some of the web applications.
2. SASServer1_1 JVM : Business Hours (3PM to 5:50PM) - Please see file SASServer1_1-BusinessHours.pdf
=> From the SAS JVM chart, we see the JVM heap peak at 15GB within minutes.
=> Initially, we thought this might be due to users rushing back to the system. However, in this JVM chart, there are no users connected to SAS at this time because we closed all port 80 traffic at our switch.
3. SASServer1_1 JVM: After BusinessHours. (No Major GC with no users) Please see SASServer1_1-AfterBusinessHOurs
=> From #1 and #2, we may have a memory leak. However, as you have proposed, this may be normal. Also many times JVM has been falsely accused of memory leak as described by the link
==> However, this is not the case, because during the night, when my thread count comes down and no users are in the system, the JVM does not perform a GC.
With the JVM heap dump plus #1, #2, and #3, I believe I have sufficient signs of a JVM memory leak caused by GemFire / Cache Locator. After many weeks of tracing, I can confirm the memory leak is NOT from SASServers (1_1, 2_1, etc).
Initial setup: mid-tier Web01, Web02, and Web03, with the cache locator on Web01. Under this setup, I usually have to restart the server one or more times a day. Then I tried the following (while maintaining the same user sessions and usage load):
1. Shut down Web01, which hosts the mid-tier cache locator, and left Web02 and Web03 running. This forced Web02 and Web03 to use the Cache Locator on SAS Compute, which resulted in me restarting every one to two days.
2. Kept Web01, but stopped the Cache Locator on Web01, and shut down Web02 and Web03. Since the Cache Locator is shut down on Web01, it uses the cache locator on SAS Compute. This resulted in having to restart every 3 to 5 days.
3. No Cache Locator - This is what I want to try, but need help to convert my existing setup by going from scenario #9 to #5 as described in SAS0415-2017.pdf
In my case the issue all appears to point back to Cache Locator / GemFire. For example, once in a blue moon,
1) Web01 log would show remaining members (web02, web03) gone, but the log on web 02 and web03 shows HTTP 502 - bad gateway. After more tracing, it was GemFire that falsely reported members down.
2) It also appears that WIP performance is highly dependent upon GemFire. Of course, when WIP goes down, everything stops. SAS has done a very good job of isolating WIP issues caused by PostgreSQL, but should also check the impact of GemFire on WIP. If my guess is right, fixing or stopping the GemFire issue will allow SASServer1_1 to handle 2 to 3 times the number of connections / data volume.
Perhaps it's not Cache Locator, but I need to know how to fix my SASServer1_1 JVM issue. I cannot manage memory that is growing faster than Big O(n).
There are many documented cases of memory leaks that require restarting.
and my favorite one
A memory leak exists that causes SAS Metadata Server performance to degrade over time.
To recover from this situation, you must restart the SAS Metadata Server.
If GemFire / Cache Locator is not the source of the memory leak, please at least identify what the memory leak is.
One of the reasons why your Mid-Tier Web Server1_1 is using a lot of memory could be due to non-optimal performance measurement settings. Also are VA users experiencing slow performance while viewing reports? If Server1_1 is memory constrained then I would not be surprised if VA report viewing is slow, with some refreshes taking a minute or two.
Check the Server1_1 log directory. You will find at least two log files there named tmlog*.*. When the server is performing well, these logs will be small, a few KB. If these files are large, like 100s of MBs (one of ours was > 1TB!), then you know the server could be overloaded doing performance monitoring.
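If you would rather script that check than eyeball the folder, a small Python sketch along these lines would do it. The function name and the 100 MB default threshold are my own choices, and you need to pass in the actual path to your Server1_1 log directory.

```python
from pathlib import Path


def large_tmlogs(log_dir, threshold_bytes=100 * 1024 * 1024):
    """Return (file name, size in bytes) for tmlog* files at or above the
    threshold, largest first."""
    hits = [
        (p.name, p.stat().st_size)
        for p in Path(log_dir).glob("tmlog*")
        if p.is_file()
    ]
    big = [h for h in hits if h[1] >= threshold_bytes]
    return sorted(big, key=lambda h: h[1], reverse=True)
```

An empty result means no tmlog file has reached the threshold; anything it returns is worth a closer look.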
It was interesting to note your experience with SAS Tech Support. This mirrors our own experience with SAS Tech Support when trying to solve recent poor VA performance. The reality is that getting to the bottom of VA performance problems requires a lot of analysis and supplying a lot of logs. In the end SAS Tech Support did indeed come up with the solution; it just took a long time to get there. I really suggest you keep persevering.
Thanks for the suggestion. I will keep an eye on the size of tmlog*.*
Yes, my users have experienced slow performance in VA, but when this happens, all related SAS web apps slow down too. I also did a lot of tracing and found that server12_1 was actually doing okay. VA being slow in the performance measurements was due to WIP on SASServer1_1. I had to use JConsole, JMap, and other tools to find that SAS Environment Manager impacts WIP and PostgreSQL greatly.
Please let me know about the size of your tmlog*.* files. If they are large then I can give you a config setting that may help.