12-06-2016 12:09 AM
I am a performance test lead at one of the leading banks. My client uses SAS as the platform for one of the systems that monitors fraud.
As a business requirement, any request sent from the ESB (middleware) to the fraud system (SAS) waits only 3 seconds, after which the ESB treats it as a timeout. We are currently facing a weird issue: the SAS RTDM server shows <2% timeouts when running with 3 nodes, whereas with all 4 nodes up and running, >70% of requests time out. (The setup is basically 2 servers with 2 ports each, for a total of 4 nodes, typically load balanced.)
Our SAS team here suggests that some shared resources are causing this issue, but unfortunately we have not been able to resolve it for almost a month. Can someone shed some light on this so we can resolve it ASAP?
Kindly let me know if any additional details are needed, since I am not sure whether the info shared is enough to explain the problem!
12-06-2016 11:15 AM - edited 12-06-2016 11:21 AM
Hello @Krish4590, nice question.
I don't know much about the RTDM solution, but I do know some solutions that need to be clustered and highly available, such as GRID.
The most common cause of this problem is quite simple: every SAS server/port is a separate sas.exe process. While a sas.exe process is running and processing a request, it won't accept additional requests. That is why a pool of connections is created. Therefore, if you get timeouts, the most common reason is that the sizing is not adequate: you receive more requests than your connections can handle (or the processes take much longer than initially estimated).
2 connections/ports seem a bit few to me. I would definitely increase the number of connections/ports available. Most problems with similar symptoms are solved in such a simple way. Of course, this might have an impact on your firewalls and on the total amount of memory/CPU used on your servers.
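As a rough sanity check on sizing, Little's law says the average number of busy connections is the arrival rate multiplied by the mean processing time. The sketch below is only an illustration of that arithmetic: the function name, the 70% target utilization and the example numbers are my own assumptions, not RTDM settings or measurements from this environment.

```python
import math


def required_connections(requests_per_second, mean_service_seconds,
                         target_utilization=0.7):
    """Estimate how many pooled connections are needed (hypothetical helper).

    Little's law: average busy connections = arrival rate x mean service time.
    Dividing by a target utilization below 1.0 leaves headroom for bursts.
    """
    if requests_per_second < 0 or mean_service_seconds < 0:
        raise ValueError("rate and service time must be non-negative")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    busy = requests_per_second * mean_service_seconds
    return math.ceil(busy / target_utilization)


# Illustrative numbers only: 20 requests/s at 0.5 s each keeps 10 connections
# busy on average, so about 15 are needed to stay near 70% utilization.
print(required_connections(20, 0.5))
```

If the estimate comes out well above the number of ports actually configured, undersizing is a plausible explanation; if it comes out well below, the cause is more likely elsewhere.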
Probably you would like to read the following SAS notes regarding fine tuning of Pooled servers (STP, PWS, etc):
Optionally also: http://support.sas.com/kb/57/180.html
PS. Of course, your SAS team might be right: some resources might be causing the process to take longer than expected. One idea I can think of is that the disks are queueing log writes and leaving the process idle for some time. You can analyse this by comparing the running times in the process logs from when things were fine with those in the current logs, or by monitoring your resources to make sure everything is fine from the lowest level up (disk queues lower than 0.1, and so on).
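To make that comparison concrete, something like the sketch below could summarise the response times of a good run and a bad run side by side. It assumes you have already extracted the per-request durations in seconds from your logs (the log format is not known here, so parsing is left out); the function name and the sample values are made up for illustration, and the 3-second limit is the ESB timeout mentioned in the question.

```python
import math


def summarize(durations_s, timeout_s=3.0):
    """Summarise request durations (seconds) for one test run.

    Returns the count, the median and 95th percentile (nearest-rank method),
    and the percentage of requests slower than timeout_s.
    """
    if not durations_s:
        raise ValueError("durations_s must not be empty")
    ordered = sorted(durations_s)
    n = len(ordered)

    def percentile(p):
        # Nearest-rank percentile: smallest value covering p% of the samples.
        rank = max(1, math.ceil(p / 100 * n))
        return ordered[rank - 1]

    timeouts = sum(1 for d in ordered if d > timeout_s)
    return {
        "count": n,
        "median_s": percentile(50),
        "p95_s": percentile(95),
        "timeout_pct": 100.0 * timeouts / n,
    }


# Made-up sample values, only to show the call; use your own extracted timings.
run_with_3_nodes = [0.4, 0.6, 0.5, 0.9, 0.7]
run_with_4_nodes = [2.8, 3.4, 5.1, 0.9, 4.2]
print(summarize(run_with_3_nodes))
print(summarize(run_with_4_nodes))
```

If the median itself jumps in the 4-node run, every request is slower (which points to a shared resource); if only the 95th percentile jumps, a subset of requests is queueing.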
12-08-2016 07:12 AM
Thanks for your detailed reply Mr. Juan!
The reason for having 2 nodes per instance of the SAS RTDM server was to facilitate high availability. Contrary to that goal, performance degrades when running with 4 nodes and improves when one of the nodes is shut down. The incoming TPS almost doubles when this is done, and processing times are much faster.
The same was observed when starting the execution with 3 nodes (1+2): when one node was brought down (1+1), performance improved. If the number of ports were not enough to handle the incoming requests, this should not have happened when the nodes were reduced. That is my understanding (correct me if I am wrong!).
Any clues on this inconsistent behavior?
12-06-2016 11:35 PM
It looks like you would need to increase the port banks in the object spawner configuration, so that it can accept more connections simultaneously.
These timeout settings are set in the cluster using SAS Environment Manager. By default it is set to >2%, but you could change it to a higher percentage and see if that solves your problem.
12-14-2016 02:15 AM
12-14-2016 07:26 AM
This sounds to me a lot like there is some interference behind the scenes, which needs some additional analysis.
Let me share with you what I would do on your scenario:
My feeling is that, while your problem seems to be tied to increasing the nodes of your highly available platform, increasing the number of ports of your pooled servers (Stored Process and Pooled Workspace servers) would help a lot (they manage queues as well).
12-15-2016 07:21 AM
Thanks again Mr. @JuanS_OCS for your detailed inputs. To update on the trials made: points 1 and 2 had already been tried by changing the logging mode to DEBUG on the Real Time Decision Managers (RTDMs), which helped to tune a couple of long-running queries, but nothing more.
Point 3 has not been attempted. Before that, we went ahead with point 4 and contacted SAS support with severity 1; we are now in communication with expert RTDM solution engineers and sharing logs and configs.
As for the final suggestion on increasing the ports on the pooled connections for SPCs: these are not being used in the current set-up, so according to our SAS team this would not be relevant to this issue.