Re: SAS 9.4 Performance issue | Clustered architecture for High availability

Krish4590 · Posted 12-06-2016 12:09 AM

Hi team,

I am a performance test lead at one of the leading banks. My client uses SAS as the platform for one of its fraud-monitoring systems.

As a business requirement, any request sent from the ESB (middleware) to the fraud system (SAS) waits only 3 seconds, after which the ESB treats it as a timeout. We are currently facing a weird issue: the SAS RTDM server operates with <2% timeouts when running with 3 nodes, whereas with all 4 nodes up and running, >70% of requests time out. (The setup is basically 2 servers with 2 ports each, for a total of 4 nodes, typically load balanced.)

Our SAS team here suggests that some shared resources are causing this issue, but unfortunately we have not been able to resolve it for almost a month. Can someone shed some light on this so we can resolve it ASAP?

Kindly let me know if any additional details are needed, since I am not sure whether the info shared is enough to explain the problem!

JuanS_OCS · Posted 12-06-2016 11:15 AM

Hello @Krish4590, nice question.

I don't know much about the RTDM solution, but I know about some solutions that need to be clustered and highly available, such as GRID.

The most common cause of this problem is quite simple: every SAS server/port is a separate sas.exe process. While a sas.exe process is running and processing a request, it won't accept additional requests. That is why a pool of connections is created. Therefore, if you get timeouts, the most common reason is that the sizing is not adequate: you receive more requests than your connections can handle (or the processes take much longer to run than initially estimated).

2 connections/ports seems a bit short to me. I would definitely increase the number of connections/ports available. Most problems with similar symptoms are solved in this simple way. Of course, this might have an impact on your firewalls and on the total amount of memory/CPU used on your servers.

You would probably like to read the following SAS notes on fine-tuning pooled servers (STP, PWS, etc.):
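To make the sizing argument concrete, here is a rough back-of-the-envelope sketch in Python based on Little's law (average requests in flight = arrival rate × average service time). The function name and the figures used (100 requests per second, 0.05 s service time, 70% target utilisation) are illustrative assumptions, not measurements from this system.

```python
import math


def min_pool_size(arrival_rate_tps: float, avg_service_time_s: float,
                  target_utilisation: float = 0.7) -> int:
    """Estimate how many one-request-at-a-time server processes are needed.

    Little's law gives the average number of requests in flight; dividing by
    a target utilisation leaves headroom so requests do not queue for long.
    """
    if not 0 < target_utilisation <= 1:
        raise ValueError("target_utilisation must be in (0, 1]")
    in_flight = arrival_rate_tps * avg_service_time_s
    return max(1, math.ceil(in_flight / target_utilisation))


# Assumed figures for illustration only.
print(min_pool_size(100, 0.05))
```

If the measured service time grows (for example because of contention on a shared resource), the same formula shows the required pool size growing with it, which is why both explanations in this thread can produce timeouts.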

http://support.sas.com/kb/40/567.html

http://support.sas.com/kb/48/421.html

Optionally also: http://support.sas.com/kb/57/180.html

PS. Of course, your SAS team might be right: some resources might be causing the process to take longer than expected. One idea I can think of is that the disks are queueing log writes and leaving the process idle for some time. You can analyse this by comparing the running times in the process logs from when everything was fine with those in the current logs, or by monitoring your resources to ensure everything is fine from the lowest level (disk queues lower than 0.1 and such).
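One possible way to do that comparison, sketched in Python: summarise the per-request durations from a "good" run and a "bad" run and compare the median, the 95th percentile and the share of requests over the 3-second limit. How the durations are extracted depends on the log format, which is not shown in this thread, so the function simply takes a list of durations in seconds; `summarise` and `TIMEOUT_S` are names made up for this sketch.

```python
import statistics

TIMEOUT_S = 3.0  # ESB timeout mentioned in the original post


def summarise(durations_s):
    """Return median, 95th percentile and share of requests over the timeout."""
    if len(durations_s) < 2:
        raise ValueError("need at least two samples")
    ordered = sorted(durations_s)
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(ordered, n=20, method="inclusive")[-1]
    over = sum(1 for d in ordered if d > TIMEOUT_S) / len(ordered)
    return {
        "median": statistics.median(ordered),
        "p95": p95,
        "timeout_share": over,
    }
```

Running `summarise` once on the durations from the 3-node run and once on those from the 4-node run would show whether the whole distribution shifts or only the tail, which hints at whether every request is slower or only some of them are stuck waiting.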

Krish4590 · Posted 12-08-2016 07:12 AM

Thanks for your detailed reply Mr. Juan!

The reason for using 2 nodes per instance of the SAS RTDM server was to provide high availability. Contrary to expectations, however, performance degrades when running with 4 nodes and improves when one of the nodes is shut down. The incoming TPS almost doubles when this is done and the processing times are much faster.

The same was observed when starting the execution with 3 nodes (1+2): when one was brought down (1+1), performance improved. If the number of ports weren't enough to handle the incoming requests, this should not have happened when the nodes were reduced, as I understand it (correct me if I am wrong!).

Any clues on this inconsistent behavior?

Regards,

Krish

Krish4590 · Posted 12-14-2016 02:11 AM

Hi Mr. Juan,

Any inputs for this behavior of the system?

Regards,
Krish

MadhuKiran1 · Posted 12-06-2016 11:35 PM

Looks like you would need to increase the port banks in the object spawner configuration; this lets it accept more connections simultaneously.

These timeout settings are set in the cluster using SAS Environment Manager. By default it is set to >2%, but you could change it to a higher % and see if that solves your problem.

Thanks !!

Krish4590 · Posted 12-14-2016 02:15 AM

Thanks for your reply Mr. MadhuKiran. Increasing the ports is actually the problem. The application works perfectly when one of the 4 nodes is turned off (3 nodes), and we are able to achieve 100+ TPS, whereas with 4 nodes we are able to achieve only 45 TPS.

And the timeout mentioned (<3 seconds) is the business requirement that any request must be answered within that time; otherwise it is automatically routed to the next component and treated as a timeout from the SAS server.
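For readers unfamiliar with this pattern, the timeout-then-route behaviour can be sketched in Python roughly as follows. This is only an illustration of the idea, not how the ESB is implemented; `call_with_timeout`, `slow_scoring` and `next_component` are hypothetical names, and the short sleep and timeouts are chosen so the demo runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def call_with_timeout(primary, fallback, request, timeout_s=3.0):
    """Call primary(request); if no answer within timeout_s, use fallback(request)."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary, request)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The late answer from primary is discarded, as in a real timeout.
        return fallback(request)
    finally:
        pool.shutdown(wait=False)  # do not block the caller on the slow call


def slow_scoring(request):  # hypothetical stand-in for the fraud check
    time.sleep(0.2)
    return "scored:" + request


def next_component(request):  # hypothetical stand-in for the fallback route
    return "routed:" + request


print(call_with_timeout(slow_scoring, next_component, "txn-1", timeout_s=0.05))
print(call_with_timeout(slow_scoring, next_component, "txn-2", timeout_s=1.0))
```

The point of the sketch is that a timed-out request still consumes server time after the caller has given up, so a high timeout rate can keep the servers busy with work nobody is waiting for.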

Regards,
Krish

JuanS_OCS · Posted 12-14-2016 07:26 AM

Hello @Krish4590,

this sounds to me very much like some interference behind the scenes, which needs some additional analysis.

Let me share with you what I would do in your scenario:

1. Check the logs of your high-availability system, since I would mostly expect issues with the queues or with allocating resources. Is it done with IBM/LSF or another product? (Sorry, as I said, I lack experience with RTDM.)
2. Your SAS logs (RTDM web application, STPs, etc.) should be able to give some information, as a heads-up at least. You might need to raise the logging level to DEBUG temporarily, which will also increase response times a bit.
3. Additionally, bring in a network analyzer tool such as Wireshark. I can imagine packet loss or ports closing due to collisions or capacity limits on the network. Again, points 1 and 2 are more interesting.
4. I would involve SAS Technical Support with an email or call from moment 0.

My feeling is that, while your problem seems to be centered on adding nodes to your high-availability platform, increasing the number of ports of your pooled servers (Stored Process and Pooled Workspace servers) would help a lot (they manage queues as well).
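As a quick first check before a full packet capture, a small Python sketch like the one below can time a plain TCP connection to each node's port. It is no substitute for a network analyzer and says nothing about what happens inside SAS; the function name and the host and port in the usage comment are placeholders, not values from this environment.

```python
import socket
import time


def connect_time_s(host, port, timeout_s=3.0):
    """Return the TCP connect time in seconds, or None if the connection fails."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return time.perf_counter() - start
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return None


# Hypothetical usage; replace with the real node addresses and ports:
# for host, port in [("node1.example.com", 8591)]:
#     print(host, port, connect_time_s(host, port))
```

Connect times that are much higher (or missing) for one node only would point towards the network or the listener on that node rather than towards the processing itself.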

Krish4590 · Posted 12-15-2016 07:21 AM

Thanks again Mr. @JuanS_OCS for your detailed inputs. To update you on the trials made: points 1 and 2 had already been tried by changing the logging mode to DEBUG on the Real Time Decision Managers (RTDMs), which helped to tune a couple of long-running queries, but nothing more.

Point 3 wasn't attempted; instead we went ahead with point 4 and contacted SAS support with severity 1. Expert RTDM solution engineers are now in communication with us, and logs and configs have been shared.

As for the final suggestion of increasing the ports on the pooled connections for SPCs: in the current set-up these are not being used, and hence, per our SAS team, this would not be relevant to this issue.

Sincere regards,

Krish
