04-05-2018 08:57 AM
Our production system's load times have gone through the roof over the last week, and can't seem to find any explanation on how it's happened at all. For months, it's been going at 1.5 hours for loading in all the tables we need from our Oracle database but last Thursday this suddenly increased up to 9 hours, and been like that since bar Tuesday night where we got it down to 2.5 hours, but was then back to its 9 hours timescale last night.
As this is a production system, it's having a huge effect on our users so I've been trying everything I can think of to see what the issue is (is it DB related, network related, hardware related or SAS related), but it seems to be like finding a needle in a haystack.
Going through logs (and I enabled FULLSTIMER to get extra stats), the CPU time seems to be consistent from the good runs to the bad runs, but the real time is the one that increased.
Would anyone have encountered any similar issues, or know of anything worth looking into to reduce the real time of the jobs.
04-23-2018 05:42 AM
Turns out it looks like it's McAfee and anti virus scanning that's causing it.
No idea how (yet), but just incase anyone else has this issue in the future.
04-05-2018 09:01 AM
Make sure that your disks on the SAS server are not the culprit (compare "local" jobs from 2 weeks ago with now).
But that just needs to be checked, my main avenue of investigation would be the network. Then the network, and after that the network.
04-05-2018 09:50 AM
Thanks for that. Anything in particular I should be looking at. I've been looking at the trends in the Environment Manager and can see a few things are up but don't know if they're causes or effects of any issues.
Like from a network point of view, I can see that the packets transmitted per minute has dropped significantly since like the 22nd March, but that doesn't really coincide with the days we're having issues. CPU usage has also increased since Thursday from about 10% average to 30% but that could just be as a result of the jobs taking much longer.
Anything else I'm looking at within the Environment Manager isn't really giving me any indicator that something is wrong
04-05-2018 09:56 AM
Try to isolate the Oracle query, so you can narrow down the problem. Like running the query but not storing the result if possible.
Also, try run the same query local in Oracle. If you haven't already done so, involve Oracle DBA's in the work.
FULLSTIMER will not tell you anything what's going on outside SAS.
04-05-2018 11:15 AM
I've involved the Oracle DBAs we have here but they can't see anything on their side as to the cause of it. Query is a normal Select * from table and we're looking at about 14m rows and 5 columns on it. We've had a DBA run it off multiple different environments but he's not really spotting anything that's causing any slowness. I'd have assumed (can correct me with if it's a wrong assumption) that with the CPU time being so much slower than the Real Time, that the issue is on SAS or the infrastructure.
04-05-2018 11:33 AM
Typically, drastic changes like these would pin down to infrastructure changes outside the control of the software!
It's either a Network Bandwidth changes, or Disk I/O threshold changes or Worse Case Scenario - BOTH!!
If your Oracle Database didn't have version changes nor patches applied, that means there were no software changes on the Oracle Part.
IF your SAS programs and scripts didn't have any modifications applied to them, that means there were no software changes on the SAS part.
That leaves you with your Storage & Network Admins to talk to and find out if there has been any recent changes .
Hope this helps,
04-05-2018 02:41 PM
Could be there are too many users at those peak times, @AhmedAl_Attar.
I now know this is neither here nor there -
v/r and with Regards,
ex-dMiner, always wSeeker
04-10-2018 09:11 AM
Doesn't seem to be Network or IO related anyway from what we've seen. We thought it might have been network but tried a few things there and after every time we reboot the server, it seems fine, but each successive load gets a lot slower then.
Once we reboot and do a load, it takes it's normal time, next load takes twice as long, and then 3rd load takes 3 times as long and stays like that.
Anything that might be getting filled up somewhere that could explain this
04-10-2018 09:39 AM
Not knowing how your Load Process is launched nor what it's involves, but have you checked if there were any Processes/Temporary Files left over after your load is "done"?
Rebooting a server should never be a go-to solution!! What kind of a Production System you are running?
You'll need to have your ETL developers re-examine their routines and steps to ensure they are not implementing techniques causing such behavior.
04-10-2018 10:26 AM
And the use the Windows Task Manager to search for sas processes
With all that said, You still need to look into
- How your load Process launched - Batch/Interactive
- How your ETL developers developed the processes - Full re-load/Incremental update?
You'll need to understand the full picture before taking actions that could make things worse!
Need further help from the community? Please ask a new question.