01-15-2018 04:03 PM - last edited on 01-20-2018 09:47 AM by ChrisHemedinger
SAS FULLSTIMER stat, system CPU. I am investigating a particular SAS step where the user CPU time is 1 hour, the system CPU time is 1 hr. 45 min., and the difference between total CPU time and real time is > 15%. No other step has this issue. The same code on a Linux server takes less time: user CPU is 35 min, system CPU is 47 min, and the difference between CPU time and real time is only 2%.
I have researched extensively why the system CPU time would be so much higher and take so much longer. I am inclined to look at I/O, the network (NIC and firmware), and excessive auditing, and to pursue ETW. I am also asking here in case any of the good folks here have thoughts or ideas for further investigation.
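To make the comparison concrete, here is a minimal Python sketch of the arithmetic behind these figures. The function names are hypothetical, and the definition of the "difference" (real time minus total CPU time, as a share of real time) is an assumption, since the post does not state which way it was computed. Real time itself is not given above, so only the system CPU share is worked out from the posted numbers.

```python
def system_share_percent(user_s, system_s):
    """System CPU time as a percentage of total (user + system) CPU time."""
    total = user_s + system_s
    if total <= 0:
        raise ValueError("total CPU time must be positive")
    return 100.0 * system_s / total


def cpu_gap_percent(real_s, user_s, system_s):
    """Gap between real time and total CPU time, as a percentage of real time.

    Assumed definition: a positive value means the step spent time waiting
    (not on CPU); a negative value can occur for multi-threaded steps.
    """
    if real_s <= 0:
        raise ValueError("real time must be positive")
    return 100.0 * (real_s - (user_s + system_s)) / real_s


# Figures from the post, converted to seconds.
problem_server = system_share_percent(user_s=60 * 60, system_s=105 * 60)
linux_server = system_share_percent(user_s=35 * 60, system_s=47 * 60)
print(f"system share, problem server: {problem_server:.1f}%")  # 63.6%
print(f"system share, Linux server:   {linux_server:.1f}%")    # 57.3%
```

On these numbers the system share is high on both servers, so the absolute system CPU time and the gap to real time are probably the more telling differences.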
Thank you in advance.
01-15-2018 04:11 PM
01-15-2018 04:16 PM
I suggest you post the SAS log of the step (including FULLSTIMER notes) on the community as well to get feedback from community experts.
01-16-2018 02:28 PM
At Boemska we offer a product called Enterprise Session Monitor for SAS. It's a piece of software that plugs into your SAS Environment and profiles the resource utilisation of individual SAS jobs, producing timeseries data which our customers use to optimise job performance, often focusing on single problem steps like the one you describe.
ESM records and visualises the CPU/memory utilisation, temp directory size and IO throughput of each individual job, allowing you to contrast it with the resource profile of the node it's executing on, showing metrics like iowait, per-device throughput (for both storage and network devices), disk queue lengths and cache/swap size. The data is very granular (2s intervals) and the interactive investigative workflow makes root cause analysis a relatively pleasant experience.
We're a SAS partner organisation & this is a separate proprietary product, but we offer a free 60 day trial, meaning you could take it for a spin for a couple of months with a view to resolving your immediate issue, no strings attached. Feel free to contact me privately if you're interested.
01-17-2018 04:31 AM
There is a third CPU timing metric, wait-for-IO. You should include that in your evaluation as well. Depending on your topology (any flavour of NFS or CIFS storage will be detrimental) this may give a more accurate interpretation of what you are observing. Actually it is not clear if what you are seeing is in fact a problem.
The elapsed (wall clock) time is dependent on much more than just your code. If you are running on a highly loaded system the ratio will be higher. The same job running at a different time of day may show vastly different results. So look at the system activity next to your job. As you already suspected, IO and network can be at play. Run vmstat, nmon, or whatever is at your disposal alongside your job to see what's going on. The tooling from @boemskats can do this even better. Your metrics depend heavily not only on your job but on others as well.
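If vmstat or nmon is not to hand on the Linux server, a rough equivalent can be sketched in Python by sampling the aggregate `cpu` line of `/proc/stat` while the job runs. This is only an illustration under assumptions: it is Linux-only (it does not apply to a Windows host, where performance counters or ETW would be the route), the helper names are hypothetical, and it reports system-wide figures, not per-job ones.

```python
import time

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counts."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Percentage of elapsed CPU ticks spent in each state between two samples."""
    deltas = {k: after.get(k, 0) - before.get(k, 0) for k in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {k: 0.0 for k in FIELDS}
    return {k: 100.0 * d / total for k, d in deltas.items()}


def read_cpu_sample(path="/proc/stat"):
    """Read one sample; the first line of /proc/stat is the aggregate 'cpu' line."""
    with open(path) as f:
        return parse_cpu_line(f.readline())


def watch(samples=5, interval=2.0):
    """Print user/system/iowait percentages once per interval."""
    prev = read_cpu_sample()
    for _ in range(samples):
        time.sleep(interval)
        cur = read_cpu_sample()
        pct = cpu_percentages(prev, cur)
        print(f"user {pct['user']:5.1f}%  system {pct['system']:5.1f}%  "
              f"iowait {pct['iowait']:5.1f}%")
        prev = cur
```

Calling `watch()` in a second session while the step runs gives a quick view of whether system time and iowait climb when that step starts, and whether other workloads on the box are contributing.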
If you have a challenge (which, again, is not entirely clear from what info you provided) I would work with @MargaretC and her team. They are excellent. If they haven't seen it before it probably doesn't exist.
01-19-2018 04:49 PM