Hi all,
I have worked with several organisations over the years, all using SAS platforms in all sorts of shapes and ways.
Nowadays, when I arrive at a new organisation, I tend to ask: have you and your team sat down, brainstormed and produced a document outlining your platform's biggest risks and weaknesses?
And have you documented the steps to follow to combat these risks and to address these weaknesses?
In most cases the answer is NO.
When problems do occur, they are simply dealt with on an ad hoc basis.
I just wondered whether there are situations one can address in advance by listing such risks and weaknesses.
A typical example of a weakness is:
On a SAS grid of 10 nodes, someone kicks off a job. The job lands on a random node, say grid node 7. The job soon gets the node into trouble by consuming more than 90% of the pagefile memory, and the node then freezes because there is no more pagefile memory available.
This is considered a weakness because there is no obvious way to be warned about the problem with the SAS job before the node freezes.
Would a suitable step to avert such a weakness be to run svmon regularly on all the nodes, to check that pagefile memory usage is OK and nothing dodgy is going on?
SAS Environment Manager can perhaps monitor the platform, but I do not know whether it covers this kind of check.
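To make the svmon idea a bit more concrete, here is a rough sketch in Python of what such a check might look like. It is only an illustration under several assumptions of mine: that the nodes run AIX, that `svmon -G` prints a "pg space" row with the total size followed by the pages in use, that the nodes are reachable over ssh with key-based login, and that the account used is allowed to run svmon (it may need elevated privileges). The function names and the 90% threshold are made up for the example.

```python
import re
import subprocess

# Assumed layout of the "pg space" row in `svmon -G` output on AIX:
#   pg space     524288        3219
# i.e. total size, then pages in use (both counted in pages).
PG_SPACE_ROW = re.compile(r"^\s*pg space\s+(\d+)\s+(\d+)", re.MULTILINE)


def paging_space_used_percent(svmon_output: str) -> float:
    """Return paging-space usage as a percentage, parsed from `svmon -G` text."""
    match = PG_SPACE_ROW.search(svmon_output)
    if match is None:
        raise ValueError("no 'pg space' row found in svmon output")
    size, inuse = int(match.group(1)), int(match.group(2))
    if size == 0:
        raise ValueError("paging space size reported as zero")
    return 100.0 * inuse / size


def nodes_over_threshold(outputs: dict[str, str], threshold: float = 90.0) -> list[str]:
    """Return the nodes whose paging-space usage is at or above the threshold."""
    return sorted(
        node for node, text in outputs.items()
        if paging_space_used_percent(text) >= threshold
    )


def collect_svmon(nodes: list[str]) -> dict[str, str]:
    """Run `svmon -G` on each node over ssh (hypothetical helper, key-based login assumed)."""
    results = {}
    for node in nodes:
        completed = subprocess.run(
            ["ssh", node, "svmon", "-G"],
            capture_output=True, text=True, check=True, timeout=60,
        )
        results[node] = completed.stdout
    return results
```

Something like `nodes_over_threshold(collect_svmon(["gridnode1", "gridnode7"]))` could then be run from cron every few minutes and the result mailed to the admins. The host names are placeholders. Keeping the parsing separate from the ssh call means the threshold logic can be tested against saved svmon output without touching the grid.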
So what other platform weaknesses have you discovered on your grid? Anything worth documenting and sharing?