I have worked with several organisations over the years, all using SAS platforms in all sorts of ways and shapes. Nowadays, when I arrive at a new organisation, I tend to ask two questions: Have you and your team sat down, brainstormed and produced a document outlining your platform's biggest risks and weaknesses?
And have you documented the steps to follow to combat these risks and to address these weaknesses?
In most cases the answer is NO. When problems do occur, they are simply dealt with on an ad hoc basis. I just wondered whether there are situations one can address in advance by listing such risks and weaknesses.
A typical example of a weakness: on a SAS grid of 10 nodes, someone kicks off a job. The job lands on a random node, say grid node 7. The job soon causes the node to run into trouble by consuming more than 90% of the pagefile memory, and then the node freezes because there is no more pagefile memory available. This is considered a weakness because there is no obvious way to be warned that the SAS job is causing the issue.
Would a suitable step to avert such a weakness be to regularly run svmon on all the nodes to check that the pagefile memory is OK and nothing dodgy is going on?
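To make the idea concrete, here is a minimal sketch in Python of the check itself. It assumes you already have a pagefile usage percentage per node from svmon or whatever tool you prefer; the collection step is deliberately left out. The function name, the node names and the 80% warning threshold are all made up for illustration, the point being only to warn well before the 90% mark where the node got into trouble.

```python
from typing import Dict, List

# Hypothetical warning level; chosen to fire before the ~90% usage
# at which the node in the example above ran into trouble.
WARN_THRESHOLD_PCT = 80.0


def nodes_over_threshold(
    usage_pct: Dict[str, float], threshold: float = WARN_THRESHOLD_PCT
) -> List[str]:
    """Return the sorted names of nodes whose pagefile usage (in percent)
    is at or above the threshold."""
    return sorted(node for node, pct in usage_pct.items() if pct >= threshold)


if __name__ == "__main__":
    # Made-up sample readings; in practice these would come from a
    # scheduled collection run across all grid nodes.
    sample = {"grid01": 12.0, "grid07": 93.5, "grid10": 41.0}
    for node in nodes_over_threshold(sample):
        print(f"WARNING: {node} pagefile usage at {sample[node]:.1f}%")
```

Run on a schedule, something like this would at least tell you which node to look at before it freezes, though it would not by itself tell you which SAS job is responsible.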
Environment Manager can perhaps monitor your platform, but I do not know whether pagefile usage is among the things it covers.
So what other platform weaknesses have you discovered on your Grid? Anything worth documenting and sharing?
But work culture, best practices and usage differ from organization to organization. The role, authority, morale, motivation level and career options of the SAS Administrators also vary from organization to organization.
The expectations placed on SAS Admins, in my opinion, often go beyond SAS.
These factors have an important bearing on this exercise.
The need is felt the most when one joins a new organization as an admin. I do often try to document whatever I can and hand it over to the new admin when moving on.
It's more of a run book kind of question, highlighting problems that could arise and how to deal with them. Different sites can have different problems depending on how SAS is consumed (batch, end users, solutions-based deployments and so on), so I don't think there can be a one-size-fits-all document highlighting platform-specific risks. A few common ones, for sure, would be the underlying host resources and disk space.
I think if there are a lot of SAS users at your site, it should be a combination of alerting rules, thresholds and user education to avoid such scenarios. For SAS Grid you can have rules like:
- No process can use more than 500G of SASWORK; if more space is needed, it has to be requested from the platform team
- Limit on the number of jobs submitted
- Usage of queues based on priority
- Specific nodes for departments/groups
- High-volume jobs to be executed at night or on weekends
- User education on how to plan and use the SAS platform when submitting jobs that may take a lot of resources
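As a rough illustration of how the first two rules could be turned into an automated check, here is a small Python sketch. It is not tied to any SAS Grid or scheduler API: the `Job` record, the `violations` function and the per-user job limit of 10 are hypothetical, and only the 500G SASWORK figure comes from the list above. Where the job data comes from is up to your site.

```python
from dataclasses import dataclass
from typing import Dict, List

SASWORK_LIMIT_GB = 500   # from the first rule above
MAX_JOBS_PER_USER = 10   # hypothetical value; set whatever suits your site


@dataclass
class Job:
    user: str          # who submitted the job
    saswork_gb: float  # SASWORK space currently used by the job, in GB


def violations(jobs: List[Job], max_jobs: int = MAX_JOBS_PER_USER) -> List[str]:
    """Return one message per broken rule: SASWORK over the limit,
    or too many jobs submitted by a single user."""
    messages: List[str] = []
    counts: Dict[str, int] = {}
    for job in jobs:
        counts[job.user] = counts.get(job.user, 0) + 1
        if job.saswork_gb > SASWORK_LIMIT_GB:
            messages.append(
                f"{job.user}: job uses {job.saswork_gb:.0f}G of SASWORK "
                f"(limit {SASWORK_LIMIT_GB}G)"
            )
    for user, count in sorted(counts.items()):
        if count > max_jobs:
            messages.append(f"{user}: {count} jobs submitted (limit {max_jobs})")
    return messages


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    for message in violations([Job("alice", 620.0), Job("bob", 10.0)]):
        print(message)
```

The messages could feed an alert or simply a polite email to the user, which ties in with the user education point.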