06-09-2015 12:53 PM
We have our GRID nodes running on Windows (I know). So they need to be rebooted periodically. What is the best way to handle running jobs when we need to restart the servers?
06-09-2015 01:03 PM
06-09-2015 03:49 PM
There is no need to reboot the Windows machines. You can have them running for a long period.
Problems are often within the "applications", like SAS. The cause can be memory problems or synchronisation issues. The real solution is a question for the developers, in this case SAS TS and the developers of the SAS system.
If you need to work around issues in SAS, you could plan to restart all SAS servers (or, better word, services), e.g. the metadata server. In that case your batch processes will not be affected.
If you need a planned outage of the OS, you can schedule it so that cancelling running jobs is an expected event.
06-10-2015 03:05 AM
What is a "running job"? A scheduled batch job, or something initiated interactively by SAS VA or Enterprise Guide?
06-10-2015 08:27 AM
Both, we have scheduled batch jobs and users running jobs interactively. So I'm wondering if there's a way, or a best practice, to stop processing new jobs say 30 minutes prior to bouncing the servers and a way to "gracefully" stop existing jobs immediately prior to restarting the servers. Thanks for your reply.
06-10-2015 10:02 AM
Batch jobs should always be written in a way that allows them to crash or be stopped unexpectedly, and be rerun without causing damage to data. E.g., new observations added to a table should "know" which run added them, so that a repeat of that run can filter them out before repeating the table update.
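A minimal sketch of that idea, written in Python with an in-memory SQLite table rather than SAS, purely for illustration. The table `sales`, its columns, and the `run_id` value are hypothetical names, not anything from an actual job:

```python
import sqlite3

def load_batch(conn, run_id, rows):
    """Insert rows tagged with run_id; a rerun first removes that run's earlier rows."""
    with conn:  # one transaction: the delete and the inserts commit together or not at all
        conn.execute("DELETE FROM sales WHERE run_id = ?", (run_id,))
        conn.executemany(
            "INSERT INTO sales (run_id, item, amount) VALUES (?, ?, ?)",
            [(run_id, item, amount) for item, amount in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (run_id TEXT, item TEXT, amount REAL)")

rows = [("widget", 10.0), ("gadget", 5.5)]
load_batch(conn, "2015-06-10", rows)
load_batch(conn, "2015-06-10", rows)  # simulated rerun after a crash or restart

count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(count)
```

Because every row carries the ID of the run that wrote it, the rerun replaces its own earlier rows instead of duplicating them.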
With interactive sessions you can't really know what timespan is right. Some jobs take seconds, some literally days.
In that context I'd like to see a tool that allows a SAS administrator to send messages to metadata-driven clients like EG.
Right now, one has to develop methods to do that outside of SAS or with the use of external commands (like running a ps on UNIX that finds the workspace servers, deduces the user IDs, finds the email addresses of those users and sends an email saying that the server will be going down).
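A rough sketch of such an external method, in Python. It assumes that `ps -eo user,args` is available on the UNIX host, that workspace server processes can be recognised by some substring in their command line (the `"wsserver"` pattern and the sample paths below are placeholders, so check what your installation actually shows), and that you maintain your own user-to-email lookup (the `directory` dict is hypothetical):

```python
import subprocess

def users_running(ps_output, pattern):
    """Return the sorted user IDs owning processes whose command line contains pattern."""
    users = set()
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 1)  # user ID, then the full command line
        if len(parts) == 2 and pattern in parts[1]:
            users.add(parts[0])
    return sorted(users)

def recipients(users, directory):
    """Map user IDs to email addresses, skipping users missing from the directory."""
    return [directory[u] for u in users if u in directory]

def current_ps_output():
    """Run ps on the local host and return its text output."""
    result = subprocess.run(
        ["ps", "-eo", "user,args"], capture_output=True, text=True, check=True
    )
    return result.stdout

# Example with canned ps output, so the parsing can be tried without a SAS host.
sample = (
    "USER     COMMAND\n"
    "alice    /opt/sas/sas -noterminal wsserver\n"
    "bob      bash\n"
    "carol    /opt/sas/sas wsserver\n"
)
affected = users_running(sample, "wsserver")
emails = recipients(affected, {"alice": "alice@example.com"})
```

On a real host you would pass `current_ps_output()` instead of `sample`, and hand the resulting addresses to whatever mail tool you already use (Python's `smtplib`, for instance).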
06-10-2015 01:02 PM
Kurt, EGuide with grid and parallel code submission is not an interactive-only approach anymore. It is more like doing batch work; that is what the flow processing in EGuide and Studio offers: batch processing by self-service.
There is an advanced topic for dower to think about: checkpoint restart in SAS. With that you should be able to cancel long-running jobs and restart them later. The topic is an advanced one with a lot of prerequisites. The only case where I have seen checkpoint restart being used is with mainframe job scheduling, with jobs that run for several days.
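To show the general idea only (this is not SAS's checkpoint-restart feature, just a generic Python sketch of the concept): the job records each completed step in a file, and a rerun skips the steps already recorded. The function name, the step names, and the JSON checkpoint file are all made up for the illustration:

```python
import json
import os

def run_with_checkpoints(steps, checkpoint_path):
    """Run (name, func) steps in order, skipping names already recorded as done."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)  # names of steps completed by an earlier run

    executed = []
    for name, func in steps:
        if name in done:
            continue  # finished before the interruption, do not repeat it
        func()  # if this raises, the checkpoint file still lists the earlier steps
        done.append(name)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)  # record progress after every completed step
        executed.append(name)
    return executed
```

If the job is cancelled during the third of five steps, the next call with the same checkpoint file starts again at step three. This only works when each step is itself safe to repeat, which is the same requirement as for rerunnable batch jobs in general.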