We use SAS Grid (interactive and batch) and IBM LSF (independent of SAS) in our enterprise. Consider one node in the grid that has 376 GB of RAM. Any process that is spawned is allocated 1 GB of memory at startup, and the SAS launch script has a memory limit of 7 GB. In the LSF application/queue settings we have restricted the maximum memory limit to 10 GB.
We are facing a resource crunch situation due to which one of the nodes goes into closed_busy status. On checking the indices for that node, we see memory utilization as being the plausible cause of closure. However, the output of 'free -g' is as follows:
free -g
              total        used        free      shared  buff/cache   available
Mem:            376          18           1           1         356          59
Swap:            15           3          12
I think this means there is enough memory available on the server (buff/cache column = 356). Also, on checking RTM, the physical memory utilization of the processes on that node seems reasonable (~1-2 GB for each of the 15-20 processes, so a total of ~15-50 GB). However, in RTM I also see the following for these processes. Is there any relation between virtual memory and physical memory utilization on a node from an LSF/SAS perspective? I thought LSF only looks at the physical memory utilization on the node and closes it if it's too high. In our case, physical memory utilization is not high. Virtual memory utilization is high, but only for a few seconds. So why does LSF close the node?
The host status 'closed_busy' usually indicates that one of the host's load indices has exceeded the threshold specified in the lsf.cluster.<cluster_name> file in the LSF configuration directory. An example would be:
HOSTNAME   model     type     server   r1m   mem   swp   RESOURCES     #Keywords
apple      Sparc5S   SUNSOL   1        3.5   1     2     (sparc bsd)   #Example
In this example, if the 'r1m' load index rose above 3.5 or the available physical memory dropped below 1 MB, the host 'apple' would show as closed_busy.
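To make the threshold rule above concrete, here is a minimal Python sketch of the comparison, using the values from the 'apple' row. It is only an illustration: the function name, the dictionaries and the grouping of indices into "higher is busier" and "lower is busier" are assumptions made for this example, not LSF's own code or API.

```python
# Hypothetical sketch of the closed_busy threshold check; not LSF code.

# Assumed grouping: for these indices a larger value means more load,
# so the threshold is crossed when the value rises above it.
INCREASING = {"r15s", "r1m", "r15m", "ut", "pg"}
# For these indices a smaller value means fewer free resources,
# so the threshold is crossed when the value drops below it.
DECREASING = {"mem", "swp", "tmp"}


def exceeded_indices(load, thresholds):
    """Return the sorted names of load indices that cross their thresholds."""
    hits = []
    for name, limit in thresholds.items():
        value = load.get(name)
        if value is None:
            continue  # no measurement for this index
        if name in INCREASING and value > limit:
            hits.append(name)
        elif name in DECREASING and value < limit:
            hits.append(name)
    return sorted(hits)


# Thresholds from the 'apple' example row: r1m 3.5, mem 1 (MB), swp 2 (MB).
thresholds = {"r1m": 3.5, "mem": 1, "swp": 2}

print(exceeded_indices({"r1m": 4.2, "mem": 512, "swp": 1024}, thresholds))
print(exceeded_indices({"r1m": 0.8, "mem": 0.5, "swp": 1024}, thresholds))
print(exceeded_indices({"r1m": 0.8, "mem": 512, "swp": 1024}, thresholds))
```

With these sample values, the first call reports 'r1m', the second reports 'mem', and the third reports nothing. On a real cluster, the thresholds and current load values would come from the LSF configuration and commands rather than hard-coded dictionaries.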