We have installed SAS 9.4 M7 in a Linux grid environment. We also have a DR environment, and last week we switched to the DR site. The /sasbin and /sasconfig filesystems are shared between the sites. We have 3 compute servers on the grid.
Since we switched to the second site, we have been experiencing intermittent problems with LSF. After exactly 24 hours, 2 of the 3 servers stop receiving jobs (always the same two servers). When we check the services (LSF and the ObjectSpawner), they appear to be ok. When these two servers stop receiving jobs, the following error appears in the sbatchd log:
get_new_master(): ns_getHostNameBySockaddr_(xx.xx.xx.xx: 57626) failed, errmsg: Name or service not known.
After the error appears, the only way to get things working again is to restart the LSF service on these two servers. Any idea why LSF stops?
Thanks, @gwootton. In fact, last week we had a problem with reverse DNS for these two machines, but it was resolved. When the error appeared, we tried a ping -a against these servers, and it looks like they resolve ok.
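For reference, the forward and reverse lookups can also be checked outside of ping. The following is only a minimal sketch, assuming Python 3 is available on each grid node; the function names `check_host` and `report` are made up for this post, and the comparison on short host names is an assumption about what counts as a match.

```python
import socket


def check_host(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve name to an IP, resolve the IP back, and report what happened.

    The forward/reverse parameters default to the standard library resolvers
    and exist only so the function can be exercised without a network.
    """
    result = {"host": name, "ip": None, "reverse_name": None, "status": None}
    try:
        result["ip"] = forward(name)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        result["status"] = f"forward lookup failed: {exc}"
        return result
    try:
        result["reverse_name"] = reverse(result["ip"])[0]
    except OSError as exc:  # socket.herror is a subclass of OSError
        result["status"] = f"reverse lookup failed: {exc}"
        return result
    # Assumption: compare short names only, ignoring the domain and case.
    same = name.split(".")[0].lower() == result["reverse_name"].split(".")[0].lower()
    result["status"] = "ok" if same else "reverse name differs"
    return result


def report(hosts):
    """Print one line per host with the outcome of both lookups."""
    for host in hosts:
        r = check_host(host)
        print(f'{r["host"]}: ip={r["ip"]} reverse={r["reverse_name"]} -> {r["status"]}')
```

Calling `report([...])` with the three compute host names, from each of the three servers, shows whether every node can resolve every other node in both directions. Running it around the 24-hour mark may be more telling than running it right after a restart.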
Today, while the services were down, we tried adding a hosts file in the /.../Platform_Suite/lsf folder. The hosts file was created as the lsfadmin user, and we added the 3 compute servers to it. But after we restarted the LSF services, nothing worked (the order was: stopped the LSF services on the 3 servers, stopped the ObjectSpawner on the 3 servers, created the hosts file in the folder mentioned above, started LSF and the ObjectSpawner on each server). The logs don't show any errors. So we decided to delete the hosts file we had created, restarted the LSF services again, and it worked again (at least for the next 24h, 😩).
Did we do anything wrong, or do it in the wrong order?
Looking at the LSF config files, the only thing we found is that the MASTER_SERVER list (the order of the server names) is not exactly the same in lsf.conf and ego.conf. Could this be related to our problem?
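To make the difference between the two files easier to see, here is a rough sketch that compares the two lists. It assumes the lists in question are the `LSF_MASTER_LIST` parameter in lsf.conf and the `EGO_MASTER_LIST` parameter in ego.conf, written as `KEY="host1 host2 ..."`; the helper names and the host names in the example are hypothetical. It only reports the difference and says nothing about whether the difference causes the problem.

```python
def read_param(text, key):
    """Return the space-separated values of KEY=... in conf text, or None if absent."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            return value.strip().strip('"').split()
    return None


def compare_master_lists(lsf_text, ego_text,
                         lsf_key="LSF_MASTER_LIST", ego_key="EGO_MASTER_LIST"):
    """Compare the master host lists found in lsf.conf and ego.conf contents."""
    lsf_hosts = read_param(lsf_text, lsf_key)
    ego_hosts = read_param(ego_text, ego_key)
    found = lsf_hosts is not None and ego_hosts is not None
    return {
        "lsf": lsf_hosts,
        "ego": ego_hosts,
        "same_members": found and sorted(lsf_hosts) == sorted(ego_hosts),
        "same_order": found and lsf_hosts == ego_hosts,
    }


# Example with made-up host names: same members, different order.
example = compare_master_lists(
    'LSF_MASTER_LIST="hostA hostB hostC"\n',
    '# comment line\nEGO_MASTER_LIST="hostB hostA hostC"\n',
)
print(example["same_members"], example["same_order"])
```

With real files, the two strings would come from reading lsf.conf and ego.conf on each server, for example with `open(path).read()`.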