MariaD
Barite | Level 11

Hi folks,

 

We have installed SAS 9.4 M7 in a Linux grid environment, and we also have a DR environment. Last week we switched to the DR site. The /sasbin and /sasconfig filesystems are shared between sites. The grid has 3 compute servers.

 

After we switched to the second site, we started experiencing intermittent failures in LSF. After exactly 24 hours, 2 of the 3 servers stop receiving jobs (always the same two servers). When we check the services (LSF and Object Spawner), they appear to be running fine. When these two servers stop receiving jobs, the following error appears in the sbatchd log:

 

get_new_master(): ns_getHostNameBySockaddr_(xx.xx.xx.xx: 57626) failed, errmsg: Name or service not known.

 

After the error appears, the only way to get things working again is to restart the LSF service on these two servers. Any idea why LSF stops?

 

Regards, 

6 REPLIES 6
JuanS_OCS
Amethyst | Level 16
Hi MariaD,
Have you enabled Kerberos on the service account?
At first sight, and without looking into the logs, this looks to me like a Kerberos TGT expiration issue.
Either that or a strange network issue, where something is refreshed after 24 hours, but not properly.
Does it help?
MariaD
Barite | Level 11

Thanks, we have not enabled Kerberos on the service account.

gwootton
SAS Super FREQ
It looks like we are failing to reverse lookup the hostnames by IP correctly. You may need to configure an LSF hosts file and confirm reverse DNS lookup works as expected.

https://www.ibm.com/support/pages/usage-lsf-hosts-file
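
To confirm reverse lookup from each grid node, something like the following could be run on every server. This is only a minimal sketch using Python's standard socket module; the function name check_reverse_lookup and the sample address are placeholders of mine, not anything LSF provides, and the script goes through the operating system resolver, which may not behave exactly like LSF's own lookup.

```python
import socket


def check_reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return (True, hostname) if ip reverse-resolves, else (False, error text).

    resolver is injectable only so the function can be exercised without DNS.
    """
    try:
        hostname, _aliases, _addresses = resolver(ip)
    except (socket.herror, socket.gaierror) as exc:
        return False, str(exc)
    return True, hostname


if __name__ == "__main__":
    # Placeholder address: replace with the IPs of the three compute servers.
    for ip in ["127.0.0.1"]:
        ok, detail = check_reverse_lookup(ip)
        print(ip, "OK" if ok else "FAILED", detail)
```

Running it at the moment the two servers stop receiving jobs, and again after the LSF restart, may show whether the operating system's answer changes over the 24 hours.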
--
Greg Wootton | Principal Systems Technical Support Engineer
MariaD
Barite | Level 11

Thanks, @gwootton. In fact, last week we had a problem with reverse DNS for these two machines, but it was resolved. When the error appeared, we tried ping -a against these servers, and it looks like they resolve fine.

 

Today, when the services were down, we tried adding a hosts file in the /.../Platform_Suite/lsf folder. The hosts file was created as the lsfadmin user, and we added the 3 compute servers to it. But after we restarted the LSF services, nothing worked. The order was: stop the LSF services on the 3 servers, stop the Object Spawner on the 3 servers, create the hosts file in the folder mentioned, then start LSF and the Object Spawner on each server. The logs don't show any errors. So we decided to delete the hosts file we had created and restarted the LSF services again, and it worked again (at least for the next 24h, 😩).

 

Did we do anything wrong, or do it in the wrong order?

 

Looking at the LSF config files, the only thing we found is that the master server list (the order of the server names) is not exactly the same in lsf.conf and ego.conf. Could this be related to our problem?

 

Regards, 

gwootton
SAS Super FREQ
The error message you've provided is complaining about reverse DNS lookup.

I'm thinking that maybe the LSF controller (responsible for sending jobs to hosts) only sees itself as an eligible host after the DNS cache expires and it can no longer reverse lookup the other hosts. I suspect during a failure event if you were to run lshosts or bhosts it would show the other hosts as unavailable or closed.

The hosts file should be stored in the LSF_CONFDIR location, which is typically called "conf", so you may have stored the file in the wrong place.
https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=files-hosts

If both LSF_MASTER_LIST and EGO_MASTER_LIST are defined, LIM will use EGO_MASTER_LIST if EGO is enabled, and LSF_MASTER_LIST if EGO is disabled, but they should match.
https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=lsfconf-lsf-master-list
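
A quick way to compare the two lists is sketched below. It assumes both files use plain KEY="host1 host2" lines; the helper name read_list and the host names in the sample text are made up for illustration, so in practice you would read the contents of your own lsf.conf and ego.conf instead.

```python
import re


def read_list(conf_text, key):
    """Return the host names assigned to key in KEY="a b c" style text."""
    for line in conf_text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # skip commented-out settings
        match = re.match(rf'{re.escape(key)}\s*=\s*"?([^"#]*)"?', line)
        if match:
            return match.group(1).split()
    return []


# Hypothetical file contents, for illustration only.
lsf_conf = 'LSF_MASTER_LIST="nodeA nodeB nodeC"\n'
ego_conf = 'EGO_MASTER_LIST="nodeB nodeA nodeC"\n'

lsf_hosts = read_list(lsf_conf, "LSF_MASTER_LIST")
ego_hosts = read_list(ego_conf, "EGO_MASTER_LIST")

if lsf_hosts == ego_hosts:
    print("Master lists match")
else:
    print("Master lists differ:", lsf_hosts, "vs", ego_hosts)
```

The comparison is order-sensitive on purpose, since the difference reported here is in the order of the server names.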
--
Greg Wootton | Principal Systems Technical Support Engineer
MariaD
Barite | Level 11

Hi @gwootton, thanks for the explanation. Yes, you're right, we placed the hosts file in the wrong path. We expect to run the test again tomorrow night, and we'll let you know after that.

 

Regards, 
