sasprofile
Quartz | Level 8

Hi Friends -

 

I need your help with an issue in our SAS Grid setup.


We have three machines: the Grid Control Server, Node 1, and Node 2.


Following the document below, I completed all the steps up to Chapter 3. All daemons are up and running, and all LSF commands worked correctly on the Master Grid Server, but I have issues on the Grid Nodes. SAS Grid is installed and configured in an NFS shared directory.

 

http://support.sas.com/rnd/scalability/grid/PSS10.1_Unix_Install.pdf

 

From the "Chapter 4 - Installing and Configuring LSF on Grid Nodes, SAS Foundation Grid Clients or UNIX" section, I performed the steps below to make the Grid Node work.

 

I am not sure whether the steps I followed are correct, or whether I missed anything.

 

Logged in as root on the Grid Node.


Added the Grid Node host name to LSF_CONFDIR/lsf.cluster.cluster_name.


Changed into the shared <LSF_TOP>/10.1/install directory.

Ran the following command to set up the proper initialization files for future reboots:

./hostsetup --top="/usr/share/lsf" --boot="y" --profile="y" --start="y"


Ran the following two commands on the grid control node to make the new node known:


lsadmin reconfig
badmin reconfig

 

I got the error below:

 

[root@kkg-node1 install]# lsadmin reconfig
Restart only the master candidate hosts? [y/n] n
ls_gethostinfo: LIM is down; try later
Operation aborted
[root@kkg-node1 install]# lsadmin reconfig
Restart only the master candidate hosts? [y/n] y
Restart LIM on <kkg-master.corp.abc.com> ...... ls_limcontrol: Communication time out
[root@JS-node1 install]# lsadmin reconfig
Restart only the master candidate hosts? [y/n] y
Restart LIM on <kkg-master.corp.abc.com> ...... ls_limcontrol: Communication time out


Since the commands did not work on Node 1, I then ran them on the Grid Master server.

The commands below worked fine on the Master Grid server:

 

lsadmin reconfig
badmin reconfig

 

After I ran the above commands on the Master Grid server for the Node 1 setup, all daemons worked correctly on Node 1. Then I ran the commands below to start LSF on the new host:

# lsadmin limstartup
# lsadmin resstartup
# badmin hstartup


All the daemons started running.

 

Then I went back to Node 1, sourced the profile, and ran the LSF commands, but they did not work and gave the errors below.
I am not sure where I am making a mistake or which steps I have missed. Please advise.

 

/kkgshare/lsf/conf

[lsfadmin@kkg-node1 conf]$ source ./profile.lsf
[lsfadmin@kkg-node1 conf]$ lsload
lsinfo: Error 10027
[lsfadmin@kkg-node1 conf]$ lsid
lsid: ls_getentitlementinfo() failed: Error 10027
[lsfadmin@kkg-node1 conf]$ lshosts
ls_gethostinfo(): Error 10027

 

 

 

Thank you all in advance

13 REPLIES
doug_sas
SAS Employee

Is the path to the share the same on all nodes? Can you see the node in the lshosts, lsload, and bhosts commands run from the master?

sasprofile
Quartz | Level 8

Thank you for your prompt reply

Yes, this directory is shared across all nodes.

Yes, I can see the node in the output of the lshosts, lsload, and bhosts commands run from the master, but it is shown as unavailable and unknown.

The servers can ping each other in both directions.

 
 
 
 
doug_sas
SAS Employee

Is the shared directory mounted in the same location on all nodes? It must be for the grid to work.

You cannot run lsadmin and badmin on the newly added node because that node is not part of the grid yet, so the command would be rejected. That is why it worked on the master node.

 

The hostsetup command should have started the daemons on the node. Check that the daemons are running on the node and that no errors are showing up in the <LSF_TOP>/log directory.

 

Hopefully the nodes can easily talk to one another using their names.
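One way to check this is to ask each machine to resolve every grid host name. The sketch below is only an illustration: `check_hosts` is a hypothetical helper, the host names in the usage comment are copied from the output earlier in this thread, and it assumes `getent` is available (it ships with glibc on most Linux systems).

```shell
# Hypothetical helper: report whether each host name resolves on this machine.
# getent uses the name sources configured in nsswitch.conf (/etc/hosts, DNS, ...).
check_hosts() {
    rc=0
    for h in "$@"; do
        addr=$(getent hosts "$h" | awk 'NR == 1 { print $1 }')
        if [ -n "$addr" ]; then
            echo "OK   $h -> $addr"
        else
            echo "FAIL $h does not resolve"
            rc=1
        fi
    done
    return $rc
}

# Example, to be run on the master and on each node:
# check_hosts kkg-master.corp.abc.com kkg-node1.abc.com
```

If a name resolves on one machine but not on another, or resolves to different addresses, that would be worth fixing before looking at LSF itself.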

sasprofile
Quartz | Level 8

Yes, the lim, res, and sbatchd daemons are running.

When I run the df command, the shared directory appears as mounted on both nodes, whereas on the Master the same shared directory does not show up in the df output. When I look under root (cd /), I can see the shared directory there.
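To compare what each machine actually sees at that path, a small sketch like the following may help. `mount_of` is a hypothetical helper name, and the path in the usage comment is the one that appears elsewhere in this thread. It relies only on `df -P`, which prints the file system that holds a given path.

```shell
# Hypothetical helper: print "source mountpoint" for the file system holding a path.
# With -P, df prints one header line and then one line per file system;
# the mount point is the last field (assuming it contains no spaces).
mount_of() {
    df -P "$1" | awk 'NR == 2 { print $1, $NF }'
}

# Example, to be run on the master and on each node, then compared:
# mount_of /kkgshare/lsf
```

If the nodes report an NFS source such as `server:/export` while the master reports a local device mounted on `/`, the directory on the master is probably a local directory (possibly the exported one) rather than an NFS mount. That is an inference to confirm with your system administrators, not something this output proves by itself.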

 

Below is the error I see in the Node 1 LIM log under /lsf/log:

Feb 28 03:12:48 2020 11844 4 3.4.0 periodic: host kkg-node1.abc.com cannot join the cluster; retry 20 times fails
Feb 28 03:12:48 2020 11844 3 3.4.0 periodic: LIM has exited due to a fatal error.
Feb 28 03:30:16 2020 11963 4 3.4.0 periodic: host kkg-node1.abc.com cannot join the cluster; retry 20 times fails
Feb 28 03:30:16 2020 11963 3 3.4.0 periodic: LIM has exited due to a fatal error.
Feb 28 10:43:10 2020 12722 4 3.4.0 periodic: host kkg-node1.abc.com cannot join the cluster; retry 20 times fails
Feb 28 10:43:10 2020 12722 3 3.4.0 periodic: LIM has exited due to a fatal error.

Below is the RES daemon error:

Feb 28 03:01:19 2020 11847 3 10.1 init_AcceptSock: Failed to create socket for RES : Address already in use
Feb 28 03:01:19 2020 11847 3 10.1 Exiting
Feb 28 03:18:41 2020 11966 3 10.1 init_AcceptSock: Failed to create socket for RES : Address already in use
Feb 28 03:18:41 2020 11966 3 10.1 Exiting
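When the same messages repeat many times, it can help to collapse them into counts. The sketch below assumes every log line has the layout shown above (month, day, time, year, PID, level, version, then the message). `summarize_lsf_log` is a hypothetical helper, and the file name pattern in the usage comment is an assumption about how the log files are named.

```shell
# Hypothetical helper: count distinct messages in LSF daemon log files.
# The first seven fields (timestamp, PID, level, version) are blanked out
# so that identical messages logged at different times are grouped together.
summarize_lsf_log() {
    awk '{
        $1 = $2 = $3 = $4 = $5 = $6 = $7 = ""
        sub(/^ +/, "")
        if ($0 != "") count[$0]++
    }
    END { for (msg in count) print count[msg], msg }' "$@" | sort -rn
}

# Example (file name pattern is an assumption):
# summarize_lsf_log /lsf/log/lim.log.*
```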

doug_sas
SAS Employee

Is something already running and using the LSF ports defined in the lsf.conf file?

doug_sas
SAS Employee
You cannot start LSF daemons using ports that are already in use by something else. Also make sure the ports are open through any firewalls between the nodes in the grid.
sasprofile
Quartz | Level 8

To be honest, I am not sure about this. How can I check whether something is already running and using the LSF ports defined in the lsf.conf file?

Please advise.

doug_sas
SAS Employee

I am not a Linux admin type, but googling shows

  sudo lsof -i -P -n | grep LISTEN

or
  netstat -tulpn | grep LISTEN

 

and look for the ports defined in lsf.conf. The defaults would be something like

# Other variables
LSF_LIM_PORT=7869
LSF_RES_PORT=6878
LSB_MBD_PORT=6881
LSB_SBD_PORT=6882

# Enable mbd query child
LSB_QUERY_PORT=6891
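As a rough sketch of how the two steps could be combined, the functions below read the port settings from an lsf.conf-style file and check each one against the listening sockets on the current host. Both function names are hypothetical. The check assumes the `ss` tool from iproute2 is installed, and that each port line has the plain `NAME=number` form shown above.

```shell
# Hypothetical helper: print NAME=PORT for every LSF/LSB *_PORT setting in a file.
lsf_ports() {
    grep -E '^[[:space:]]*(LSF|LSB)_[A-Z_]*PORT=' "$1" | sed 's/^[[:space:]]*//'
}

# Hypothetical helper: for each port found, report whether a TCP or UDP
# socket is listening on it on this host. Assumes ss is installed.
check_ports() {
    listening=$(ss -tuln) || return 1
    lsf_ports "$1" | while IFS='=' read -r name port; do
        if printf '%s\n' "$listening" | grep -Eq ":${port}[[:space:]]"; then
            echo "$name ($port): in use on this host"
        else
            echo "$name ($port): nothing listening"
        fi
    done
}

# Example (path taken from this thread):
# check_ports /kkgshare/lsf/conf/lsf.conf
```

A port reported as in use while the LSF daemons are stopped would point to another process holding it. If the daemons are running, the LSF ports are expected to show as in use.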

 

sasprofile
Quartz | Level 8

I ran both commands, and also some other commands I found on Google, but they all give the error "command not found".

 

I found the same port numbers in lsf.conf on both the Master and the Node:

 

 

# Other variables
LSF_LIM_PORT=7869
LSF_RES_PORT=6878
LSB_MBD_PORT=6881
LSB_SBD_PORT=6882

# Enable mbd query child
LSB_QUERY_PORT=6891

 

AnandVyas
Ammonite | Level 13
ps -ef | grep 7869

Look for a line like
tcp 0 0 0.0.0.0:7869 0.0.0.0:* LISTEN

That means the port is already in use. You will need sudo access to get the PID of the process using it.

sudo netstat -tulpn | grep 7869
sasprofile
Quartz | Level 8

I tried with sudo, but it still says command not found.

AnandVyas
Ammonite | Level 13
I would suggest checking the section "Verifying the Network Setup" in the document you shared in the original post. If the commands are not working or are not installed at the OS level, I would suggest seeking help from your system administrators.
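While waiting for the administrators, a fallback that needs no extra packages is sketched below, for Linux only. "command not found" usually means the tool is not installed (netstat, for example, comes from the net-tools package) or is not on the PATH. All three function names are hypothetical. The last one reads /proc/net/tcp, which lists IPv4 TCP sockets only, so it does not cover IPv6 or UDP.

```shell
# Hypothetical helper: print the first of the given tools found on the PATH.
first_available() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    return 1
}

# Convert a decimal port to the 4-digit uppercase hex used in /proc/net/tcp.
port_hex() {
    printf '%04X\n' "$1"
}

# Succeed if an IPv4 TCP socket in LISTEN state (st column 0A) is bound to the port.
# The optional second argument names a file in /proc/net/tcp format.
tcp_listening() {
    hex=$(port_hex "$1")
    awk -v p=":$hex" '
        $4 == "0A" && substr($2, length($2) - 4) == p { found = 1 }
        END { exit !found }
    ' "${2:-/proc/net/tcp}"
}

# Example:
# first_available ss netstat lsof || echo "no socket tool on PATH"
# tcp_listening 7869 && echo "port 7869 is in use"
```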

