woo
Barite | Level 11

We have SAS 9.4 in a grid environment (Linux).

 

When I try to run a job in batch mode (sas test.sas) and dispatch it to a specific host (amdusa.company.com) using the statement below, it does not run on amdusa.company.com but on a different grid node/server. In addition, no other user's jobs are being dispatched to this host.

 

options metaserver=amdusa.company.com metaport=12345 metauser=userid metapass=xxx;
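Side note for readers: the METASERVER, METAPORT, METAUSER, and METAPASS options only tell the SAS session which metadata server to connect to; they do not control which grid node LSF dispatches a batch job to. If the goal is simply to pin one batch job to one host, a minimal LSF-level sketch (assuming jobs are submitted through bsub directly; any site-specific queue options are omitted) would be:

bsub -m "amdusa.company.com" sas test.sas

The -m option restricts dispatch to the named host(s).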

 

The amdusa.company.com host is properly defined in lsb.hosts as well as in the LSF_MASTER_LIST and LSF_SERVER_HOSTS parameters in lsf.conf. Also, the "bhosts" command shows "ok" status for host "amdusa.company.com".
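For reference, these are the standard LSF commands for double-checking how the cluster sees a host (output details vary by LSF version):

bhosts amdusa.company.com    # batch view of the host; STATUS should be "ok"
lshosts amdusa.company.com   # static attributes and resources as LIM sees them
lsload amdusa.company.com    # current load indices; a busy or closed host shows up here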

 

The job runs fine locally with "./sas -nodms" on amdusa.company.com.

 

I have checked everything around this, but nothing looks missing.


6 REPLIES
JuanS_OCS
Amethyst | Level 16 (Accepted Solution)

Hello @woo,

 

This is a great question right there, and quite interesting. I wonder, do you have a High-Availability (HA) setup in your grid environment?

 

I can easily imagine that your load balancer (physical or EGO) believes this host is down and is, somehow, moving the load from this host to another node in the grid. You could check this in EGO, in RTM, or on your physical load balancer (with your IT team).

 

Another option: perhaps you would like to check the resource configuration for this host: queue configuration, queue length, how many jobs can run, queue status (maybe full), and so on. A couple of commands for that are sketched below.
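For instance (the queue name "normal" is just a placeholder for one of your own):

bqueues -m amdusa.company.com   # which queues can dispatch jobs to this host
bqueues -l normal               # full queue definition: HOSTS, job limits, and Open/Closed status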

 

Anyway, it seems your host does not have a problem per se, since you can run SAS code on it locally.

However, if you send it as a grid job, the job is being directed to another host... and this is why I would consider, as a starting point, that the problem is either in your HA configuration (EGO, load balancer, RTM) or in how the node is registered in the grid (bhosts, lshosts, RTM).

woo
Barite | Level 11

Thanks a lot, Juan; we are looking into it. I'm not sure what you mean by HA in the environment, but we have a grid environment with about 18 to 20 servers and one metadata server. We do not have any auto-failover or standby server if the metadata server fails; we troubleshoot manually and bring it back up. Thanks -

JuanS_OCS
Amethyst | Level 16

Hi again @woo,

 

Hmmm, is amdusa.company.com a metadata server but also a grid slave node or master node?

woo
Barite | Level 11

We came across a script under our LSF directory structure that defines an environment variable with a couple of different values (some queue names). There was an "if" statement that decided where a job should go based on the user's bash profile. We created a new queue for just that specific host and tried it, and the job ran fine on that host. We then put the original script back in place, put the host back in the server master list, and it started working fine.

 

So it is possible that something went wrong when we took that host out of the master server list and put it back for maintenance. But now everything seems normal. Thanks for your help... appreciate your time. For anyone hitting the same thing, a sketch of a queue pinned to one host is below.
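This is roughly what a single-host queue looks like in lsb.queues; the queue name here is made up for illustration:

Begin Queue
QUEUE_NAME  = amdusa_only
HOSTS       = amdusa.company.com
DESCRIPTION = Test queue pinned to a single host
End Queue

After editing, reload the batch configuration with badmin reconfig.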

jklaverstijn
Rhodochrosite | Level 12

Notwithstanding the fact that you already have a solution, I'd like to share my two cents and explain how we go about similar tasks.

 

We have a job flow that needs inordinate amounts of SASWORK. We have multiple grid compute nodes; most have lots of SASWORK at insane speed, but one has twice that at ludicrous speed. We direct the jobs to that server by defining a resource called LargeWork in LSF and configuring that host as providing that resource. Then, when scheduling a job, you can add the required resource LargeWork to the schedule definition, and Process Manager will always direct that job to that specific host.

 

Resources are defined in the file lsf.shared. In lsf.cluster.cluster_name you add the resource name to the desired host(s); a sketch of both files follows. Do remember to reload the configuration afterwards with the badmin reconfig and lsadmin reconfig commands.
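Roughly like this (the host name is illustrative, and the column layout should match what your files already use):

# lsf.shared -- declare the boolean resource
Begin Resource
RESOURCENAME  TYPE     INTERVAL  INCREASING  DESCRIPTION
LargeWork     Boolean  ()        ()          (extra-large, extra-fast SASWORK)
End Resource

# lsf.cluster.cluster_name -- attach the resource to the chosen host
Begin Host
HOSTNAME              model  type  server  RESOURCES
bignode.company.com   !      !     1       (LargeWork)
End Host

An ad-hoc batch job can then request such a host with bsub -R "LargeWork"; scheduled jobs get there through the resource requirement in the schedule definition, as described above.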

 

And to second Juan's observation: do you really have a metadata server that doubles as a grid compute node?

 

And as far as your metadata server being a single point of failure: have a look at clustering it over multiple hosts. It was a life saver for us many times.

 

Regards,

- Jan.

woo
Barite | Level 11

Thanks Jan, appreciate your input. It makes perfect sense.
