woo
Lapis Lazuli | Level 10

We have SAS 9.4 in a grid environment (Linux).

 

When I try to run a job in batch mode (sas test.sas) and dispatch it to a specific host (amdusa.company.com) using the statement below, it does not run on amdusa.company.com but on a different grid node/server. On top of that, no other user's jobs are being dispatched to this host.

 

options metaserver=amdusa.company.com metaport=12345 metauser=userid metapass=xxx;
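
For reference, at the LSF level a job can be forced onto a specific host with bsub's -m option (the path to the sas executable below is just a placeholder):

    bsub -m "amdusa.company.com" -o test.log /path/to/sas test.sas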

 

The amdusa.company.com host is properly defined in lsb.hosts as well as in the LSF_MASTER_LIST and LSF_SERVER_HOSTS parameters in lsf.conf. Also, the "bhosts" command shows "ok" status for host "amdusa.company.com".
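
For completeness, these standard LSF commands show the detailed host state:

    bhosts -l amdusa.company.com    (full host status, including any reason the host is closed)
    lshosts amdusa.company.com      (host type, model, and the resources LSF sees on it)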

 

The job runs fine locally with "./sas -nodms" on amdusa.company.com.

 

I have checked things around but nothing looks missing.


6 REPLIES
JuanS_OCS
Amethyst | Level 16 (Accepted Solution)

Hello @woo,

 

this is a great question right there, and quite interesting. I wonder, do you have a High-Availability (HA) setup in your grid environment?

 

I can well imagine that your load balancer (physical or EGO) believes this host is down and is, somehow, moving the load from this host to another node in the grid. You could check this in EGO, in RTM, or on your physical load balancer (with your IT folks).
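
A quick check here is lsload, which shows how LSF itself rates the host; a status other than "ok" (for example "lockU" or "unavail") would explain jobs being sent elsewhere:

    lsload amdusa.company.com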

 

Another option: perhaps you would like to check the resource configuration for this host: queue configuration, queue length, how many jobs can run, queue status (maybe full), and so on.
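
For example, the long listing of a queue shows its status, job limits, and the hosts it can dispatch to (replace "normal" with your queue name):

    bqueues -l normal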

 

Anyway, it seems your host does not have a problem per se, since you can run SAS code on it locally.

However, if you send it as a grid job, the job is being directed to another host... and this is why I would take as a starting point that the problem is either in your HA configuration (EGO, load balancer, RTM) or in how the node is registered in the grid (bhosts, lshosts, RTM).

woo
Lapis Lazuli | Level 10

Thanks a lot Juan, we are looking into it. I am not sure what you mean by HA in the environment, but we have a grid environment with around 18 to 20 servers and one metadata server. We do not have any auto-failover or standby server if the metadata server fails; we troubleshoot manually and bring it back up. Thanks -

JuanS_OCS
Amethyst | Level 16

Hi again @woo,

 

hmmm, is amdusa.company.com a metadata server but also a grid slave node or master node?

woo
Lapis Lazuli | Level 10

We came across a script under our LSF directory structure that defines an environment variable with a couple of different values (some queue names). There was an "if" statement that decided where a job should go based on the user's bash profile. We created a new queue for just that specific host to see if the job would run on it, and it worked fine (a sketch of such a queue is below). We then put the original script back in place, put that host back in the server master list, and it started working fine.
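
For anyone curious, a minimal sketch of such a host-pinned queue in lsb.queues (the queue name and priority are made up; badmin reconfig activates the change):

    Begin Queue
    QUEUE_NAME  = amdusa_only
    PRIORITY    = 50
    HOSTS       = amdusa.company.com
    DESCRIPTION = "Dispatches only to amdusa.company.com"
    End Queue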

 

So it could be that something went wrong when we took that host out of the master server list and put it back for maintenance purposes. But now everything seems normal. Thanks for your help...appreciate your time.

jklaverstijn
Rhodochrosite | Level 12

Notwithstanding the fact that you already have a solution, I'd like to share my 2 cents and explain how we go about similar tasks.

 

We have a job flow that needs inordinate amounts of SASWORK. We have multiple grid compute nodes. Most have lots of SASWORK at insane speed, but one has twice that at ludicrous speed. We direct the jobs to that server by defining a resource called LargeWork in LSF and configuring that host as providing that resource. Then, when scheduling a job, you can add the required resource LargeWork to the schedule definition and Process Manager will always direct that job to that specific host.

 

Resources are defined in the file lsb.shared. In lsf.cluster.cluster_name you add the resource name to the desired host(s). Do remember to reconfigure the daemons with the badmin reconfig and lsadmin reconfig commands.
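
A minimal sketch of both files (the host name bignode.company.com is made up; LargeWork is a Boolean resource, so a host either provides it or it doesn't):

    # lsb.shared
    Begin Resource
    RESOURCENAME  TYPE     INTERVAL INCREASING DESCRIPTION
    LargeWork     Boolean  ()       ()         (Very large, very fast SASWORK)
    End Resource

    # lsf.cluster.cluster_name
    Begin Host
    HOSTNAME             model  type  server  RESOURCES
    bignode.company.com  !      !     1       (LargeWork)
    End Host

After the reconfig, an ad-hoc job can also request the resource directly with bsub -R "select[LargeWork]".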

 

And to second Juan's observation: do you really have a metadata server that doubles as a grid compute node?

 

And as far as your metadata server being a single point of failure: have a look at clustering it over multiple hosts. It was a lifesaver for us many times.

 

Regards,

- Jan.

woo
Lapis Lazuli | Level 10 woo
Lapis Lazuli | Level 10

Thanks Jan, appreciate your input. It makes perfect sense.
