jklaverstijn
Rhodochrosite | Level 12

We use IBM LSF 9.1.3 for our batch scheduling and grid management. Now, for the second time (in months, across 7 different environments, so it's a rare event), I have noticed that the job id sequence gets reset to 1. The job id is what's found in the environment variable LSB_JOBID and displayed by commands like bjobs and bhist.

 

For LSF itself this is not the end of the world, but we collect job data for monitoring and reporting. The job id is part of the key of many tables. It is assumed to be unique, but now we end up with duplicate keys, which throws our monitoring off.

 

I have gone through the logs and do see errors in the mbatchd and sbatchd logs, mainly network related. These errors seem correlated, given the times they occur, but I fail to come up with causality.

 

Another noteworthy thing is that the current lsb.events file starts at the time of the reset. It no longer connects with the historic version lsb.events.1. Looks like lsb.events was reset as well.

 

Has anyone else seen this before? And is anyone familiar with the mechanism behind how the next job id is determined and how we can explain this phenomenon?

 

Many thanks in advance,

-- Jan.

JuanS_OCS
Amethyst | Level 16

Hello @jklaverstijn,

 

Not sure if this will help you, since I have no solution or explanation, but perhaps my experience triggers something here, somehow. If nobody can answer here, I guess a ticket to SAS Technical Support, and their question to IBM, can solve the problem. Just stating the obvious.

 

I have seen that behaviour before, actually happening more often than you describe, but perhaps that was because it was on previous versions of LSF. Hence, I pretty much discarded monitoring and reporting based on the job ids provided by LSF.

 

I do like solutions that are as simple as possible, for many reasons. Therefore, in that case years ago, I instead included a few simple macros as pre- and post-job executions, and they did the job pretty well, sometimes even better than what LSF could provide.
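To make that idea concrete: this is not my actual code, just a minimal sketch of a pre/post pair in the same spirit. The macro names, the MONLIB library and the JOB_LOG table are assumptions for illustration only; the key is built from the start time, the SAS process id (&SYSJOBID) and the host name, independent of LSB_JOBID.

%macro job_pre;
  %global jobkey;
  /* our own key: [this second][this pid][on this host], independent of LSB_JOBID */
  %let jobkey = %sysevalf(%sysfunc(datetime()), floor)_&sysjobid._&syshostname;

  data work._job_event;
    length jobkey $ 80 jobname $ 200 event $ 5;
    jobkey = "&jobkey";
    if envlen('LSB_JOBNAME') > 0 then jobname = sysget('LSB_JOBNAME'); /* keep flow info */
    event = 'START';
    ts = datetime();
    format ts datetime20.;
  run;

  /* PROC APPEND creates the base table on first use */
  proc append base=monlib.job_log data=work._job_event; run;
%mend job_pre;

%macro job_post;
  data work._job_event;
    length jobkey $ 80 jobname $ 200 event $ 5;
    jobkey = "&jobkey";
    if envlen('LSB_JOBNAME') > 0 then jobname = sysget('LSB_JOBNAME');
    event = 'END';
    ts = datetime();
    format ts datetime20.;
  run;

  proc append base=monlib.job_log data=work._job_event; run;
%mend job_post;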

 

Nowadays, I look towards other tools to monitor the jobs, and I follow their developments as closely as I can. A company that catches my interest is Boemska (so perhaps @boemskats can tell you more about his ideas).

boemskats
Lapis Lazuli | Level 10

Thanks for the mention Juan! So, for better or for worse, I'm far, far, far from an LSF expert (or fan :/). But with that in mind, here's how we approach the jobID / unique key issue:

 

We tend to ignore the LSB_JOBID variable entirely, although we pick up the $LSB_JOBNAME var so that we can record the flow/subflow info for each session, which in turn allows us to visualise jobs quite nicely using drillable treemaps. Instead of using LSF-generated IDs, we generate a 'UID' for each session by sourcing our esmconfig.sh file and generating a variable whenever each lev's appservercontext_env_usermods.sh file is sourced. Historically this was a variant of something like `export ESMGUID=$(date +'%s')$$_$(hostname)`, which near enough ensured a unique key for each session started from each server in our GRID; this creates a unique key of [this second][this pid][on this host].

What we would then do inside our autoexec code is this:

 

newguid = "&ESMGUID";
if envlen('ESMPARENTUUID') > 0 then do; 
  /* This suggests that this variable was set by a parent session so this is a child */
  currentguid = sysget('ESMPARENTUUID');
  newUUIDstring = cats(currentguid, '-', newguid);
  call symputx('ESMJOBUUID',newUUIDstring);
end;
else do; 
  /* This suggests that this session is the main job that collects all the RCs */
  call symputx('ESMJOBUUID',newguid);
end;

options set=ESMPARENTUUID="&ESMJOBUUID.";

 

What this gives us is a good way of guessing, within the autoexec, whether the process that just started is a 'parent' session for a job, or a child (GRID) subsession, the performance data for which should be reconciled with that of the parent session. It's the best mechanism we have, so far, of building a tree of GRID subsessions (by parsing on the '-' character separators), while maintaining a linkable, unique UUID for each session.
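As a toy illustration of that parsing (not part of the original setup; it assumes '-' only ever appears as the chain separator, and uses a made-up two-level value):

data _null_;
  length jobuuid ownguid parentchain $ 200;
  /* made-up example: parent GUID, then this (child) session's GUID */
  jobuuid = '1520000000_1234_node1-1520000360_5678_node2';
  ownguid = scan(jobuuid, -1, '-');                                        /* this session */
  parentchain = substr(jobuuid, 1, length(jobuuid) - length(ownguid) - 1); /* ancestors    */
  putlog ownguid= parentchain=;
run;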

 

Finally, worth mentioning that last year we moved away from using `ESMGUID=$(date +'%s')$$_$(hostname)` as our identifier generator, instead using a 1-liner java UUIDgen program to generate something that's guaranteed to be unique. Less human readable than the old format and no longer sortable by date, but it increases the chances of 'uniqueness' further (considerably); it's unlikely that a kernel would assign the same pid to two processes on the same host within the same second, but it's even more unlikely that UUID duplication would occur on that same host, ever.
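Not the Java helper itself (which isn't shown here), but as an illustration of the same idea: SAS can also produce an RFC 4122 UUID natively with the UUIDGEN function, so a comparable key could be generated from inside the session, for example:

/* illustration only: a native-SAS alternative to an external UUID generator */
data _null_;
  call symputx('ESMGUID', uuidgen());
run;

%put NOTE: session GUID is &ESMGUID;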

 

Don't know how much this helps you, but it's an answer 🙂


Nik

bheinsius
Lapis Lazuli | Level 10

Hi Jan,

 

I recall there is a maximum value for LSB_JOBID, and when the counter surpasses that value it restarts at 1.

I searched the internet but could not find documentation for it, so I may be completely wrong.

 

Regards,

Bart

Resa
Pyrite | Level 9

Hi @jklaverstijn

 

If the reset was caused by the fact that the maximum job ID was reached, as indicated by @bheinsius, then you could try to increase the maximum job ID value.

 

According to the documentation (here is the info for LSF 9.1.3) you can set a MAX_JOB_ID value in:

<LSF_CONFDIR>/<cluster name>/configdir/lsb.params

By default it is set to 999999 but you can increase this significantly.
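For example, a sketch of what that could look like (the value shown is illustrative; check the lsb.params reference for your LSF version, and a badmin reconfig is needed for the change to take effect):

# <LSF_CONFDIR>/<cluster name>/configdir/lsb.params
Begin Parameters
# ... existing parameters ...
MAX_JOB_ID = 9999999
End Parameters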

 

If the reset was caused by another reason, the solution as described by @boemskats might be the way to go.

 

Kind regards,

--Resa

jklaverstijn
Rhodochrosite | Level 12

Hi @Resa and @bheinsius, thanks for pointing out the max_jobid setting. Unfortunately this was not the culprit. This is a fairly recent cluster where the lsb_jobid had not even reached 30000 yet. Otherwise it would have been too easy.

 

I am not yet at a point where I am willing to start programming to adapt my code to a counter reset, which is, as I mentioned, a rare event. I know our batch monitor (DIMON from EOM Data) is deployed at other sites, and no one has felt the need to do this as far as I know. Combining lsb_jobid and job_start_datetime as the primary key, or adding a surrogate key as @boemskats suggested, would be safer, but the impact on the code and data model is non-trivial.

 

And even then, I would still be left with the nagging feeling that something happened and can happen again that I cannot explain or prevent.

 

Think I will track this with SAS Support. They have a line open to IBM for LSF stuff. Will keep you posted if anything useful comes out of that.

 

Regards, Jan.

Resa
Pyrite | Level 9
Hi @jklaverstijn
I fully understand what you indicated.
Even when programming around the reset you’ll still be working around the symptom without understanding the root cause.

Curious if a root cause can be found and whether or not the reset can be prevented (without having reached the maximum job ID value).

—Resa
