<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: LSF Job ID gets reset in Administration and Deployment</title>
    <link>https://communities.sas.com/t5/Administration-and-Deployment/LSF-Job-ID-gets-reset/m-p/429469#M11999</link>
    <description>Hi &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/12460"&gt;@jklaverstijn&lt;/a&gt;&lt;BR /&gt;I fully understand what you indicated. &lt;BR /&gt;Even when programming around the reset you’ll still be working around the symptom without understanding the root cause. &lt;BR /&gt;&lt;BR /&gt;Curious if a root cause can be found and whether or not the reset can be prevented (without having reached the maximum job ID value). &lt;BR /&gt;&lt;BR /&gt;—Resa</description>
    <pubDate>Sun, 21 Jan 2018 14:47:31 GMT</pubDate>
    <dc:creator>Resa</dc:creator>
    <dc:date>2018-01-21T14:47:31Z</dc:date>
    <item>
      <title>LSF Job ID gets reset</title>
      <link>https://communities.sas.com/t5/Administration-and-Deployment/LSF-Job-ID-gets-reset/m-p/428417#M11927</link>
      <description>&lt;P&gt;We use IBM LSF 9.1.3 for our batch scheduling and grid management. Now for the second time (in months, over 7 different environments, so it's a rare event) I have noticed that the job id sequence gets reset to 1. The job id is what's found in the environment variable LSB_JOBID and displayed by commands like bjobs and bhist.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For LSF itself this is not the end of the world but we collect job data for monitoring and reporting. The job id is part of the key of many tables. It is assumed to be unique but now we end up with duplicate keys which throws our monitoring off.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I have gone through the logs and do see errors in the mbatchd and sbatchd logs, mainly network related. These errors seem correlated with the reset, given the time at which they occur, but I fail to establish causality.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Another noteworthy thing is that the current lsb.events file starts at the time of the reset. It no longer connects with the historic version lsb.events.1. Looks like lsb.events was reset as well.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Has anyone else seen this before? And is anyone familiar with the mechanism behind how the next job id is determined and how we can explain this phenomenon?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Many thanks in advance,&lt;/P&gt;
&lt;P&gt;-- Jan.&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jan 2018 15:27:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Administration-and-Deployment/LSF-Job-ID-gets-reset/m-p/428417#M11927</guid>
      <dc:creator>jklaverstijn</dc:creator>
      <dc:date>2018-01-17T15:27:24Z</dc:date>
    </item>
    <item>
      <title>Re: LSF Job ID gets reset</title>
      <link>https://communities.sas.com/t5/Administration-and-Deployment/LSF-Job-ID-gets-reset/m-p/428429#M11928</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/12460"&gt;@jklaverstijn&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;not sure if it will help you, since I have no solution/explanation, but perhaps my experience triggers something here, somehow. I guess, if nobody can answer here, a ticket to SAS Technical Support and their question to IBM can solve that problem. Just stating the obvious.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I have seen that behaviour before, actually happening more often than you describe, but perhaps that was because it was on previous versions of LSF. Hence, I pretty much abandoned monitoring and reporting based on the job ids provided by LSF.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I like solutions to be as simple as possible, for many reasons. So instead, in that case years ago, I included a few simple macros as pre- and post-job executions, and they did the job pretty well, sometimes even better than what LSF could provide.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Nowadays, I do look towards other tools to monitor the jobs, and I follow closely their developments, as much as I can.&amp;nbsp;A company that catches my interest nowadays is Boemska ( so perhaps&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/46760"&gt;@boemskats&lt;/a&gt;&amp;nbsp;can tell you more about his ideas&amp;nbsp;).&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jan 2018 15:39:51 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Administration-and-Deployment/LSF-Job-ID-gets-reset/m-p/428429#M11928</guid>
      <dc:creator>JuanS_OCS</dc:creator>
      <dc:date>2018-01-17T15:39:51Z</dc:date>
    </item>
    <item>
      <title>Re: LSF Job ID gets reset</title>
      <link>https://communities.sas.com/t5/Administration-and-Deployment/LSF-Job-ID-gets-reset/m-p/428948#M11948</link>
      <description>&lt;P&gt;Thanks for the mention Juan! So, for better or for worse, I'm far, far, far from an LSF expert (or fan :/). But with that in mind, here's how we approach the jobID / unique key issue:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We tend to ignore the LSB_JOBID variable entirely, although we pick up the $LSB_JOBNAME var so that we can record the flow/subflow info for each session, which in turn allows us to visualise jobs quite nicely using drillable treemaps. Instead of using LSF generated IDs, we generate a 'UID' for each session by sourcing our esmconfig.sh file and generating a variable whenever each lev's appservercontext_env_usermods.sh file is sourced. Historically this was a variant of something like &lt;FONT face="courier new,courier"&gt;export ESMGUID=$(date +'%s')$$_$(hostname)&lt;/FONT&gt;, which near enough ensured a unique key for each session started from each server in our GRID; this creates a unique key of &lt;FONT face="courier new,courier"&gt;[this second][this pid][on this host]&lt;/FONT&gt;. What we would then do inside our autoexec code is this:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Assumes ESMGUID is already available as a macro variable */
data _null_;
  newguid = "&amp;amp;ESMGUID";
  if envlen('ESMPARENTUUID') &amp;gt; 0 then do;
    /* This suggests that this variable was set by a parent session so this is a child */
    currentguid = sysget('ESMPARENTUUID');
    newUUIDstring = cats(currentguid, '-', newguid);
    call symputx('ESMJOBUUID',newUUIDstring);
  end;
  else do;
    /* This suggests that this session is the main job that collects all the RCs */
    call symputx('ESMJOBUUID',newguid);
  end;
run;

/* Placed after the step so that ESMJOBUUID is resolved once it has been set */
options set=ESMPARENTUUID="&amp;amp;ESMJOBUUID.";&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;What this gives us is a good way of guessing, within the autoexec, whether the process that just started is a 'parent' session for a job, or a child (GRID) subsession, the performance data for which should be reconciled with that of the parent session. It's the best mechanism we have, so far, of building a tree of GRID subsessions (by parsing on the '-' character separators), while maintaining a linkable, unique UUID for each session.&lt;/P&gt;
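&lt;P&gt;For readers who want to try the chaining idea outside SAS, here is a minimal shell sketch of the same logic. The function name esm_chain_id is hypothetical and uname -n stands in for hostname; only ESMPARENTUUID and the '-' separator come from the description above.&lt;/P&gt;

```shell
# Hypothetical helper: prints "parent-child" when ESMPARENTUUID is set,
# otherwise just the id of this (top-level) session.
esm_chain_id() {
  # [this second][this pid]_[this host]; uname -n is used in place of hostname
  new_id="$(date +%s)$$_$(uname -n)"
  if [ -n "${ESMPARENTUUID:-}" ]; then
    printf '%s-%s\n' "$ESMPARENTUUID" "$new_id"
  else
    printf '%s\n' "$new_id"
  fi
}

# Export the result so that any child session sees this id as its parent.
ESMPARENTUUID="$(esm_chain_id)"
export ESMPARENTUUID
```

&lt;P&gt;One caveat with this sketch: a host name that itself contains '-' would confuse any later parsing on the '-' separator, so a different separator may be the safer choice on such hosts.&lt;/P&gt;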
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Finally, it's worth mentioning that last year we moved away from using &lt;FONT face="courier new,courier"&gt;ESMGUID=$(date +'%s')$$_$(hostname)&lt;/FONT&gt; as our identifier generator, and instead use a one-line Java UUID generator program to produce something that is, for all practical purposes, unique. It's less human readable than the old format and no longer sortable by date, but it increases the chances of uniqueness considerably: it's unlikely that a kernel would assign the same pid to two processes on the same host within the same second, but it's even more unlikely that UUID duplication would ever occur on that host.&lt;/P&gt;
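&lt;P&gt;The Java program itself is not shown here. As a rough shell equivalent, and purely as an assumption about what such a generator could look like, the sketch below uses the uuidgen utility when it is installed and falls back to the Linux kernel's random UUID file. The function name new_session_uuid is hypothetical. Hyphens are stripped so that '-' stays free as the parent/child separator.&lt;/P&gt;

```shell
# Hypothetical generator: prints a random UUID as 32 lowercase hex digits.
new_session_uuid() {
  if [ -n "$(command -v uuidgen)" ]; then
    # uuidgen prints uppercase on some platforms, so normalise the case
    uuidgen | tr -d '-' | tr 'A-F' 'a-f'
  elif [ -r /proc/sys/kernel/random/uuid ]; then
    # Linux-only fallback: each read returns a fresh random UUID
    tr -d '-' | cat /proc/sys/kernel/random/uuid | tr -d '-'
  else
    return 1
  fi
}

ESMGUID="$(new_session_uuid)"
export ESMGUID
```

&lt;P&gt;As noted above, the trade-off is that such an id no longer sorts by date, so a start timestamp has to be stored alongside it if ordering matters.&lt;/P&gt;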
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Don't know how much this helps you, but it's an answer &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;Nik&lt;/P&gt;</description>
      <pubDate>Thu, 18 Jan 2018 22:35:43 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Administration-and-Deployment/LSF-Job-ID-gets-reset/m-p/428948#M11948</guid>
      <dc:creator>boemskats</dc:creator>
      <dc:date>2018-01-18T22:35:43Z</dc:date>
    </item>
    <item>
      <title>Re: LSF Job ID gets reset</title>
      <link>https://communities.sas.com/t5/Administration-and-Deployment/LSF-Job-ID-gets-reset/m-p/429178#M11979</link>
      <description>&lt;P&gt;Hi Jan,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I recall there is a maximum value of LSB_JOBID and when it surpasses that value it restarts at 1.&lt;/P&gt;&lt;P&gt;I searched on the internet but could not find documentation though; I may be completely wrong.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Bart&lt;/P&gt;</description>
      <pubDate>Fri, 19 Jan 2018 16:37:40 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Administration-and-Deployment/LSF-Job-ID-gets-reset/m-p/429178#M11979</guid>
      <dc:creator>bheinsius</dc:creator>
      <dc:date>2018-01-19T16:37:40Z</dc:date>
    </item>
    <item>
      <title>Re: LSF Job ID gets reset</title>
      <link>https://communities.sas.com/t5/Administration-and-Deployment/LSF-Job-ID-gets-reset/m-p/429287#M11982</link>
      <description>&lt;P&gt;Hi &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/12460"&gt;@jklaverstijn&lt;/a&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If the reset was caused by the fact that the maximum job ID was reached, as indicated by &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13625"&gt;@bheinsius&lt;/a&gt;, then you could try to increase the maximum job ID value.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;According to the documentation (&lt;A href="https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_config_ref/lsb.params.max_jobid.5.html" target="_blank"&gt;here&lt;/A&gt; is the info for LSF 9.1.3) you can set a MAX_JOBID value in:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;FONT face="courier new,courier"&gt; &amp;lt;LSB_CONFDIR&amp;gt;/&amp;lt;cluster name&amp;gt;/configdir/lsb.params&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;By default it is set to 999999 but you can increase this significantly.&lt;/P&gt;
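&lt;P&gt;To check which limit a cluster is actually running with, something like the following could be used. It is only a sketch: the function name max_jobid_setting is hypothetical, and it assumes the usual KEY = value layout of lsb.params with '#' comments. When the parameter is absent it reports the default of 999999 mentioned above.&lt;/P&gt;

```shell
# Hypothetical helper: max_jobid_setting /path/to/lsb.params
# Prints the configured MAX_JOBID, or the default when it is not set.
max_jobid_setting() {
  awk -F'=' '
    /^[ \t]*#/ { next }              # skip comment lines
    {
      key = $1
      gsub(/[ \t]/, "", key)
      if (key == "MAX_JOBID") {
        val = $2
        sub(/#.*/, "", val)          # drop a trailing comment
        gsub(/[ \t]/, "", val)
        print val
        found = 1
      }
    }
    END { if (!found) print "999999" }
  ' "$1"
}
```

&lt;P&gt;It takes the full path to the lsb.params file as its only argument. After changing the value, the batch configuration still has to be reloaded by the LSF administrator before it takes effect.&lt;/P&gt;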
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If the reset was caused by another reason, the solution as described by &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/46760"&gt;@boemskats&lt;/a&gt; might be the way to go.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Kind regards,&lt;/P&gt;
&lt;P&gt;--Resa&lt;/P&gt;</description>
      <pubDate>Fri, 19 Jan 2018 21:24:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Administration-and-Deployment/LSF-Job-ID-gets-reset/m-p/429287#M11982</guid>
      <dc:creator>Resa</dc:creator>
      <dc:date>2018-01-19T21:24:46Z</dc:date>
    </item>
    <item>
      <title>Re: LSF Job ID gets reset</title>
      <link>https://communities.sas.com/t5/Administration-and-Deployment/LSF-Job-ID-gets-reset/m-p/429458#M11998</link>
      <description>&lt;P&gt;Hi &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13592"&gt;@Resa&lt;/a&gt; and &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13625"&gt;@bheinsius&lt;/a&gt;, thanks for pointing out the max_jobid setting. Unfortunately this was not the culprit. This is a fairly recent cluster where the lsb_jobid had not even reached 30000 yet. Otherwise it would have been too easy.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I am not yet at a point where I am willing to start programming to adapt my code to a counter reset, which is, as I mentioned, a rare event. I know our batch monitor (DIMON from EOM Data) is deployed at other sites and no one has felt the need to do this as far as I know. Combining lsb_jobid and job_start_datetime as the primary key, or adding a surrogate key as&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/46760"&gt;@boemskats&lt;/a&gt; suggested, would be safer, but the impact on the code and data model is non-trivial.&lt;/P&gt;
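&lt;P&gt;To make the composite key idea concrete, here is a small shell sketch. It is not how DIMON works; the function name composite_keys, the input layout (start time in epoch seconds, then job id, sorted by start time) and the sample records are all made up for illustration.&lt;/P&gt;

```shell
# Hypothetical helper: reads "start_epoch job_id" records on stdin,
# prints a jobid_start key per record and flags a drop in the job id.
composite_keys() {
  prev=0
  while read -r start_epoch job_id; do
    if [ "$job_id" -lt "$prev" ]; then
      echo "# job id dropped from $prev to $job_id: possible reset"
    fi
    printf '%s_%s\n' "$job_id" "$start_epoch"
    prev="$job_id"
  done
}

# Made-up sample: the second record arrives after a counter reset.
printf '%s\n' '1516000000 29871' '1516200000 1' | composite_keys
# 29871_1516000000
# # job id dropped from 29871 to 1: possible reset
# 1_1516200000
```

&lt;P&gt;The flag is only a hint: jobs that wait in a queue can start out of job id order, so a lower id does not always mean a reset. The composite key stays unique either way, as long as a reused job id does not start within the same second.&lt;/P&gt;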
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;And even then, I would still be left with the nagging feeling that something happened and can happen again that I cannot explain or prevent.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Think I will track this with SAS Support. They have a line open to IBM for LSF stuff. Will keep you posted if anything useful comes out of that.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Regards, Jan.&lt;/P&gt;</description>
      <pubDate>Sun, 21 Jan 2018 12:20:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Administration-and-Deployment/LSF-Job-ID-gets-reset/m-p/429458#M11998</guid>
      <dc:creator>jklaverstijn</dc:creator>
      <dc:date>2018-01-21T12:20:29Z</dc:date>
    </item>
    <item>
      <title>Re: LSF Job ID gets reset</title>
      <link>https://communities.sas.com/t5/Administration-and-Deployment/LSF-Job-ID-gets-reset/m-p/429469#M11999</link>
      <description>Hi &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/12460"&gt;@jklaverstijn&lt;/a&gt;&lt;BR /&gt;I fully understand what you indicated. &lt;BR /&gt;Even when programming around the reset you’ll still be working around the symptom without understanding the root cause. &lt;BR /&gt;&lt;BR /&gt;Curious if a root cause can be found and whether or not the reset can be prevented (without having reached the maximum job ID value). &lt;BR /&gt;&lt;BR /&gt;—Resa</description>
      <pubDate>Sun, 21 Jan 2018 14:47:31 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Administration-and-Deployment/LSF-Job-ID-gets-reset/m-p/429469#M11999</guid>
      <dc:creator>Resa</dc:creator>
      <dc:date>2018-01-21T14:47:31Z</dc:date>
    </item>
  </channel>
</rss>

