MarkESmith
Obsidian | Level 7

Hi all,

 

Recently, our LASR server went down unexpectedly. I was able to start it without issue and load tables into it. Unfortunately, I didn't think to look at the 'Last Action Log' before I manually started the LASR server.

 

Is there any log that would possibly indicate why the LASR server went down? If so, where would it be located?

 

(We are running SAS on Linux)

 

Thanks!

 

alexal
SAS Employee

@MarkESmith ,


Is the LASR server running in distributed or non-distributed mode?

MarkESmith
Obsidian | Level 7

@alexal,

 

Non-distributed mode

alexal
SAS Employee

@MarkESmith ,


How did you start the LASR server? I'm guessing from the VA Administration Console? Did you restart the object spawner before the LASR crash? If not, did you have high memory utilization on the compute tier that day?

MarkESmith
Obsidian | Level 7

Yes, I started it in the VA Administration Console. I did not restart anything before the crash, and I'm the only one who would, or could, do something like that. I have no hard data to back this up, but I see no reason why memory usage would've been greater than normal yesterday when it happened.

 

Are there no logs that would give some sort of positive indication as to what happened in a case like this?

alexal
SAS Employee

@MarkESmith ,

 

In your case, /var/log/messages, but only if the Linux kernel has killed the LASR server process. Also, I'm wondering: were there any errors in the object spawner log yesterday?
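
A quick way to check, assuming syslog writes to the usual location on your distribution:

# Search the current syslog for OOM killer activity
grep -Ei "oom-killer|out of memory|killed process" /var/log/messages
# The kernel ring buffer may still have it too, if the box hasn't been rebooted
dmesg | grep -iE "oom|out of memory"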

MarkESmith
Obsidian | Level 7

@alexal,

 

Thanks for your input! I sifted through /var/log/messages and found the following curiosity:

 

Jul 10 00:11:13 kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Jul 10 00:11:13 kernel: [<ffffffff81188ab6>] out_of_memory+0x4b6/0x4f0
Jul 10 00:11:13 kernel: Out of memory: Kill process 12411 (sas) score 889 or sacrifice child
Jul 10 00:11:13 kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Jul 10 00:11:13 kernel: [<ffffffff81188ab6>] out_of_memory+0x4b6/0x4f0
Jul 10 00:11:13 kernel: Out of memory: Kill process 12411 (sas) score 889 or sacrifice child

 

It looks like this could have been caused by the out-of-memory killer. I grep'd the other archived 'messages' log files and this doesn't occur anywhere else. If the kernel killed a child process of java, that definitely could have made the LASR server unstable and caused the crash. There doesn't seem to be any viable way to determine exactly which process the PID in question belonged to (other than the fact that it was a 'sas' process).
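
For reference, my sweep over the rotated logs was roughly this (not the exact command; zgrep handles both the compressed and plain files):

zgrep -Ei "oom-killer|out of memory|killed process" /var/log/messages*
# /proc/<pid>/oom_score is only useful while a process is still alive, so too late for 12411:
cat /proc/12411/oom_score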

 

I haven't looked through the objectSpawner log yet, but I will.

 

Edit: I have a somewhat shallow understanding of how out-of-memory kills work, but as I read these messages, the process that invokes the oom-killer is simply the one whose memory request couldn't be satisfied, and the kernel then picks a victim by its badness score. If that's right, Java triggered the kill, but the 'sas' process with PID 12411 (score 889) is what the kernel chose to sacrifice.

SASKiwi
PROC Star

How often do you reboot your SAS VA servers? We have ours on a monthly reboot schedule and that has certainly helped maintain good reliability and performance. If there are any orphan processes chewing up resources then regular reboots will fix these.
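
In cron terms it doesn't need to be anything fancier than something like this (purely illustrative; the timing and command will depend on your own maintenance window):

# Reboot the VA host at 02:00 on the 1st of every month (illustrative only)
0 2 1 * * /sbin/shutdown -r now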

MarkESmith
Obsidian | Level 7

As of now, we've only been rebooting them when we've experienced problems or have had to restart the physical server. Did you experience problems like this before rebooting the VA servers monthly?

Is it possible that a user-initiated process, like a query, could turn rogue and end up eating up all the memory?

PaulS_
Fluorite | Level 6

Over the years, I've found that some SAS processes exhibit signs of memory leaks or other "stability" issues, so restarting them periodically seems to be A Good Thing (tm). So every two weeks we stop all SAS processes, then start them all (in the proper order). Taking advantage of the restart, while all processes are stopped we also perform a "cold" backup of all relevant config, data, etc. One can never have too many backups.
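
Roughly the shape of our fortnightly script, with placeholder paths (your Lev1 config directory and backup target will differ):

#!/bin/sh
SASCONFIG=/opt/sas/config/Lev1          # placeholder for your Lev1 directory

# 1. Stop all SAS servers; sas.servers handles the dependency order.
$SASCONFIG/sas.servers stop

# 2. Cold backup of the configuration while nothing is running.
tar czf /backups/sas_config_$(date +%Y%m%d).tar.gz "$SASCONFIG"

# 3. Start everything back up in the proper order.
$SASCONFIG/sas.servers start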

MarkESmith
Obsidian | Level 7

I might have to take your recommendation. This problem just occurred again, and this time I'm going to reboot the whole machine.

 

After looking at /var/log/messages, it definitely appears to be some sort of memory leak, because I just received another 'out-of-memory kill process' error.
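
In the meantime I'm sketching a crude memory watch so the next OOM kill leaves a trail to correlate against (the interval and log path are just what I picked):

# crontab entry: snapshot overall memory and the top RSS consumers every 15 minutes
*/15 * * * * (date; free -m; ps -eo pid,user,rss,comm --sort=-rss | head -15) >> /var/log/sas_mem_watch.log 2>&1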

SASKiwi
PROC Star

@MarkESmith - If you monitor SAS VA web server memory usage in SAS Environment Manager, you will see it gradually increase over time, especially spread over several weeks. A regular reboot will drop that back to a starting minimum. It's good server administration as well; we apply OS patches at the same time.

MarkESmith
Obsidian | Level 7

Thanks for the input. I'm hoping that this reboot will mitigate problems for a while and in the meantime, I can work on a scripted way to reboot the machine and start up all the necessary servers.

 

It appears that the 'LASR Analytic Server' is automatically started either upon reboot or upon execution of the 'sas.servers' script. I assume there must be a way that the 'Public LASR Analytic Server' can be scripted to start as well, correct?
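
In case it's useful to anyone later, the direction I'm experimenting with is running a small batch SAS job right after sas.servers finishes; the program name and every path below are placeholders for whatever your site uses, not anything from the docs:

# Start the core servers, then kick off a batch job that brings up / reloads
# the Public LASR server (start_public_lasr.sas is a made-up name here).
/opt/sas/config/Lev1/sas.servers start
/opt/sas/sashome/SASFoundation/9.4/sas \
    -sysin /opt/sas/admin/start_public_lasr.sas \
    -log   /opt/sas/admin/logs/start_public_lasr.log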

alexal
SAS Employee

@MarkESmith ,

 

Thanks. The LASR server isn't a Java application, but the non-distributed LASR server started from the VA Administration Console will depend on these components:

 

  • Object Spawner
  • Web Server

I'd like to see more details about the process with ID 12411 if you're able to find anything in the log files (maybe the SAS logs). Anyway, what you've found could potentially have killed the LASR server.
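
Something like this across the Lev1 log directories might turn up a reference (the config path is a guess at your layout; adjust it for your install):

# Recursively search every .log under the config tree for that PID
grep -rl --include="*.log" "12411" /opt/sas/config/Lev1/ 2>/dev/null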

MarkESmith
Obsidian | Level 7

Unfortunately, I was not able to find anything illuminating about that process ID in the log files, and nothing seemed suspect in the ObjectSpawner logs either. Have you heard of this out-of-memory problem happening on SAS installations before? I'm just wondering if user-initiated requests could snowball into a problem like this (specifically queries). Our machine has PLENTY of memory, so I can't imagine the constraint lies there.

 

The other worry is whether this problem is only a symptom indicative of something worse (that may recur) or if this is more likely a one-time thing that I could prevent (or mitigate) in the future by restarting the VA servers on a monthly basis.
