We have a TKGrid with 1 master and 4 worker nodes. Recently there were instances where a worker node went down (the physical server failed), which made all the data loaded on the LASR servers inaccessible. That part we can understand. However, we cannot understand why the LASR server itself also goes down when a worker node in the TKGrid goes down.
I have seen this before, and I would check with your SAS and Hadoop experts, or with SAS Technical Support. The cause is often the configuration of resilience, data parity and replication of the data, although it could also be a hotfix or patch that needs to be installed on either SAS or the Hadoop filesystem.
BUT WHAT ABOUT LASR? Our LASR Workers are co-located with the Hadoop Data Nodes. So when that one host machine died, we not only lost the Hadoop services on that machine, we also lost any LASR Worker services running on it. A logical SAS LASR Analytic Server is not robust enough to seamlessly work around the loss of a LASR Worker node. In our failure scenario, the loss of one physical host machine has corrupted the running logical SAS LASR Analytic Server: it lost the subset of data that the failed worker held in RAM. There is no option but to manually stop the entire SAS LASR Analytic Server. With LASR not running, the data it was responsible for isn’t available, which means that SAS Visual Analytics cannot serve up any reports that rely on that data. Fortunately, we have the option of restarting the LASR process. As the Root and remaining Worker Nodes come online, they will automatically form up into the same logical SAS LASR Analytic Server as before, the only difference being that one node is now missing. After our SAS LASR Analytic Server is back online, we can then reload data into it. All of the data we need is still available as SASHDAT tables in HDFS thanks to the built-in availability design of Hadoop.
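If it helps to picture why the in-memory side breaks while the HDFS side survives, here is a toy Python model. It is not SAS code and does not talk to LASR or Hadoop: the function names, the round-robin placement, and the 8-block table are all assumptions made up for illustration, and the replication factor of 3 is simply the usual HDFS default. Each block lives in RAM on exactly one worker, but on disk on several nodes.

```python
from itertools import cycle


def place_in_memory(blocks, workers):
    """Toy LASR placement: each block sits in RAM on exactly one worker."""
    return dict(zip(blocks, cycle(workers)))


def place_on_disk(blocks, nodes, replication=3):
    """Toy HDFS placement: each block is copied to `replication` distinct nodes."""
    count = len(nodes)
    return {
        block: {nodes[(i + k) % count] for k in range(replication)}
        for i, block in enumerate(blocks)
    }


def lost_from_memory(placement, failed):
    """Blocks whose only in-memory copy was on the failed worker."""
    return sorted(b for b, w in placement.items() if w == failed)


def readable_from_disk(placement, failed):
    """Blocks that still have at least one replica on a surviving node."""
    return sorted(b for b, replicas in placement.items() if replicas - {failed})


# Hypothetical grid: 4 co-located worker/data nodes and a table of 8 blocks.
workers = ["worker1", "worker2", "worker3", "worker4"]
blocks = list(range(8))
in_memory = place_in_memory(blocks, workers)
on_disk = place_on_disk(blocks, workers)

failed = "worker3"
lost = lost_from_memory(in_memory, failed)
readable = readable_from_disk(on_disk, failed)

# After a restart, the full table is reloaded across the surviving workers.
survivors = [w for w in workers if w != failed]
reloaded = place_in_memory(readable, survivors)

print("lost from RAM:", lost)
print("still readable from disk:", readable)
print("workers after reload:", sorted(set(reloaded.values())))
```

In this sketch the failed worker takes its share of the in-memory table with it, so the table held in RAM is no longer complete, while every block can still be read from a surviving replica and reloaded onto the three remaining workers.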