Restoring persisted state after an ESP failover

2 Likes

A highly available SAS ESP architecture consists of 1+N machines that host the SAS ESP server where the ESP project(s) runs. If a failover occurs it is important to bring the failed instance back online as soon as possible. This is especially true if there are only two ESP instances configured (possible but not recommended). Bringing the failed instance back in sync with the existing instances is not an overly-complex exercise but it requires an understanding of key components and steps that need to be taken.

For greater detail on failover, read ESP Failover using Kafka, which explains the components and steps for ESP failover using Kafka as the message bus. In this article, we will review those components and actions required to restore a failed ESP server and bring it in-sync with other running servers.

First a Recap

Before we dig into the restoration process, let's quickly review the failover architecture using Kafka as the message bus.

In the following architecture there are three ESP servers running the same model. There is one active instance and two standby instances. However all are running in-sync in terms of processing events/messages. The active instance is the only instance that is feeding events/messages to the Kafka topic via the subscribe connector and adapter(s). The message ID used to keep track of which event has been processed is managed via ZooKeeper and the subscribe connector. For more in-depth insight, check out the Failover using Kafka article noted above.

Select any image to see a larger version.

Failure

If the active instance fails one of the two remaining standby instances will resume processing. The subscribe connector on the instance transitioning to active knows at which event it should resume processing as ZooKeeper is tracking the message IDs. That standby instance then becomes the active instance.

Persisting and restoring the failed instance

Maintaining high availability requires bringing the failed instance back online and in-sync with the remaining instances as soon as possible. But how is this accomplished?

Persisting

The persist process is the means of saving the state of a running server, either active or standby. This process will save the state of the server to disk at the point it was taken. There are several steps that need to be taken to capture the state.

Stop the running project so no events are flowing through the model
Persist the state to disk
Start the project to resume processing
Start the connectors within the project (the connectors are not started when the project is started)

Each of these steps can be achieved using the ESP client. Here are examples of using the ESP client using the HTTP protocol to accomplish the steps above.

# Stop the running project
dfesp_xml_client -url “http://intesp04.****.***.com:5556/SASESP/projects/dstatkafka/state?value=stopped” -put
# Persist the project to directory /opt/sas/dstat/failoverdfesp_xml_client -url “http://intesp04.****.***.com:5556/SASESP/projects/dstatkafka/state?value=persisted&path=/opt/sas/dstat/failover/” -put
# Start the project
dfesp_xml_client -url “http://intesp04.****.***.com:5556/SASESP/projects/dstatkafka/state?value=running” -put
# Start the connectors (they are not started when the project is started)
dfesp_xml_client -url “http://intesp04.****.***.com:5556/SASESP/projects/dstatkafka/state?value=connectorsStarted” -put

After the connectors have been started the project will be resumed. It may take a few minutes to become fully in sync with the other ESP servers. If the above statements are scripted the output from each of the commands would look similar to the following.

<response>
<message>project dstatkafka successfully stopped</message>
</response>
<response>
<message>
<response>
<message>project dstatkafka successfully saved to /opt/sas/dstat/failover/</message>
</response>
</message>
</response>
<response>
<message>project dstatkafka successfully started</message>
</response>
<response>
<message>connectors started for project 'dstatkafka'</message>
</response>

So what does a persisted directory structure look like? Here is a screen shot of a persisted directory. Everything under the failover directory is created by the persist request.

In this case the size of the directory is 56MB. Obviously this is relatively small but subject to change over time depending on retention policies.

This is something to keep in mind as the larger the model and data retained in the model, the longer it will take and more disk space it will require to store the state.

There are two methods of storing the current state.

Persist the project state locally for each ESP server
Persist the project state to shared storage from a standby server (e.g. NFS mounted file system)

Persist locally

If you persist the state locally for each ESP server it requires that each project be stopped, persisted and restarted. As you can imagine performing these steps on the Active instance will create a "pause" in processing events. The length of this pause will be dependent upon the size of the data persisted and the speed of the storage. This is not ideal as real-time systems will be impacted by the pause. And as you can imagine this should be on a storage that is not susceptible to failure.

This diagram shows each instance persisting the project to local storage.

Note that this diagram does not represent the testing environment referenced throughout the article.

Persist to shared storage

As noted above the alternative is to persist the ESP project from a single standby server to storage shared by each instance. This method has a couple of benefits. First there will only be one copy of the persisted store and each instance can restore from that copy. In addition there is no need to pause the active instance. The one small downside of this method is that automation will need to be built to track the active and standby instances to ensure the project is persisted from a standby instance.

This diagram shows a standby instance persisting the state of a standby instance to shared storage. Although not shown here all three machines have read/write access to the NFS mounted file system.

Note that this diagram does not represent the testing environment referenced throughout the article.

Scheduling persistence

In most deployments the most effective way to ensure that an ESP server is persisted is to schedule it via cron or some other scheduling tool. The commands to persist the state of a standby instance shown earlier can be converted to a simple script. This can be scheduled as needed. Depending on the amount of data processed and system requirements it may be necessary to schedule it multiple times per day. This will reduce the amount of "recovery" time to be in sync with the other instances.

Restoring a persisted project

When an ESP server failure occurs it must be restarted and the project loaded and started to bring the server in-sync with the running instances. Here are the steps for restoring a persisted project.

Start the XML server without loading a project
Load the project but don't start it
Restore the project from the persisted state
Start the project
Start the connectors

Like the persist process all but the first step can be executed using the ESP client. Here are command line statements for each of the steps.

# On intesp03 as sasesp
# Start the ESP XML server but do not load a project
nohup dfesp_xml_server -pubsub 5555 -http 5556 -badevents ~/dstat_badevents.txt > /opt/sas/dstat/intesp03_dstat_`date +%d-%m-%y.%H.%M.%S`.log &
# Load the project but do not start it
dfesp_xml_client -url “http://intesp03.****.***.com:5556/SASESP/projects/dstatkafka?start=false” -put “file:///home/sasesp/dstatkafka.xml”
# Restore the state of the model from the persisted state
dfesp_xml_client -url “http://intesp03.****.***.com:5556/SASESP/projects/dstatkafka/state?value=restored&path=/opt/sas/dstat/failover/” -put
# Start the project
dfesp_xml_client -url “http://intesp03.****.***.com:5556/SASESP/projects/dstatkafka/state?value=running” -put
# Start the associated connectors
dfesp_xml_client -url “http://intesp03.****.***.com:5556/SASESP/projects/dstatkafka/state?value=connectorsStarted” -put

If successful these commands will generate the following responses.

<message>load project dstatkafka succeeded</message>

<response>
<message>project 'dstatkafka' successfully restored from '/opt/sas/dstat/failover/'</message>
</response>

<response>
<message>project dstatkafka successfully started</message>
</response>

<response>
<message>connectors started for project 'dstatkafka'</message>
</response>

Upon reviewing the log from the ESP XML server we can see the progress from above statements. After the project is loaded, restored and started, the server is switched to Standby state but no processing activity is started until the connectors are started. Once the connectors are started processing of events stored in the Kafka topic resumes and communication with ZooKeeper is established allowing it to sync with the other ESP servers.

2019-01-17T13:22:37,289; INFO ; 00000004; DF.ESP; (dfESPengine.cpp:762); [Engine0007] dfESPengine::initialize() dfESPengine version 5.2 completed initialization
2019-01-17T13:22:37,292; INFO ; 00000004; DF.ESP; (Esp.cpp:668); [XMLServer0001] esp engine started, version 5.2, pubsub: 5555
2019-01-17T13:22:37,292; INFO ; 00000004; DF.ESP; (Esp.cpp:791); [XMLServer0001] starting esp server
 
2019-01-17T13:22:37,292; INFO ; 00000010; DF.ESP; (Http.cpp:602); [XMLServer0001] starting HTTP server on port 5556, type=admin
2019-01-17T13:25:19,990; INFO ; 00000013; DF.ESP; (dfESPconnector.cpp:1378); [Connectors0087] dfESPconnector::setState(): Connector in group default has new state: stopped
2019-01-17T13:25:19,992; INFO ; 00000013; DF.ESP; (C_dfESPpubsubApi.cpp:406); [PubSub0056] C_dfESPpubsubInit(): Successfully loaded pubsub plugin libesp_native_ppi-5.2.1.so from /opt/sas/viya/home/SASEventStreamProcessingEngine/5.2/lib/plugins/
2019-01-17T13:25:20,065; INFO ; 00000015; DF.ESP; (dfESPwindow_aggregate.cpp:209); [Aggregate0017] dfESPwindow_aggregate::postInputFinalizer() for window Aggregate1: Aggregate window using additive optimizations
2019-01-17T13:25:20,526; INFO ; 00000016; DF.ESP; (dfESPproject.cpp:790); [Project0009] dfESPproject::startPubSub() for project _meta_: Pub/Sub services are enabled for port 41241, encryption disabled, client authentication disabled
2019-01-17T13:25:20,526; INFO ; 00000016; DF.ESP; (dfESPproject.cpp:632); [Project0016] dfESPproject::finalizeProject() for project dstatkafka: Returned immediately, project was already finalized
2019-01-17T13:25:20,528; INFO ; 00000016; DF.ESP; (dfESPproject.cpp:790); [Project0009] dfESPproject::startPubSub() for project dstatkafka: Pub/Sub services are enabled for port 39038, encryption disabled, client authentication disabled
2019-01-17T13:25:20,601; INFO ; 00000029; DF.ESP; (dfESPkafkaConnector.cpp:2004); [Connectors0049] dfESPkafkaConnector::start(): Hot failover state switched to standby
2019-01-17T13:25:20,602; INFO ; 00000029; DF.ESP; (dfESPconnector.cpp:1378); [Connectors0087] dfESPconnector::setState(): Connector subkafka in group default has new state: running
2019-01-17T13:25:20,604; INFO ; 00000029; DF.ESP; (dfESPconnector.cpp:1378); [Connectors0087] dfESPconnector::setState(): Connector pubkafka in group default has new state: running
2019-01-17T13:25:20,607; INFO ; 00000036; DF.ESP; (dfESPkafkaConnector.cpp:294); [Connectors0098] dfESPkafkaConnector::z_watcher(): Zookeeper watcher event: watcher = SESSION_EVENT, state = CONNECTED_STATE, path =
2019-01-17T13:25:20,609; INFO ; 00000037; DF.ESP; (dfESPkafkaConnector.cpp:331); [Connectors0097] dfESPkafkaConnector::z_watcher(): Created zookeeper node /ESP/server-n_0000000132
2019-01-17T13:25:20,611; INFO ; 00000038; DF.ESP; (dfESPkafkaConnector.cpp:275); [Connectors0096] dfESPkafkaConnector::watchActiveZnode(): Watching zookeeper node /ESP/server-n_0000000131
2019-01-17T13:25:30,894; INFO ; 00000035; DF.ESP; (dfESPclientObj.cpp:585); [PubSub0081] dfESPclientObj::doRateProcessing(): Current publish rate: 3912.458950 events/sec

In this example the persisted store was placed on an NFS-mounted drive. The state was persisted from host intesp04, which was the standby instance before the failure. After the active instance failed on intesp03 the XML server was restarted on intesp03 and the persisted state created from intesp04 was used to restore the state on the newly started instance on intesp03.

Here we see a restored standby instance on the machine where the ESP server failure occurred (sasespsrv01). The restore process used a persisted store on an NFS mounted file system to sync up the ESP server on the first machine. The second machine (sasespsrv02) has become the active instance.

Note that this diagram does not represent the testing environment referenced throughout the article.

Understanding the restore process timeline

Now that the steps for persisting and restoring an ESP server have been outlined, let's take a look at the timeline of events. If we consider the events/messages that are appended to a Kafka topic, those events/messages are retained for a period of time as defined by Kafka system settings. The time window for those events must precede the cycle for the schedule persist. Let's attempt to illustrate this.

The following time line shows a continuous three-day flow. The scheduled persist appears to run early in the morning and the retention within the windows of the project typically contain data for more than a day. The retention for Kafka also appears to keep events for slightly more than a day. But the key is the events in Kafka precede the time when the scheduled persist occurs.

So what happens when a failure occurs? In the following time line we see a failure occurs around noon. But the restore from the persisted state does not occur until evening. This is OK because the restore will be applied to a standby ESP server instance and the events/messages in Kafka with which it needs to sync precede the scheduled persist time. The initial state of the restored instance will be in the state when it was persisted and the events/messages required to sync will be identified in Kafka and processed until in-sync with the other ESP servers.

A noteworthy item

During persist and restore testing I encountered an issue with the ESP 5.2 Kafka connector. After persisting a standby instance, and subsequently starting the project and connectors so that it was processing again, the standby instance failed to transition to active state when the initial active instance was terminated. The issue was identified and will be fixed in ESP 6.1.

Final Thoughts

So there you have it. A way to store the state of an ESP server and then use that stored state to restore the state of a failed server instance. Hope it was of some value to you.

My thanks to Andy Tracy and Andrey Matveenko for their contributions to and input on this article.