Hi,
The short answer is that there are different strategies, depending on exactly what is done with the data and on the required latency.
Some important questions: Why is the data kept for such a long retention time? Is it for reference lookup? Pattern detection? Or rolling aggregations? I would guess it is for rolling aggregations. Then the next question is the step granularity of the larger rolling aggregations (7 and 30 days): is it per event or per day? If it is per day, the best approach would probably be cascading aggregation, using copy/aggregation and stateful/stateless sequences wisely, so that we only keep in memory the events for the last day plus the aggregated values for the weeks and months.
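To illustrate the cascading idea, here is a minimal, engine-agnostic sketch in plain Python (the `DailyAggregate` structure, the sum-based aggregate, and the 30-day cap are my own illustrative assumptions, not any specific product's API):

```python
from collections import deque
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class DailyAggregate:
    day: date
    count: int
    total: float  # sum of the event values for that day


class CascadingAggregator:
    """Keeps raw events only for the current day; each finished day is
    collapsed into one DailyAggregate, from which the 7- and 30-day
    rolling values are recomputed."""

    def __init__(self) -> None:
        self.current_day: date | None = None
        self.todays_values: list[float] = []
        # At most 30 small daily aggregates are retained.
        # (Sketch assumes events arrive every day; calendar gaps ignored.)
        self.daily: deque[DailyAggregate] = deque(maxlen=30)

    def add(self, ts: datetime, value: float) -> None:
        if self.current_day is None:
            self.current_day = ts.date()
        if ts.date() != self.current_day:
            self._roll_over()
            self.current_day = ts.date()
        self.todays_values.append(value)

    def _roll_over(self) -> None:
        # Collapse the finished day into a single aggregate and
        # drop the raw events for that day.
        self.daily.append(DailyAggregate(
            day=self.current_day,
            count=len(self.todays_values),
            total=sum(self.todays_values),
        ))
        self.todays_values.clear()

    def rolling_sum(self, days: int) -> float:
        # Aggregate-of-aggregates: combine the last `days` daily totals
        # plus the still-open current day.
        recent = list(self.daily)[-days:]
        return sum(d.total for d in recent) + sum(self.todays_values)
```

Memory then stays proportional to one day of raw events plus at most 30 small daily aggregates, instead of 30 full days of raw events.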
But of course, it also depends on the type of aggregation functions used. Do you require granularity at the event level, or can you accommodate aggregating from already-aggregated levels?
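To make that distinction concrete (again plain Python, nothing engine-specific): sums and counts compose exactly from daily aggregates, while something like a median does not.

```python
import statistics

day1 = [1.0, 2.0, 100.0]
day2 = [3.0, 4.0]

# Sum and count are mergeable: combining daily aggregates is exact.
total = sum(day1) + sum(day2)   # == sum(day1 + day2)
count = len(day1) + len(day2)
mean = total / count            # exact 2-day mean

# Median is not mergeable: a median of daily medians is generally wrong.
median_of_medians = statistics.median(
    [statistics.median(day1), statistics.median(day2)])  # 2.75
true_median = statistics.median(day1 + day2)             # 3.0
```

If all your functions are mergeable (sum, count, min, max, ...), the cascading approach works exactly; otherwise you need event-level data, or an approximation you are willing to accept.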
On the other hand, if there is no need for low latency, using a persistent store such as a fast database could be a good solution, but then there are more effective ways than a join for doing this. I would basically use a procedural window, unless we can accommodate a much higher latency (more than a few seconds) and are not limited by the asynchronicity of DB reads/writes. But we would then also need more details about the required data processing to define the best approach.
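Since "procedural window" is product-specific, here is only the general shape of the alternative: when higher latency is acceptable, batch the writes to the persistent store rather than hitting it per event, so the round-trip cost is amortized. A rough sketch (sqlite3, the `events` table, and the batch size are purely illustrative assumptions; a real deployment would use whatever fast database you choose, and batch its reads the same way):

```python
import sqlite3

conn = sqlite3.connect("retention.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (ts TEXT, value REAL)")

buffer: list[tuple[str, float]] = []
BATCH_SIZE = 1000  # tune against the latency you can accommodate

def on_event(ts: str, value: float) -> None:
    # Buffer events instead of writing (or joining) per event.
    buffer.append((ts, value))
    if len(buffer) >= BATCH_SIZE:
        flush()

def flush() -> None:
    # One transaction per batch rather than one per event.
    with conn:
        conn.executemany("INSERT INTO events VALUES (?, ?)", buffer)
    buffer.clear()
```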
Hope this helps,
Fred