Both systems pick different trade-offs: Flink doesn't persist intermediate state...

Both systems pick different trade-offs:

Flink doesn't persist intermediate state synchronously at all. It runs asynchronous global snapshots in the background, which avoid capturing in-flight messages, just store state, aligned through epoch markers (a synchronization step). On a failure, typically seconds of work need to be redone. That's fine, because it is for analytics, and that approach results in good throughput.

Restate won't start step 2 of a sequence before step 1's result is durable, so it needs to make sure that this durability is achieved quickly. It does frequent (batched) log appends, and each partition does that by itself without synchronizing with others. The result is faster (latency) because it is more fine grained and has less coordination, but it is also more work that is done.