This post seems to have been published in a hurry. The "How it works" section has a bunch of duplication; I think they should make sure the blog post is delivered exactly once :) Excerpt from the blog post:
> During replay, your code runs from the beginning but skips over completed checkpoints, using stored results instead of re-executing completed operations. This replay mechanism ensures consistency while enabling long-running executions.
>
> ... During replay, your code runs from the beginning but skips over completed checkpoints, using stored results instead of re-executing completed operations. This replay mechanism ensures consistency while enabling long-running executions.
I really enjoyed this post and love seeing more lightweight approaches! The deep dive on tradeoffs between different durable-execution approaches was great. For me, the most interesting part is Persistasaurus's (cool name, btw) use of bytecode generation via ByteBuddy, which is a clever way to improve DX: it can transparently intercept step functions and capture execution state without requiring explicit API calls.
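For readers who haven't seen the technique, here's a minimal sketch of what ByteBuddy-based interception can look like (a generic illustration, not Persistasaurus's actual code; the `@Step` annotation and `StepInterceptor` are made up for the example):

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;
import java.util.concurrent.Callable;

import net.bytebuddy.ByteBuddy;
import net.bytebuddy.implementation.MethodDelegation;
import net.bytebuddy.implementation.bind.annotation.AllArguments;
import net.bytebuddy.implementation.bind.annotation.Origin;
import net.bytebuddy.implementation.bind.annotation.RuntimeType;
import net.bytebuddy.implementation.bind.annotation.SuperCall;
import net.bytebuddy.matcher.ElementMatchers;

// Hypothetical marker annotation for durable steps.
@Retention(RetentionPolicy.RUNTIME)
@interface Step {}

public class InterceptionSketch {

    // Interceptor wrapping every @Step method; a real engine would look up
    // a stored result here and skip execution on replay.
    public static class StepInterceptor {
        @RuntimeType
        public static Object intercept(@Origin Method method,
                                       @AllArguments Object[] args,
                                       @SuperCall Callable<?> original) throws Exception {
            System.out.println("before step: " + method.getName());
            Object result = original.call(); // or return a checkpointed result instead
            System.out.println("after step: " + method.getName());
            return result;
        }
    }

    public static class Flow {
        @Step
        public String fetchGreeting() {
            return "hello";
        }
    }

    public static void main(String[] args) throws Exception {
        // Generate a subclass of Flow whose @Step methods route through the interceptor.
        Flow flow = new ByteBuddy()
                .subclass(Flow.class)
                .method(ElementMatchers.isAnnotatedWith(Step.class))
                .intercept(MethodDelegation.to(StepInterceptor.class))
                .make()
                .load(InterceptionSketch.class.getClassLoader())
                .getLoaded()
                .getDeclaredConstructor()
                .newInstance();

        System.out.println(flow.fetchGreeting()); // interception happens transparently
    }
}
```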
(Disclosure: I work on DBOS [1]) The author's point about the friction from explicit step wrappers is fair, as we don't use bytecode generation today, but we're actively exploring it to improve DX.
> The author's point about the friction from explicit step wrappers is fair, as we don't use bytecode generation today, but we're actively exploring it to improve DX.
There is value in such a wrapper/call at invocation time instead of using the proxy pattern. Specifically, it makes it very clear to both the code author and the code reader that this is not a normal method invocation. That matters because normal method invocations are very common, and the caller needs to write code knowing the difference. Java developers, perhaps more than most, likely prefer that kind of invocation explicitness over a JVM agent doing bytecode manipulation.
There is another reason to prefer a wrapper-like approach: providing options. If you need to pass options (say, timeout info) from the call site, that's hard to do when the call is limited to the signature of the implementation; the options then have to live somewhere else.
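As a purely hypothetical illustration of that second point (none of these types come from a real SDK), an explicit wrapper gives the call site an obvious place for per-call options:

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical step-runner API, for illustration only.
class StepOptions {
    final Duration timeout;
    StepOptions(Duration timeout) { this.timeout = timeout; }
    static StepOptions timeout(Duration d) { return new StepOptions(d); }
}

class Steps {
    // The explicit call makes the non-local semantics visible to the reader
    // and gives the caller somewhere to pass options.
    static <T> T run(String name, StepOptions options, Supplier<T> body) {
        System.out.println("running step '" + name + "' with timeout " + options.timeout);
        return body.get();
    }
}

public class WrapperSketch {
    public static void main(String[] args) {
        String receipt = Steps.run("charge-card",
                StepOptions.timeout(Duration.ofSeconds(30)),
                () -> "receipt-42");
        System.out.println(receipt);
    }
}

// versus the proxy style, where chargeCard(order) looks like a normal call
// and the timeout has to live somewhere else (annotation, config, ...).
```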
I'm still swinging back and forth on which approach I ultimately prefer.
As stated in the post, I like how the proxy approach largely avoids any API dependency. I'd also argue that Java developers actually are very familiar with this kind of implicit enrichment of behaviors and execution semantics (e.g. transaction management is woven into Spring or Quarkus applications that way).
But there are also limits to this in regard to flexibility. For example, if you wanted to delay a method for a dynamically determined period of time rather than for a fixed one, the annotation-based approach would fall short.
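To make that concrete, here's a hypothetical contrast (the `@DelayFor` annotation and `Engine.sleep` helper are invented for this example): an annotation can only carry a compile-time constant, while a dynamic delay has to be expressed as a call:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.time.Duration;

// Hypothetical annotation: the delay must be a compile-time constant.
@Retention(RetentionPolicy.RUNTIME)
@interface DelayFor {
    long seconds();
}

public class DelaySketch {

    @DelayFor(seconds = 300)          // fixed: fine for "always wait 5 minutes"
    void sendReminder() { /* ... */ }

    // Dynamic: the wait depends on runtime data, so it has to be an explicit
    // call into the engine (hypothetical Engine.sleep stand-in below).
    void sendReminderDynamic(Duration computedBackoff) {
        Engine.sleep(computedBackoff);
        /* ... */
    }

    // Stand-in for whatever durable-sleep primitive the engine exposes.
    static class Engine {
        static void sleep(Duration d) {
            System.out.println("durably sleeping for " + d);
        }
    }
}
```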
At Temporal, for Java we took a hybrid of the approaches you describe. Specifically, we use the java.lang.reflect.Proxy approach, but the user has to make a call to instantiate the proxy from the implementation. This lets users provide those options at proxy creation time and doesn't require them to configure a build step. I can't speak for all JVM people, but I get nervous if I have to use a library that requires an agent or annotation processor.
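For readers who haven't used it, here's a bare-bones sketch of that pattern with java.lang.reflect.Proxy (a generic illustration, not Temporal's actual SDK; the stub factory and options type are made up):

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.time.Duration;

public class ProxySketch {

    // The user-defined contract; it may have no local implementation at all.
    interface OrderActivities {
        String chargeCard(String orderId);
    }

    // Hypothetical per-stub options, supplied when the proxy is created.
    record ActivityOptions(Duration startToCloseTimeout) {}

    // The user explicitly asks for a stub, so it is obvious these are not
    // normal method calls, and the options have an obvious home.
    static <T> T newStub(Class<T> iface, ActivityOptions options) {
        InvocationHandler handler = (proxy, method, args) -> {
            System.out.println("dispatching " + method.getName()
                    + " with timeout " + options.startToCloseTimeout());
            // A real engine would checkpoint, schedule remotely, retry, etc.
            return "result-of-" + method.getName();
        };
        return iface.cast(Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] { iface }, handler));
    }

    public static void main(String[] args) {
        OrderActivities activities =
                newStub(OrderActivities.class, new ActivityOptions(Duration.ofSeconds(30)));
        System.out.println(activities.chargeCard("order-123"));
    }
}
```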
Also, since Temporal activity invocations are (often) remote, many times a user may only have the definition/contract of the "step" (aka activity in Temporal parlance) without a body. Finally, many times users _start_ the "step", not just _execute_ it, which means it needs to return a promise/future/task. Sure, this can be wrapped in a suspended virtual thread, but that makes reasoning about things like cancellation harder, and from a client-not-workflow POV, it makes it harder to reattach to an invocation in a type-safe way to, say, wait for the result of something started elsewhere.
We did the same proxying approach for TypeScript, but as we got to Python, .NET, and Ruby we saw that being able to _reference_ a "step" while also providing options, and having many overloads/approaches for invoking that step, has benefits.
To build on what Peter said, we need to stay focused and make one backend solid before expanding. But architecturally, nothing prevents us from supporting more engines going forward.
I'm excited about this because durable workflows are really important for making AI applications production ready :) Disclaimer: I'm working on DBOS, a durable workflow library built on Postgres, which looks complementary to this.
I asked their main developer Dillon about the data/durability layer and also the compilation step. I wonder if adding a "DBOS World" will be feasible. That way, you get Postgres-backed durable workflows, queues, messaging, streams, etc all in one package, while the "use workflow" interface remains the same.
Here is the response from Dillon, and I hope it's useful for the discussion here:
> "The primary datastore is dynamodb and is designed to scale to support tens of thousands of v0 size tenants running hundreds of thousands of concurrent workflows and steps."
> "That being said, you don't need to use Vercel as a backend to use the workflow SDK - we have created a interface for anyone to implements called 'World' that you can use any tech stack for https://github.com/vercel/workflow/blob/main/packages/world/..."
> "you will require a compiler step as that's what picks up 'use workflow' and 'use step` and applies source transformations. The node.js run time limitations only apply to the outer wrapper function w/ `use workflow`"
DBOS naturally scales to distributed environments, with many processes/servers per application and many applications running together. The key idea is to use the database concurrency control to coordinate multiple processes. [1]
When a DBOS workflow starts, it’s tagged with the version of the application process that launched it. This way, you can safely change workflow code without breaking existing ones. They'll continue running on the older version. As a result, rolling updates become easy and safe. [2]
So applications continuously poll the database for work? Have you done any benchmarking to evaluate the throughput of DBOS when running many workflows, activities, etc.?
In DBOS, workflows can be invoked directly as normal function calls or enqueued. Direct calls don't require any polling. For queued workflows, each process runs a lightweight polling thread that checks for new work using `SELECT ... FOR UPDATE SKIP LOCKED` with exponential backoff to prevent contention, so many concurrent workers can poll efficiently. We recently wrote a blog post on durable workflows, queues, and optimizations: https://www.dbos.dev/blog/why-postgres-durable-execution
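For anyone unfamiliar with the pattern, here's a rough JDBC sketch of `FOR UPDATE SKIP LOCKED` polling (a generic illustration with a made-up table, not DBOS's actual schema or code):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class QueuePollSketch {

    // Claim one pending task. Concurrent workers running the same query skip
    // rows another worker has already locked instead of blocking on them.
    static Long claimNextTask(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        String sql = """
                SELECT id FROM workflow_queue
                WHERE status = 'PENDING'
                ORDER BY created_at
                LIMIT 1
                FOR UPDATE SKIP LOCKED
                """;
        try (PreparedStatement select = conn.prepareStatement(sql);
             ResultSet rs = select.executeQuery()) {
            if (!rs.next()) {
                conn.rollback();
                return null; // nothing to do; caller backs off exponentially
            }
            long id = rs.getLong("id");
            try (PreparedStatement update = conn.prepareStatement(
                    "UPDATE workflow_queue SET status = 'RUNNING' WHERE id = ?")) {
                update.setLong(1, id);
                update.executeUpdate();
            }
            conn.commit();
            return id;
        }
    }
}
```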
Throughput mainly comes down to database writes: executing a workflow = 2 writes (input + output), and each step = 1 write, so a 5-step workflow costs about 7 writes. A single Postgres instance can typically handle thousands of writes per second, and a larger one can handle tens of thousands (or even more, depending on your workload size). If you need more capacity, you can shard your app across multiple Postgres servers.
I think one potential concern with "checkpoint execution state at every interaction with the outside world" is the size of the checkpoints. Allowing users to control the granularity by explicitly specifying the scope of each step seems like a more flexible model. For example, you can group multiple external interactions into a single step and only checkpoint the final result, avoiding the overhead of saving intermediate data. If you want finer granularity, you can instead declare each external interaction as its own step.
Plus, if the crash happens in the outside world (where you have no control), then checkpointing at finer granularity won't help.
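To illustrate the granularity point with made-up annotations and types (not any particular library's API), the same two external calls can be one coarse step or two fine-grained ones:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

public class GranularitySketch {

    // Hypothetical step marker, for illustration only.
    @Retention(RetentionPolicy.RUNTIME)
    @interface Step {}

    record User(String id) {}
    record Profile(String bio) {}
    record EnrichedUser(User user, Profile profile) {}

    // Stand-ins for calls to the outside world.
    static User fetchRemote(String userId) { return new User(userId); }
    static Profile lookupRemote(User user) { return new Profile("bio of " + user.id()); }

    // Coarse-grained: two external calls grouped into one step; only the final
    // EnrichedUser is checkpointed, and a crash mid-step re-runs both calls.
    @Step
    EnrichedUser fetchAndEnrich(String userId) {
        User user = fetchRemote(userId);
        Profile profile = lookupRemote(user);
        return new EnrichedUser(user, profile);
    }

    // Fine-grained: each external call is its own step, so each result is
    // checkpointed individually, at the cost of one extra write per step.
    @Step
    User fetchUser(String userId) { return fetchRemote(userId); }

    @Step
    Profile lookupProfile(User user) { return lookupRemote(user); }

    public static void main(String[] args) {
        GranularitySketch sketch = new GranularitySketch();
        System.out.println(sketch.fetchAndEnrich("user-1"));
        System.out.println(sketch.lookupProfile(sketch.fetchUser("user-1")));
    }
}
```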
Sure you get more control with explicit state management. But it’s also more work, and more difficult work. You can do a lot of writes to NVMe for one developer salary.
It's not really more work to be explicit about the steps and workflows. You already have to break your code into steps to make your program run. Adding a single decorator isn't much extra work at all.
I think a clearer way to think about this is that "at-least-once" message delivery plus idempotent workflow execution is effectively exactly-once event processing.
The DBOS workflow execution itself is idempotent (assuming each step is idempotent). When DBOS starts a workflow, the "start" (workflow inputs) is durably logged first. If the app crashes, then on restart DBOS reloads from Postgres and resumes from the last completed step. Steps are checkpointed, so they don't re-run once recorded.
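A stripped-down sketch of that replay logic (a generic illustration, not DBOS's implementation; an in-memory map stands in for Postgres):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class ReplaySketch {

    // Stand-in for the durable store keyed by (workflowId, stepIndex);
    // a real engine would use Postgres rather than an in-memory map.
    static final Map<String, Object> checkpoints = new HashMap<>();

    // Run a step at most once per workflow: if a result was already recorded,
    // return the stored result instead of re-executing the step.
    @SuppressWarnings("unchecked")
    static <T> T step(String workflowId, int stepIndex, Supplier<T> body) {
        String key = workflowId + "#" + stepIndex;
        if (checkpoints.containsKey(key)) {
            return (T) checkpoints.get(key); // replay: skip the side effect
        }
        T result = body.get();               // first execution: do the work
        checkpoints.put(key, result);        // checkpoint before moving on
        return result;
    }

    public static void main(String[] args) {
        // First run executes the step; a "recovered" second run replays it.
        String first = step("wf-1", 0, () -> { System.out.println("charging card"); return "receipt-42"; });
        String replayed = step("wf-1", 0, () -> { System.out.println("charging card"); return "receipt-43"; });
        System.out.println(first + " / " + replayed); // receipt-42 / receipt-42
    }
}
```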