Interesting approach. One thing I've been thinking about with agent review UIs is the state-representation problem: how do you diff what the agent "knew" at step N versus step N+1? If you can serialize the agent's cognitive state at each decision point (not just the code output), you can build much richer "why did it do that?" explanations.
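To make that concrete, here's a minimal sketch of what I mean by diffing serialized agent state. All names (`AgentSnapshot`, `diff_steps`, the `beliefs` dict) are hypothetical, not any real framework's API:

```python
from dataclasses import dataclass

@dataclass
class AgentSnapshot:
    """Hypothetical serialized agent state at one decision point."""
    step: int
    goal: str
    beliefs: dict          # what the agent "knows" at this step
    pending_actions: list

def diff_steps(a: AgentSnapshot, b: AgentSnapshot) -> dict:
    """Return belief keys whose values changed between two steps."""
    changed = {}
    for key in set(a.beliefs) | set(b.beliefs):
        before, after = a.beliefs.get(key), b.beliefs.get(key)
        if before != after:
            changed[key] = {"before": before, "after": after}
    return changed

s4 = AgentSnapshot(4, "fix bug", {"tests_pass": False, "root_cause": None}, ["run tests"])
s5 = AgentSnapshot(5, "fix bug", {"tests_pass": False, "root_cause": "off-by-one"}, ["patch loop"])
print(diff_steps(s4, s5))  # only root_cause changed between steps 4 and 5
```

The diff itself is the "why did it do that?" explanation: the reviewer sees exactly which beliefs changed before the agent chose its next action.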
Do you support rollback — i.e., if a reviewer rejects step 5, can the agent resume from step 4's state without replaying the whole chain?
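The rollback I have in mind is roughly checkpoint-and-restore rather than replay. A toy sketch (class and method names are mine, purely illustrative):

```python
import copy

class CheckpointedAgent:
    """Hypothetical agent that snapshots state before every step,
    so a rejected step can be discarded without replaying the chain."""

    def __init__(self):
        self.state = {"history": []}
        self.checkpoints = {}  # step number -> deep copy of state *after* that step

    def run_step(self, step: int, action: str):
        self.state["history"].append(action)
        self.checkpoints[step] = copy.deepcopy(self.state)

    def rollback_to(self, step: int):
        # Restore state as it was after `step`; later steps are simply dropped.
        self.state = copy.deepcopy(self.checkpoints[step])

agent = CheckpointedAgent()
for i, act in enumerate(["plan", "edit", "test", "refactor", "deploy"], start=1):
    agent.run_step(i, act)

agent.rollback_to(4)           # reviewer rejects step 5
print(agent.state["history"])  # ['plan', 'edit', 'test', 'refactor']
```

Deep-copying per step is obviously naive for large states; a real implementation would persist deltas, but the reviewer-facing semantics would be the same.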
The formal verification angle is what makes this interesting. Most coding agents optimize for "code that compiles and passes tests" — that's a low bar. Curious whether the proof artifacts are persisted for audit trails or thrown away after verification.
The Nano tier is the one I'm watching. For agent workflows where you're making dozens of LLM calls per task, the cost per call matters more than peak capability. Would be interesting to see benchmarks on function calling latency specifically — that's what matters for agents.