A 7 KB binary that runs an agent is impressive, but I'd guess it's very hard to define the FSM and implement the pipeline by hand. Is it necessary to break the agent down so atomically, given that difficulty?
I noticed that you implemented a high-performance VM fork. However, to me, it seems like a general-purpose KVM project. Is there a reason why you say it is specialized for running AI agents?
Fair question. The fork engine itself is general purpose -- you could use it for anything that needs fast isolated execution. We say 'AI agents' because that's where the demand is right now. Every agent framework (LangChain, CrewAI, OpenAI Assistants) needs sandboxed code execution as a tool call, and the existing options (E2B, Daytona, Modal) all boot or restore a VM/container per execution. At sub-millisecond fork times, you can do things that aren't practical with 100-200ms startup: speculative parallel execution (fork 10 VMs, try 10 approaches, keep the best), treating code execution like a function call instead of an infrastructure decision, etc.
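To make the speculative-execution idea concrete, here is a minimal sketch of the fork-N-keep-best pattern. It uses ordinary threads as a stand-in for VM forks, and the scoring function is a toy placeholder; in the real system each worker would be a forked VM evaluating one candidate approach.

```python
from concurrent.futures import ThreadPoolExecutor

def try_approach(seed: int) -> tuple[int, str]:
    # Stand-in for running one candidate inside a forked sandbox; the
    # score would normally come from evaluating the execution result.
    score = (seed * 37) % 10
    return score, f"approach-{seed}"

def speculative_best(n: int = 10) -> str:
    # "Fork" n workers, try n approaches concurrently, keep the best one.
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(try_approach, range(n)))
    return max(results)[1]
```

The point is that when a fork costs well under a millisecond, launching all n candidates and discarding n-1 of them is cheaper than deliberating about which single one to run.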
> you can do things that aren't practical with 100-200ms startup: speculative parallel execution (fork 10 VMs, try 10 approaches, keep the best), treating code execution like a function call instead of an infrastructure decision, etc.
Interesting concept. Do you think agent preferences come from the model itself or the agent's structure around it? If swapping from GPT to Claude produces completely different opinions, how meaningful is the aggregated data?
Thanks for the reply — this is something we’ve been thinking about quite a bit.
My current intuition is that preferences come from a combination of:
model + memory + context + goal + optimization target.
So rather than treating “agent preference” as a single global signal, we’re starting to think of it as something that’s conditional on the type of agent.
On the aggregation side, I agree this is a hard problem.
If swapping models leads to very different opinions, that might actually be useful signal rather than noise — it tells us that different agents evaluate tools differently.
Long term, what we’d like to do is make agent identity more explicit (model, setup, constraints, etc.), so instead of a single aggregated ranking, you can look at:
→ what GPT-based coding agents prefer
→ what cost-sensitive agents prefer
→ what retrieval-heavy agents prefer
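The conditional-aggregation idea above can be sketched as grouping preference records by agent identity before ranking, instead of ranking globally. All field names and records here are illustrative assumptions, not our actual schema:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical preference records; fields are illustrative only.
records = [
    {"model": "gpt", "tool": "toolA", "rating": 4},
    {"model": "gpt", "tool": "toolB", "rating": 2},
    {"model": "claude", "tool": "toolA", "rating": 1},
    {"model": "claude", "tool": "toolB", "rating": 5},
]

def rankings_by_identity(records, key=("model",)):
    # Bucket ratings by agent identity, then rank tools within each
    # bucket by mean rating, highest first.
    buckets = defaultdict(lambda: defaultdict(list))
    for r in records:
        ident = tuple(r[k] for k in key)
        buckets[ident][r["tool"]].append(r["rating"])
    return {
        ident: sorted(tools, key=lambda t: -mean(tools[t]))
        for ident, tools in buckets.items()
    }
```

With this shape, model swaps that flip the ranking show up as disagreement between buckets rather than noise washed out in a single global average.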
Good project, but are the constraints (never fabricate results, never modify credentials) enforced structurally, or are they prompt-level instructions the agent could technically ignore? For example, does the "score must not decrease" rule have a git hook that auto-reverts, or is it relying on something else?
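For reference, structural enforcement of a "score must not decrease" rule could look like the hook below (rejecting the commit rather than auto-reverting). Everything here is an assumption about the project, not its actual setup: the `./score.sh` command, the `.score_baseline` file, and installing this as `.git/hooks/pre-commit`.

```python
#!/usr/bin/env python3
# Hypothetical .git/hooks/pre-commit: block any commit that lowers the score.
import pathlib
import subprocess
import sys

BASELINE = pathlib.Path(".score_baseline")

def current_score() -> float:
    # Assumed project-specific command that prints a numeric score.
    out = subprocess.run(["./score.sh"], capture_output=True,
                         text=True, check=True)
    return float(out.stdout.strip())

def check(new: float, old: float) -> bool:
    # The enforced invariant: the score must never decrease.
    return new >= old

def main() -> int:
    new = current_score()
    old = float(BASELINE.read_text()) if BASELINE.exists() else float("-inf")
    if not check(new, old):
        print(f"score decreased ({old} -> {new}); commit rejected",
              file=sys.stderr)
        return 1
    BASELINE.write_text(str(new))
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

A hook like this makes the constraint a property of the repository rather than of the agent's prompt, which is the distinction the question is getting at.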