I suspect the same, but I wonder how effective it will be in practice. I mentioned a group that has been trying to formally verify SQL for 12 years, and they still haven't fully formalised even basic queries.
I think LLMs will speed things up, but will that result in popular libraries becoming verified, or in an entirely new class of libraries being built and verified from the ground up?
I do use ACL2, although I don’t do many proofs. When I do, the proofs usually go through automatically or require me to state only a few lemmas or tell the tool how to do the induction.
This is partially a commentary on how good the automation is in ACL2 and partially a commentary on the fact that I don’t use it for hard problems :)
I use it more for property based testing. The counterexample generator is very powerful because it leverages the knowledge base of the theorem prover (all the accumulated lemmas, etc.) among other things.
> The counterexample generator is very powerful because it leverages the knowledge base of the theorem prover (all the accumulated lemmas, etc.) among other things.
That's a really interesting application. Could be very powerful for what I'm doing. Thank you!
Happy to help. In general, I think counterexample generation is more important than proofs when it comes to software (most of our software is “trivially” wrong). The world might not be ready for full-blown formal verification, but I think it is ready for formal specification and counterexample generation.
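For anyone who wants the flavour of this without a theorem prover, here is a toy Python sketch of "specification plus counterexample search". Everything here is hypothetical (the buggy function, the spec predicate, the brute-force search) and has nothing to do with how ACL2's generator actually works, which is guided by the prover's lemma database rather than exhaustive enumeration:

```python
from itertools import product

def dedupe(xs):
    """Buggy: intended to remove duplicates while preserving first-occurrence order."""
    return list(set(xs))  # set iteration order ignores the input's order

def spec_holds(xs):
    """Spec: output equals the input's elements in order of first occurrence."""
    firsts = []
    for x in xs:
        if x not in firsts:
            firsts.append(x)
    return dedupe(xs) == firsts

def find_counterexample(max_len=3, alphabet=(0, 1, 2)):
    """Exhaustively search small inputs for a spec violation; None if all pass."""
    for n in range(max_len + 1):
        for xs in product(alphabet, repeat=n):
            if not spec_holds(list(xs)):
                return list(xs)
    return None

print(find_counterexample())  # a short input on which dedupe violates the spec
```

No proof attempt anywhere, and the bug still falls out in milliseconds on tiny inputs. That's the asymmetry: a counterexample only needs one witness, a proof needs all of them.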
Author here. I picked this problem because it was assigned to me in Jira. Energy usage attribution logic for smart EV charging incentives in Python/Django. Pure arithmetic, clear postconditions. Best possible case for formal verification. If it couldn't prove its value here, it couldn't prove it anywhere.
So I built a Claude Code plugin called Crosscheck that uses Dafny (backed by Z3). Five proof obligations, all discharged on the first attempt. The math was trivially correct.
Then I tried to integrate it and everything fell apart. Four bugs, none expressible in any verification language. Decimal precision: the function computed to 6dp, the Django model stored 3dp. Enum coercion: `session.type.value` returned `int 0` instead of `"SMART"`. A test factory that didn't set `transaction_period`, so the function silently returned early. Test passed, code did nothing. And a custom TestCase base class that blocked Decimal comparisons entirely. Two were mismatches between components that individually worked fine. Two were test theatre: the exact failure mode I'd built verification to escape.
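Of the four, the enum coercion is the easiest to reconstruct in miniature. Names here are hypothetical and the real Django model isn't shown; this is just the shape of the mismatch:

```python
from enum import Enum

# Hypothetical reconstruction: the model defines the session type as an
# integer-valued enum, while the attribution code compares .value against
# the string name.
class SessionType(Enum):
    SMART = 0
    STANDARD = 1

smart = SessionType.SMART
print(smart.value)             # 0, not "SMART"
print(smart.value == "SMART")  # False, so the SMART branch never executes
```

Nothing in a Dafny postcondition can see this: both sides are individually consistent, and the type error only exists at the boundary between them.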
The postconditions turned out to be useful in a way I didn't expect: as property test oracles at the integration boundary. A Hypothesis test from `period1 + period2 == total` catches the silent skip immediately and bypasses the broken assertEqual. The spec bridges the gap: proven in Dafny, enforced in Python.
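A minimal sketch of what that looks like, assuming a stand-in attribution function (`attribute_energy` is illustrative, not the real code; the property is the `period1 + period2 == total` postcondition from above):

```python
from decimal import Decimal
from hypothesis import given, strategies as st

# Hypothetical stand-in for the real attribution function: split `total`
# across two periods, quantizing the first share to six decimal places.
def attribute_energy(total: Decimal, split: Decimal) -> tuple[Decimal, Decimal]:
    period1 = (total * split).quantize(Decimal("0.000001"))
    period2 = total - period1
    return period1, period2

# The Dafny postcondition, reused as a Hypothesis oracle at the boundary.
@given(
    total=st.decimals(min_value=0, max_value=1000, places=3),
    split=st.decimals(min_value=0, max_value=1, places=3),
)
def test_periods_sum_to_total(total, split):
    period1, period2 = attribute_energy(total, split)
    assert period1 + period2 == total

test_periods_sum_to_total()
```

Because the oracle is a plain `assert` on Decimals, it also sidesteps the custom `assertEqual` that was blocking Decimal comparisons.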
I also ran static analysis across 14 open-source codebases (2.5M lines). Pure, verifiable functions: 22-27% of code, remarkably stable across Python and TypeScript. Kleppmann predicted (https://martin.kleppmann.com/2025/12/08/ai-formal-verificati...) that AI will make formal verification go mainstream. I think he's right about the provers, but the mainstream ceiling is a quarter of the codebase.
I'm not surprised. I used the AI DJ on two separate occasions, and it played me the same songs in the same order... Suffice it to say, I haven't used it since.
That’s just not comparable. I’ll use your figure: if a process uses 400KB of RAM, using the remaining 240KB for something else doesn’t degrade that process’s performance (assuming nothing tries to use more than the available RAM). Each unit of RAM is independent, no?
Every token of context you use causes a drop in LLM performance.
More RAM == more processes can be supported with no degradation
More context != more stuff can be done with no degradation
> MARS also contains an AI-supported software brain called MindShare which not only replaces the missing pilot, but is also capable of coordinating entire mission groups by being distributed across many manned and uncrewed platforms.
ACL2 sounds interesting. Do you use it?