Post author here btw, happy to take questions, whether they're about Hegel in particular, property-based testing in general, or some variant on "WTF do you mean you wrote rust bindings to a python library?"
One thing I'm curious about, which I couldn't figure out from a skim of your post, is whether the generated test inputs are random, sequential, or adversarial.
IIRC there are fuzz testers that will analyze the branches of the code to look for edge cases that might break it -- that seems like something that would be wonderful to have in a property tester, but it also seems very difficult to do, especially in a language agnostic way.
How long does it take to find breaking cases like "0/0" or "ß"? Do they pop up immediately, or does it only happen after hundreds or thousands of runs?
They're random, but with a lot of tweaks to the distribution that make weird edge cases pop up with fairly high probability, and with some degree of internal mutation, followed by shrinking to turn failures into nice tidy test cases. In Python we do a little bit of code analysis to find interesting constants, but Hegel doesn't do that; it's just tuned to common edge cases.
I think all the examples I had in the post are typically found in the first 100 test cases and reliably found in the first 1000, but I wouldn't swear that that's the case without double checking.
We don't do any coverage guidance in Hegel or Hypothesis, because for unit-testing-style workflows it's rarely worth it - it's very hard to do good coverage guidance in under like... 10k test runs at a minimum, and 100k is more likely. You don't have enough time to get really good at exploring the state space, and you haven't hit the point where pure random testing has exhausted itself enough that you have to do something smarter to win.
It's been a long-standing desire of mine to figure out a way to use coverage to do better even on short runs, and there are some kinda neat things you can do with it, but we've not found anything really compelling.
You mention in the post that there are design differences between Hegel/Hypothesis and QuickCheck, partly due to attitude differences between Python/non-Haskell programmers and Haskell programmers. As someone coming from the Haskell world (though by no means considering Haskell a perfect language), could you expand on what kinds of differences these are?
So I think a short list of the big API differences is something like:
* Hypothesis/Hegel are very much focused on using test assertions rather than a single property that can be true or false. This naturally drives a style that is much more like "normal" testing, but also has the advantage that you can distinguish between different types of failing test. We don't go too hard on this, but both Hegel and Hypothesis will report multiple distinct failures if your test can fail in multiple ways.
* Hegelothesis's data generation, and how it interacts with testing, is much more flexible and basically fully imperative. You can generate whatever data you like, wherever in your test you like, freely interleaving data generation and test execution.
* QuickCheck is very much type-first, with explicit generators as an afterthought. I think this is mostly a mistake even in Haskell, but in languages where "just wrap your thing in a newtype and define a custom implementation for it" will get you a "did you just tell me to go fuck myself?" response, it's a nonstarter. Hegel is generator-first: you can get the default generator for a type if you want, but it's mostly a convenience function, with the assumption that you're going to want a real generator specification at some point soon.
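To make the "fully imperative" point concrete, here's roughly what interleaved generation looks like in Hypothesis (I'm only sketching the Hypothesis API; the test itself is a made-up example, and Hegel presumably exposes something equivalent):

```python
from hypothesis import given, strategies as st

@given(st.data())
def test_interleaved_draws(data):
    # Generate some data...
    xs = data.draw(st.lists(st.integers(), min_size=1))
    # ...run part of the code under test...
    biggest = max(xs)
    # ...then generate more data that depends on what happened so far.
    i = data.draw(st.integers(0, len(xs) - 1))
    assert xs[i] <= biggest
```

There's no separate "generator phase": every `draw` happens mid-test, which is what lets generation depend on results you've already computed.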
From an implementation point of view, and what enables the big conveniences, Hypothesis has a uniform underlying representation of test cases and does all its operations on them. This means you get:
* Test caching (if you rerun a failing test, it will immediately fail in the same way with the previously shrunk example)
* Validity guarantees on shrinking (your shrunk test case will always be one your generators could have produced. It's a huge footgun in QuickCheck that you can shrink to an invalid test case)
* Automatically improving the quality of your generators, never having to write your own shrinkers, and a whole bunch of other quality-of-life improvements that the universal representation lets us implement once, so users don't have to care about them.
The validity thing in particular is a huge pain point for a lot of PBT users, and it drove a lot of the design of the core Hypothesis model, to make sure that this problem could never happen.
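A small illustration of that validity guarantee (again in Hypothesis syntax, with a made-up test): because shrinking replays the generator on a smaller underlying choice sequence rather than mutating the value directly, any invariant the generator establishes also holds for shrunk counterexamples.

```python
from hypothesis import given, strategies as st

# A generator with an invariant: non-empty, sorted lists.
sorted_lists = st.lists(st.integers(), min_size=1).map(sorted)

@given(sorted_lists)
def test_first_element_is_minimum(xs):
    # If this ever failed, the shrunk counterexample would still be
    # something sorted_lists could produce: non-empty and sorted.
    # A hand-written QuickCheck shrinker could hand you an empty or
    # unsorted list instead, breaking the test's assumptions.
    assert xs[0] == min(xs)
```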
The test caching exists because I personally hated rerunning tests and not knowing whether it was just a coincidence that they were passing this time, or whether the test case had changed.
TBH reading the first few words of that section I was definitely expecting it to continue "so we used Claude to rewrite Hypothesis in Rust..." so that was quite a surprise!
It's on the agenda! We definitely want to rewrite the Hegel core server in rust, but not as much as we wanted to get it working well first.
My personal hope is that we can port most of the Hypothesis test suite to hegel-rust, then point Claude at all the relevant code and tell it to write us a hegel-core in rust with that as its test harness. Liam thinks this isn't going to work, I think it's like... 90% likely to get us close enough to working that we can carry it over the finish line. It's not a small project though. There are a lot of fiddly bits in Hypothesis, and the last time I tried to get Claude to port it to Rust the result was better than I expected but still not good enough to use.
To put it on the record: my position is that current models can't get us there, and neither can the next iteration of models, but in two model iterations this will be worth doing. There are a lot of fiddly details in Hypothesis that are critical to get right. You can get a plausible 80% port with agents today, but find they've structured it in a way that makes it impossible to get to 100%.
Not really a question. Just wanted to express my gratitude for Hypothesis. I use it regularly. A few years back, I had to build a semi-formally-verified fund and account management service, and used the state-based testing in Hypothesis to validate its correctness. Cannot express how invaluable this little framework has been.
A little while after that, I spoke to someone in the pharma-adjacent space who was looking at Antithesis to validate their product. At the time, Antithesis (the company) told him that it was a bad fit. I suggested something akin to my previous approach (which did not include Antithesis). No clue what they ended up doing, but it is nice to see that Hypothesis and Antithesis have finally joined forces.