"The model generated totally different output every time you ran it, even with fixed seeds." - I remember seeing code takedowns of the model from anti-lockdown people who repeatedly cite this issue.
But there is a valid reason for this to happen, and it doesn't mean bugs in the code. If the code is run in a distributed way (multiple threads, processes or machines), which it was, the order of execution is never guaranteed. So even setting the seed will produce a different set of results if the outcomes of each separate instance depend on each other further in the computation.
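To illustrate, here's a toy Python sketch (nothing to do with the real model's internals; `run_once` and the update rule are made up): every worker has a fixed seed, yet the final number still drifts between runs because partial results are folded into shared state in whatever order the scheduler happens to pick, and floating-point arithmetic is sensitive to that order.

```python
import random
import threading

def run_once(n_workers=8, steps=20_000):
    state = {"total": 1.0}
    lock = threading.Lock()

    def worker(worker_id):
        rng = random.Random(worker_id)  # deterministic per-worker stream
        for _ in range(steps):
            x = rng.random()
            with lock:
                # Each update depends on whatever value the previous update
                # (from any worker) left behind, so interleaving order matters.
                state["total"] *= 1.0 + (x - 0.5) * 1e-4

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["total"]

if __name__ == "__main__":
    # Same seeds each time, but the two prints typically disagree in the
    # low-order digits because the multiplication order differs.
    print(run_once())
    print(run_once())
```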
There are ways to mitigate this, depending on the situation and the amount of slowdown that's acceptable. Since this model was collecting outcomes to build a statistical distribution, rather than producing a single deterministic number, it didn't need to be bit-for-bit reproducible.
The fact that the model draws its parameters from distributions is also why different runs could produce vastly different results: individual runs land at different points of a distribution. It is distributions that are sampled and used, not single numbers.
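Roughly what I mean, as a made-up toy (the parameter distributions and the outcome formula here are invented, not the model's): each realisation draws its parameters from distributions, so individual runs can land far apart, and what you report is a summary of the distribution of outcomes.

```python
import random
import statistics

def one_realisation(rng):
    r0 = rng.lognormvariate(0.9, 0.3)            # invented parameter distribution
    severity = rng.betavariate(2, 50)            # invented parameter distribution
    return 1_000_000 * severity * min(r0, 3.0)   # toy outcome, not an epi formula

rng = random.Random()                            # seed deliberately varies per run
outcomes = sorted(one_realisation(rng) for _ in range(1_000))
print("median:", statistics.median(outcomes))
print("~90% interval:", outcomes[50], "to", outcomes[949])
```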
Regarding the GCSE level comment, my concern was the opposite, that the model was trying to model too much, and that inaccuracies would build up. No model is perfect (including this one) and the more assumptions made the larger the room for error. But they validated the model with some simpler models as a sanity check.
My view on the criticisms of the model was that they were more politically motivated, and that the code takedowns were done by people who may have been good coders, but didn't know enough about statistical modelling.
> "The model generated totally different output every time you ran it, even with fixed seeds." - I remember seeing code takedowns of the model from anti-lockdown people who repeatedly cite this issue.
> But there is a valid reason for this to happen, and it doesn't mean bugs in the code. If the code is run in a distributed way (multiple threads, processes or machines), which it was, the order of execution is never guaranteed.
Then there's literally no point to using PRNG seeding. The whole point of PRNG seeding is so you can define some model as a function, "model(inputs, state) -> output", and get the "same" output for the same input. I put "same" in quotes because defining sameness on FP hardware is challenging, but usually 0.001% relative tolerance is sufficiently "same" to account for FP implementation weirdness.
If you can't do that, then your model is not a pure function, in which case setting the seed is pointless at best, and biasing/false sense of security in the worst case.
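Here's a minimal sketch of the contract I mean (`model` is illustrative, not anything from the actual codebase): same inputs plus same seed gives the "same" output, compared with a small relative tolerance to allow for FP quirks across compilers and hardware.

```python
import math
import random

def model(inputs, seed):
    rng = random.Random(seed)
    return sum(x * rng.gauss(1.0, 0.05) for x in inputs)

inputs = [1.0, 2.5, 3.7]
a = model(inputs, seed=42)
b = model(inputs, seed=42)
# Within a single Python process these are bit-identical; across builds or
# hardware you'd allow a small relative tolerance (~0.001%).
assert math.isclose(a, b, rel_tol=1e-5)
```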
As you mention, non-pure models have their place, but reproducing their results is very challenging, and requires generating distributions with error bars - you essentially "zoom out" until you are a pure function again, with respect to aggregate statistics.
It does not sound like this model was "zoomed out" enough to provide adequate confidence intervals such that you could run the simulation, and statistically guarantee you'd get a result in-bounds.
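A sketch of the "zoomed out" contract, with a stand-in model function: individual runs are noisy and not reproducible, but the aggregate over many independently seeded runs is the thing you characterise and test against.

```python
import random
import statistics

def noisy_run(seed):
    rng = random.Random(seed)
    return 100.0 + rng.gauss(0.0, 10.0)    # pretend each run is this noisy

runs = [noisy_run(seed) for seed in range(200)]
mean = statistics.fmean(runs)
stderr = statistics.stdev(runs) / len(runs) ** 0.5
print(f"aggregate: {mean:.2f} (95% CI {mean - 1.96 * stderr:.2f} to {mean + 1.96 * stderr:.2f})")
# A regression test for a stochastic code would then check that a reference
# value falls inside an interval like this, not that any single run matches.
```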
I reckon the PRNG seeding in such a case might be used during development/testing.
So, run the code with a seed in a non-distributed way (e.g. in R, turn off all parallelism), and the results should then be the same in every run.
Once this test output is validated, then, depending on the nature of the model, it can be run in parallel; the guarantee of deterministic behaviour goes away, but that's ok.
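Something like this, with hypothetical function names: a serial, seeded mode that is exactly reproducible and can be pinned in a regression test, and a parallel mode where bit-for-bit determinism is deliberately given up.

```python
import random

def simulate_serial(seed):
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(100_000))

# Test mode: parallelism off, fixed seed, results pinned exactly.
assert simulate_serial(123) == simulate_serial(123)

# Production mode would then run the parallel version, whose individual
# outputs may differ run to run, and compare only aggregate statistics
# against the serial reference.
```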
I didn't develop the model, so can't really say anything in depth beyond the published materials.
I just found it odd at the time how this specific detail was incorrectly used by some to claim the model was broken/irredeemably buggy.
Edit: Actually, there's one other seed-related consideration, assuming you have used a seed in the first place. Depending on the distributed environment, there's no guarantee that the processes or random number draws will run in the same order, but in most runs they may well happen to run in roughly the same order. Combined with a fixed seed, that could bias the distribution of the samples you take. So you might want to change the seed on every run to protect yourself from such nasty phantom effects.
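In practice that just means something like the following, with `run_model` as a placeholder for the real entry point: draw a fresh, high-entropy seed for every run instead of reusing one fixed value, but record it so any particular run can still be revisited later.

```python
import secrets

seed = secrets.randbits(64)
print(f"using seed {seed}")       # or write it to the run's log/metadata
# results = run_model(seed=seed)  # hypothetical entry point
```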
My understanding is the bugginess is due to *unintended* nondeterminism, in other words, things like a race condition where two threads write some result to the same memory address, or singularities/epsilon error in floating point calculations leading to diverging results.
Make no bones about it, these are programming faults. There's no reason why distributed, long-running models can't produce convergent results with a high degree of determinism given the input state. But this takes some amount of care and attention.
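A toy Python illustration of that class of bug, nothing to do with the real model's code: an unguarded read-modify-write on shared state, which silently loses updates.

```python
import threading
import time

counter = 0

def bump(n):
    global counter
    for _ in range(n):
        current = counter        # read shared state
        time.sleep(0)            # stands in for "other work" happening here
        counter = current + 1    # write back, possibly clobbering another thread

threads = [threading.Thread(target=bump, args=(1_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # usually well below 4000: updates were silently lost

# The fix is ordinary engineering, not statistics: hold a lock around the
# update, or accumulate per-thread results and combine them once at the end.
```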
> So you might want to change the seed on every run to protect yourself from such nasty phantom effects.
That's a perfect example of what I mean where a seed is actually worse. If you know you can't control determinism, then you might as well go for the opposite: ensure your randomness is high quality enough that it approximates a perfect uniform distribution. Keeping a fixed seed here makes you less likely to capture the true distribution of the output.
The other takedown reviews focused on the fact that there was non-determinism despite a seed, without understanding that's not necessarily a problem.
Agreed on the second point about not having a seed, but I added the "assuming you have used a seed" caveat because sometimes people do use a seed for reproducible execution modes (even multi-thread/process ones), which are fine. In that case it's easier to randomly vary the seed on each run than to remove the seeding altogether when running in a non-deterministic mode.
"There are ways to mitigate this, but since the model was collecting outcomes to create a statistical distribution, rather than a single deterministic number, it didn't need to."
This is the justification the academics used - because we're modelling probability distributions, bugs don't matter. Sorry, but no, this is 100% wrong. Doing statistics is not a get-out-of-jail-free card for arbitrary levels of bugginess.
Firstly, the program wasn't generating a probability distribution as you claim. It produced a single set of numbers on each run. To the extent the team generated confidence intervals at all (which for Report 9 I don't think they did), it was by running the app several times and then claiming the variance in the results represented the underlying uncertainty of the data, when in reality it was representing their inability to write code properly.
Secondly, remember that this model was being used to drive policy. How many hospitals shall we build? If you run the model and it says 10, and then someone makes the graph formatting more helpful, reruns it and now it says 4, that's a massive real-world difference. Nobody outside of academia thinks it's acceptable to just shrug and say, well, it's just probability, so it's OK for the answers to thrash around wildly like that.
Thirdly, such bugs make unit testing of your code impossible. You can't prove the correctness of a sub-calculation because it's incorporating kernel scheduling decisions into the output. Sure enough, Ferguson's model had no functioning tests. If it had, they might at least have been able to detect the non-threading-related bugs.
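For what it's worth, a functioning test of a sub-calculation looks something like this (`attack_rate` is a made-up helper, not a function from the real codebase): pure inputs in, an exact or tightly toleranced value out, no threads anywhere.

```python
import math

def attack_rate(susceptible, infected, beta):
    """Fraction of susceptibles infected in one step of a toy SIR-style update."""
    return 1.0 - math.exp(-beta * infected / (susceptible + infected))

def test_attack_rate():
    assert math.isclose(attack_rate(990, 10, 0.3), 1.0 - math.exp(-0.003))
```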
Finally, this "justification" breeds a culture of irresponsibility and it's exactly that endemic culture that's destroying people's confidence in science. You can easily write mathematical software that's correctly reproducible. They weren't able to do it due to a lack of care and competence. Once someone gave them this wafer thin intellectual sounding argument for why scientific reproducibility doesn't matter they started blowing off all types of bugs with your argument, including bugs like out of bounds array reads. This culture is widespread - I've talked to other programmers who worked in epidemiology and they told me about things like pointers being accidentally used in place of dereferenced values in calculations. That model had been used to support hundreds of papers. When the bugs were pointed out, the researchers lied and claimed that in only 20 minutes they'd checked all the results and the bugs had no impact on any of them.
Once a team goes down the route of "our bugs are just CIs on a probability distribution" they have lost the plot and their work deserves to be classed as dangerous misinformation.
"My view on criticisms of the model were that it was more politically motivated"
Are you an academic? Because that's exactly the take they like to always use - any criticism of academia is "political" or "ideological". But expecting academics to produce work that isn't filled with bugs isn't politically motivated. It's basic stuff. For as long as people defend this obvious incompetence, people's trust in science will correctly continue to plummet.
If you check the commit history, you'll see that he quite obviously didn't work with the code much at all. Regardless, if he thinks the model is not worthless, he's wrong. Anyone who reviews their bug tracker can see that immediately.
> To the extent the team generated confidence intervals at all (which for Report 9 I don't think they did), it was by running the app several times and then claiming the variance in the results represented the underlying uncertainty of the data, when in reality it was representing their inability to write code properly.
Functionally, what's the difference? The output of their model varied based on environmental factors (how the OS chose to schedule things). The lower-order bits of some of the values got corrupted, due to floating-point errors. In essence, their model had noise, bias, and lower precision than a floating point number – all things that scientists are used to.
Scientists are used to some level of unavoidable noise from experiments done on the natural world because the natural world is not fully controllable. Thus they are expected to work hard to minimize the uncertainty in their measurements, then characterize what's left and take that into account in their calculations.
They are not expected to make beginner level mistakes when solving simple mathematical equations. Avoidable errors introduced by doing their maths wrong is fundamentally different to unavoidable measurement uncertainty. The whole point of doing simulations in silico is to avoid the problems of the natural world and give you a fully controllable and precisely measurable environment, in which you can re-run the simulation whilst altering only a single variable. That's the justification for creating these sorts of models in the first place!
Perhaps you think the errors were small. The errors in their model due to their bugs were of the same order of magnitude as the predictions themselves. They knew this but presented the outputs to the government as "science" anyway, then systematically attacked the character and motives of anyone who pointed out they were making mistakes. Every single member of that team should have been fired years ago, yet instead what happened is the attitude you're displaying here: a widespread argument that scientists shouldn't be or can't be held to the quality standards we expect of a $10 video game.
How can anyone trust the output of "science" when this attitude is so widespread? We wouldn't accept this kind of argument from people in any other field.
At the time, critics of the model were claiming the model was buggy because multiple runs would produce different results. My comment above explains why that is not evidence for the model being buggy.
Report 9 talks about parameters being modelled as probability distributions, i.e. it's a stochastic model. I doubt they would draw conclusions from a single run, since a single run of the code is effectively a single sample from a probability distribution. And, if you look at the paper describing the original model (cited in Report 9), they do test the model with multiple runs. On top of that they perform sensitivity analyses to check that erroneous assumptions aren't driving the model.
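The kind of sensitivity analysis meant there looks roughly like this, with a stand-in model function (not the ICL code): sweep one input parameter over a plausible range and see how much the headline output moves relative to the run-to-run noise.

```python
import random
import statistics

def model_output(r0, seed):
    rng = random.Random(seed)
    return 50_000 * r0 ** 2 * rng.uniform(0.9, 1.1)   # toy response, invented

for r0 in (2.0, 2.4, 2.8):
    runs = [model_output(r0, seed) for seed in range(50)]
    print(f"R0={r0}: mean={statistics.fmean(runs):,.0f} sd={statistics.stdev(runs):,.0f}")
```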
I have spent time in academia, but I'm not an academic, and don't feel any obligation to fly the flag for academia.
Regarding the politics, contrast how the people who forensically examined Ferguson's papers were so ready to accept the competing (and clearly incorrect https://www.youtube.com/watch?v=DKh6kJ-RSMI) results from Sunetra Gupta's group.
Fair point about academic code being messy. It's a big issue, but the incentives are not there at the moment to write quality code. I assume you're a programmer - if you wanted to be the change you want to see, you could join an academic group, reduce your salary by 3x-4x, and be in a place where what you do is not a priority.
Your comment above is wrong. Sorry, let me try to explain again. Let's put the whole fact that random bugs != stochastic modelling to one side. I don't quite understand why this is so hard to understand, but let's shelve it for a moment.
ICL likes to claim their model is stochastic. Unfortunately that's just one of many things they said that turned out to be untrue.
The Ferguson model isn't stochastic. They claim it is because they don't understand modelling or programming. It's actually an ordinary agent-like simulation of the type you'd find in any city builder video game, and thus each time you run it you get exactly one set of outputs, not a probability distribution. They think it's "stochastic" because you can specify different PRNG seeds on the command line.
If they ran it many times with different PRNG seeds, then this would at least quantify the effect of randomness on their simulation. But, they never did. How do we know this? Several pieces of evidence:
2. The program is so slow that it takes a day to do even a single run of the scenarios in Report 9. To determine CIs for something like this you'd want hundreds of runs at least. You could try and do them all in parallel on a large compute cluster, however, ICL never did that. As far as I understand their original program only ran on a single Windows box they had in their lab - it wasn't really portable and indeed its results change even in single-threaded mode between machines, due to compiler optimizations changing the output depending on whether AVX is available.
3. The "code check" document that falsely claims the model is replicable, states explicitly that "These results are the average of NR=10 runs, rather than just one simulation as used in Report 9."
So, their own collaborators confirmed that they never ran it more than once, and each run produces exactly one line on a graph. Therefore even if you accept the entirely ridiculous argument that it's OK to produce corrupted output if you take the average of multiple runs (it isn't!), they didn't do it anyway.
Finally, as one of the people who forensically examined Ferguson's work, I never accepted Gupta's either (not that this is in any way relevant). She did at least present CIs, but they were so wide they boiled down to "we don't know", which seems to be a common failure mode in epidemiology - CIs are presented without being interpreted, such that you can get values like "42% (95% CI 6%-87%)" appearing in papers.
I took a look at point 3, and that extract from the code check is correct. Assuming they did only one realisation, I was curious why, since it would be unlikely to be an oversight. They give this justification:
"Numbers of realisations & computational resources:
It is essential to undertake sufficient realisation to ensure ensemble behaviour of a stochastic is
well characterised for any one set of parameter values. For our past work which examined
extinction probabilities, this necessitates very large numbers of model realizations being
generated. In the current work, only the timing of the initial introduction of virus into a country is
potentially highly variable – once case incidence reaches a few hundred cases per day, dynamics
are much closer to deterministic."
So it looks like they did consider the issue, and the number of realisations needed depends on the variable of interest in the model. The code check appears to back their justification up: "Small variations (mostly under 5%) in the numbers were observed between Report 9 and our runs."
The code check shows in its data tables that some variations were 10% or even 25% from the values in Report 9. These are not "small variations", nor would it matter even if they were, because it is not OK to present bugs as unimportant measurement noise.
The team's claim that you only need to run it once because the variability was well characterized in the past is also nonsense. They were constantly changing the model. Even if they thought they understood the variance in the output in the past (which they didn't), it was invalidated the moment they changed the model to reflect new data and ideas.
Look, you're trying to justify this without seeming to realize that this is Hacker News. It's a site read mostly by programmers. This team demanded and got incredibly destructive policies on the back of this model, which is garbage. It's the sort of code quality that got Toyota found guilty in court of severe negligence. The fact that academics apparently struggle to understand how serious this is, is by far a faster and better creator of anti-science narratives than anything any blogger could ever write.
I looked at the code check. The one 25% difference is in an intermediate variable (peak beds). The two differences of 10% are 39k deaths vs 43k deaths, and 100k deaths vs 110k deaths. The other differences are less than 5%. I can see why the author of the code check would reach the conclusion he did.
I have given a possible explanation for the variation, that doesn't require buggy code, in my previous comments.
An alternative hypothesis is that it's bug-driven, but very competent people (including eminent programmers like John Carmack) seem to have vouched for it on that front. I'd say this puts a high burden of proof on detractors.
Here's John Carmack's take on the model; he saw and cleaned up the original version before it was released. https://mobile.twitter.com/id_aa_carmack/status/125487236876...