
That's one take. However, in this case, the problem is even more basic - and ridiculous. By all accounts, every system that received this update died with a BSOD. How is that possible?

It means that the update was not tested, not even once. It certainly was not tested in multiple environments, with multiple configurations, as must be standard for kernel-level software.

This isn't due to "leet code". This is a fundamental process failure. It should not be possible to push out an untested update.

How was it possible? Will they ever have the guts to explain? Probably not...



There are many, many possibilities for what could have happened that need to be investigated, some of which would be surprisingly hard to test for.

1. Did the CDN have a failing disk? A full null read sometimes happens with failing drives.

2. Did the disk holding the update fail just before the CDN upload? (I.e. did the deployment script successfully upload a failed read?)

3. Did the CDN upload fail, but a different process thought it succeeded, activating distribution?

4. Did their updates conform to an internal standard which is later serialized or minified for public distribution, and something broke in the serializer or compression tool?

It is completely possible that the update was tested, good to go, and there was a distribution failure.
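To make the distribution-failure scenarios concrete, here is a minimal sketch of the kind of check that would catch a bad upload before activation: read the file back from the CDN and compare it against the build that was actually tested. The function names and the idea of a read-back step are my assumptions for illustration, not a description of any real deployment pipeline.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def safe_to_distribute(tested: bytes, fetched_from_cdn: bytes) -> bool:
    """Hypothetical gate: only activate distribution if the bytes read back
    from the CDN match the bytes that went through testing."""
    # An empty or all-null read-back is rejected outright (failing disk, failed upload).
    if not fetched_from_cdn or fetched_from_cdn.count(0) == len(fetched_from_cdn):
        return False
    return sha256_hex(tested) == sha256_hex(fetched_from_cdn)
```

A check like this would cover cases 1 to 3 only if the read-back goes through the same path customers download from; it would not help with case 4 unless the serialized output is what gets tested.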

However, I do know what CrowdStrike could have done, and should do in the future:

- Staged rollouts; even a 15-minute gap between stages would help

- More testing (duh)

- Perhaps most importantly, using a fuzzer to harden the kernel driver so that no possible input can crash it; on a bad update it should continue booting and warn userspace of the failure
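As a rough illustration of that last point, here is a toy parser that rejects bad input instead of crashing, plus a naive random fuzz loop around it. The file format (a `UPD1` magic, a 4-byte little-endian length, then the payload) is entirely made up for this sketch, and a real kernel driver would be written in C and fuzzed with a coverage-guided tool rather than plain random bytes.

```python
import random
import struct

MAGIC = b"UPD1"  # hypothetical magic bytes, not a real update format


def parse_update(blob: bytes):
    """Return (payload, None) on success or (None, reason) on failure.

    The point is that no input raises: every failure is reported to the
    caller, which can skip the update and keep going.
    """
    if len(blob) < 8:
        return None, "too short"
    if blob[:4] != MAGIC:
        return None, "bad magic"
    (length,) = struct.unpack_from("<I", blob, 4)
    if length != len(blob) - 8:
        return None, "length mismatch"
    return blob[8:], None


def fuzz(iterations: int = 10_000, seed: int = 0) -> int:
    """Throw random byte strings at the parser; any exception is a bug."""
    rng = random.Random(seed)
    for _ in range(iterations):
        blob = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
        payload, err = parse_update(blob)
        # Exactly one of payload / err must be set.
        assert (payload is None) != (err is None)
    return iterations
```

With this layout, a file full of null bytes fails the magic check and is reported as an error instead of taking the machine down.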


It wasn’t corrupted. It was bad code. Bad code which clearly wasn’t tested.


Really? As of yesterday we knew, for a fact, that the “update” was a file full of nothing but null characters, unless there has been some update on that.

https://news.ycombinator.com/item?id=41009740

That leaves the possibility of a disk failure or CDN failure quite open. Bad code would at least show something, even if it can’t be executed.


The official company statement says that null characters were not the cause of the problem.



