The key point not discussed enough is that outages happen as the code is changing. If you stop deploying new changes, the big FAANGs basically won't go down. Obviously they're so complex that's hard to do in practice, but slowing the rate of feature development will slow the rate of failure. And it's probably not a linear relationship.
Large platforms have bugs in production, right now. (Because bugs are inevitable.)
And they also have dedicated antagonists who are looking for vulnerabilities to exploit for intelligence or money-making purposes.
So, code is changing not just because of new feature development. It’s also changing as bugs are found and squashed. If you freeze code, you also freeze whatever bugs are there now… giving adversaries longer and longer to find and exploit them.
Capacity or hardware failure is not the only reason a platform could go down. And, maybe more importantly, general uptime is not the only metric of success for a large platform. They also need to keep user data secure, process transactions promptly and correctly, maintain accurate records required for compliance, etc.
Outages happen because of code changes, sure. But they also happen because of hardware changes, some of which you can guard against and some of which you can't, really. They also happen because of changes in usage that expose issues that were already there but nobody noticed.
They can also happen when resource leaks add up because cleanup that used to happen routinely stops happening. A team that pushes weekly, restarting servers each time, may never realize it has a (memory, file-handle, whatever) leak that would kill the servers in 6 weeks; they push often enough that it's never noticed, and testing for slow leaks is hard.
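To make that concrete, here's a minimal sketch of the arithmetic. All the numbers (per-request leak, traffic, memory budget) are hypothetical assumptions, not measurements; the point is only that a slow leak becomes an outage when the gap between restarts exceeds the time it takes to exhaust the budget.

```python
# Hypothetical numbers: a tiny per-request leak, steady traffic, and a fixed
# memory budget before the process gets OOM-killed. Deploys restart the server,
# which resets the leaked memory to zero.

LEAK_PER_REQUEST_MB = 0.0002     # assumed: ~200 bytes leaked per request
REQUESTS_PER_DAY = 500_000       # assumed steady traffic
MEMORY_BUDGET_MB = 4_096         # assumed headroom before the process is killed

def days_until_oom() -> float:
    """Days of continuous uptime before the leak alone exhausts the budget."""
    return MEMORY_BUDGET_MB / (LEAK_PER_REQUEST_MB * REQUESTS_PER_DAY)

def leak_visible(deploy_interval_days: float) -> bool:
    """True if servers hit the budget before the next deploy restarts them."""
    return deploy_interval_days >= days_until_oom()

if __name__ == "__main__":
    print(f"Leak kills a server after ~{days_until_oom():.0f} days of uptime")
    for interval in (7, 14, 42):  # weekly pushes vs. a longer freeze
        status = "visible (outage)" if leak_visible(interval) else "hidden by restart"
        print(f"Deploying every {interval:2d} days: leak is {status}")
```

With these made-up numbers, weekly or biweekly restarts keep the process well inside its budget, so the leak is invisible; a freeze that stretches uptime past the threshold surfaces a failure mode nobody ever tested for.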