The key point not discussed enough is that outages happen as the code is changing. If you stop deploying new changes, the big FAANGs basically won't go down. Obviously they're so complex that's hard to do in practice, but slowing the rate of feature development will slow the rate of failure. And it's probably not a linear relationship.
Large platforms have bugs in production, right now. (Because bugs are inevitable.)
And they also have dedicated antagonists who are looking for vulnerabilities to exploit for intelligence or money-making purposes.
So, code is changing not just because of new feature development. It’s also changing as bugs are found and squashed. If you freeze code, you also freeze whatever bugs are there now… giving adversaries longer and longer to find and exploit them.
Capacity or hardware failure is not the only reason a platform could go down. And, maybe more importantly, general uptime is not the only metric of success for a large platform. They also need to keep user data secure, process transactions promptly and correctly, maintain accurate records required for compliance, etc.
Outages happen because of code changes, sure. But they also happen because of hardware changes, some of which you can guard against and some of which you can't, really. They also happen because of changes in usage that expose issues that were already there but nobody noticed.
They can also happen when resource leaks add up because cleanup that used to happen routinely stops happening. A team that pushes weekly, restarting servers each time, may never realize it has a (memory, file-handle, whatever) leak that would kill the servers in 6 weeks; they push often enough that it's never noticed, and testing for slow leaks is hard.
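To make that concrete, here's a minimal sketch of the arithmetic. All the numbers (per-request leak, traffic, memory budget) are hypothetical assumptions, not measurements; the point is only that a slow leak becomes an outage when the gap between restarts exceeds the time it takes to exhaust the budget.

```python
# Hypothetical numbers: a tiny per-request leak, steady traffic, and a fixed
# memory budget before the process gets OOM-killed. Deploys restart the server,
# which resets the leaked memory to zero.

LEAK_PER_REQUEST_MB = 0.0002     # assumed: ~200 bytes leaked per request
REQUESTS_PER_DAY = 500_000       # assumed steady traffic
MEMORY_BUDGET_MB = 4_096         # assumed headroom before the process is killed

def days_until_oom() -> float:
    """Days of continuous uptime before the leak alone exhausts the budget."""
    return MEMORY_BUDGET_MB / (LEAK_PER_REQUEST_MB * REQUESTS_PER_DAY)

def leak_visible(deploy_interval_days: float) -> bool:
    """True if servers hit the budget before the next deploy restarts them."""
    return deploy_interval_days >= days_until_oom()

if __name__ == "__main__":
    print(f"Leak kills a server after ~{days_until_oom():.0f} days of uptime")
    for interval in (7, 14, 42):  # weekly pushes vs. a longer freeze
        status = "visible (outage)" if leak_visible(interval) else "hidden by restart"
        print(f"Deploying every {interval:2d} days: leak is {status}")
```

With these made-up numbers, weekly or biweekly restarts keep the process well inside its budget, so the leak is invisible; a freeze that stretches uptime past the threshold surfaces a failure mode nobody ever tested for.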