Ops engineer here. This is a particularly hard case because the problem involved an interaction of components across the network, and was scale-dependent. These kinds of problems are truly "emergent" in that they're enormously hard to test for. Absent an exact copy of production, with the same workload, I/O characteristics, network latencies, etc., there is always some class of scale- or performance-related bugs you just won't catch until the code hits production.
One defense is a "canary" deployment process (they used the term "flighting") to ensure major changes are rolled out slowly enough to detect major performance shifts. Had their deployment process worked correctly, they might have been able to roll back the change without incident.
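As a sketch of the canary idea: the core of any flighting system is a gate that compares the canary cohort's health against the baseline and triggers rollback when it degrades. This is a hypothetical, minimal version (error rate only; a real system would also watch latency percentiles, saturation, and so on):

```python
def should_roll_back(baseline_errors, baseline_total,
                     canary_errors, canary_total,
                     max_ratio=2.0, min_requests=100):
    """Crude canary gate: recommend rollback if the canary's error
    rate exceeds `max_ratio` times the baseline's.

    `min_requests` guards against judging the canary on a tiny,
    noisy sample before it has seen real traffic.
    """
    if canary_total < min_requests:
        return False  # not enough canary traffic to judge yet
    baseline_rate = baseline_errors / max(baseline_total, 1)
    canary_rate = canary_errors / canary_total
    if baseline_rate == 0:
        # Zero-error baseline: any canary errors are suspicious.
        return canary_rate > 0
    return canary_rate > max_ratio * baseline_rate
```

The point is that the decision is automated and continuous: the canary only graduates to the full fleet if the gate stays green for the whole ramp-up.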
A second defense is proactively building "safeties" and "blowoff valves" into your software. Example: if a client notices a huge spike in errors, back off before retrying a connection request, otherwise you may put the system into a positive feedback loop. Ethernet collision detection/avoidance is a great example of a safety mechanism done well.
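The backoff "safety" above can be sketched in a few lines. This is an illustrative capped-exponential-backoff-with-jitter wrapper (the function names are mine, not from any particular library); the jitter is what de-synchronizes a fleet of retrying clients, much like Ethernet's randomized backoff after a collision:

```python
import random
import time

def call_with_backoff(request, max_retries=5, base_delay=0.1, cap=10.0):
    """Retry `request` with capped exponential backoff plus full jitter.

    Without the backoff, every client retries immediately and in
    lockstep, amplifying the very overload that caused the errors:
    the positive feedback loop described above.
    """
    for attempt in range(max_retries):
        try:
            return request()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the failure
            # Sleep a random amount up to the capped exponential
            # delay so retrying clients spread out in time.
            delay = min(cap, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

A client using this degrades gracefully under server stress instead of contributing to it.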
Finally, every high-scale domain has its own problems, which experienced engineers know to worry about. In my case, at an analytics provider, one of the hardest problems we face is data retention: how much to store, at what granularity, for how long, and how that interacts with our various plan tiers. OTOH we have significant latitude to be "eventually correct" or "eventually consistent" in a way a bank, stock exchange, or other transactional financial system (e.g. credit approval) can't be. I imagine other domains like ad serving, video serving, and game backend development have similar "gotchas", but I don't know what they are.