Hacker News

Does anyone else see the missing piece to this post mortem? An infinite loop made its way onto a majority(? all?) of production servers, and the immediate response is more or less 'we shouldn't have deployed to as many customers, failure should have only happened to a small subset'?

I agree that the improvements made to their deployment tooling are good and necessary; they take the human temptation to skip steps out of the equation.

But this exemplifies a major problem our industry suffers from, in that it is just taken as a given that critical errors will sometimes make their way into production servers and the best we can do is reduce the impact.

I find this absolutely unacceptable. How about we short circuit the process and identify ways to stop that from happening? Were there enough code reviews? Did automated testing fail here? Yes, I'm familiar with the halting problem and the limitations of formal verification on Turing-complete languages, but I don't believe that's an excuse.

This is tantamount to saying "yeah sometimes our airplanes crash, so from now on we'll just make sure we have less passengers ride in the newer models".



> An infinite loop made its way onto a majority(? all?) of production servers, and the immediate response is more or less 'we shouldn't have deployed to as many customers, failure should have only happened to a small subset'?

All server software has one or more "infinite loops." The accept loop is a fundamental construct in any listener.

Plus, when they say infinite loop, I assumed they meant the process continuously entered a crash/restart cycle rather than hitting a literal while(true) {} somewhere in the code.
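The distinction matters: a listener's accept loop is a deliberate infinite loop that blocks in the kernel, which is very different from a buggy busy loop. A minimal toy sketch in Python (a hypothetical echo server, not anything from the Azure incident):

```python
import socket

def handle(conn):
    """Echo one request back to the client."""
    data = conn.recv(1024)
    conn.sendall(data)

def serve(port):
    """The intentional "infinite loop" present in essentially every listener."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", port))
    srv.listen()
    while True:                  # deliberate infinite loop
        conn, _ = srv.accept()   # blocks in the kernel; burns no CPU
        with conn:
            handle(conn)
```

The `while True` here is by design; the pathological case is a loop that spins without ever blocking or making progress.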

I think the reality on the ground is that bug-free software is a myth. All you can do is have processes (like gradual deployment) to mitigate the damage it can do, rather than making it your goal to write the mythical perfect code.

> But this exemplifies a major problem our industry suffers from, in that it is just taken as a given that critical errors will sometimes make their way into production servers and the best we can do is reduce the impact. I find this absolutely unacceptable

It is a major problem. It costs billions every year. But what can be done? If there was a magic wand solution I'm sure people would be scrambling to deploy it as it saves them money.

> How about we short circuit the process and identify ways to stop that from happening? Were there enough code reviews? Did automated testing fail here?

This seems like a somewhat naive view of software development in general. Like what I'd call a "mathematician's view," in the sense that they think large complex systems can be reduced to a simple quantifiable process.

Code reviews and, more importantly, unit tests can help find bugs. But the interconnectivity between large, complex systems is harder to test against, and harder to code review (because the bugs don't exist on any single line of code, or even in any single block).


> This is tantamount to saying "yeah sometimes our airplanes crash, so from now on we'll just make sure we have less passengers ride in the newer models".

Which would be a pretty reasonable thing to say if you had a large portion of the population on a single plane. What this all comes down to is: problems (small or big; stupidly simple or ridiculously complex) will happen. Isolating problems to the smallest number of people possible is the responsible course of action.

That, of course, isn't mutually exclusive with doing better from a software engineering front.


> This is tantamount to saying "yeah sometimes our airplanes crash, so from now on we'll just make sure we have less passengers ride in the newer models".

Things break. You cannot design something to be perfect. That is why there is redundancy in every critical system. You are better off having your system gracefully recover from failure than trying to design the system perfectly. That might mean spreading the traffic across multiple instances within one data center, so when one breaks (for any reason) the others pick up the slack. The next level is spreading across multiple data centers, so if one location goes down, another picks up the slack. Arguably the next level would be going across multiple providers, but to me that seems like overkill.

Given that, if Microsoft rolls it out to 5% of servers and those crap out, that is roughly equivalent, for individual customers who are properly spread over multiple instances, to a spate of hard drive failures. This only breaks down when they roll the broken stuff out to 100% at once.
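The arithmetic behind that claim can be sketched quickly (toy numbers, and it assumes a bad build lands independently on each of a customer's replicas):

```python
# Back-of-the-envelope blast-radius estimate: if a bad build takes out a
# fraction `p` of servers, a customer spread over `k` independent
# instances only goes fully dark if all k replicas land on bad servers.
def p_total_outage(p, k):
    return p ** k

# 5% staged rollout, customer on 3 instances: ~1 customer in 8000 fully down.
staged = p_total_outage(0.05, 3)   # 0.000125
# 100% rollout: every customer is fully down, regardless of replica count.
full = p_total_outage(1.00, 3)     # 1.0
```

The independence assumption is optimistic (placement is rarely random), but it shows why a 5% rollout looks like ordinary hardware attrition while a 100% rollout is a global outage.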


I see this type of comment across a number of fields - why isn't X safer?

Generally it comes down to the 'insurance' argument: why didn't we spend the time (read: money) to test for and prevent X?

The answer comes down to risk/benefit.

It's possible to insure your house against total loss, against any type of threat. You could build it on the shoreline of a known hurricane location, or on top of an active volcano, and insure it for full replacement. All you have to do is deposit an amount equal to the replacement cost in an account - if you suffer total loss, spend the money and replace the house.

Where people go wrong is by thinking that problems can and should be prevented at any cost. But the issue is that thinking that way leads to excessive costs for the thing in the first place. It would be possible to design a highway system where nobody ever died. However the cost would be so high that very few highways would be built, so the advantages of cheap and easy travel are lost.

Likewise, it would be possible for Microsoft to build automated testing and checking software that never lets a mistake through. However, that would make Azure uncompetitive or unprofitable. It's cheaper to just hire good people and accept that occasionally something might go wrong.

Some software is actually made to never go wrong. That software is in satellites and Mars rovers and the like. Even then, mistakes happen due to the nature of complexity and probability. But the cost per line of delivered code for a satellite is orders of magnitude higher than the cost of Azure management code.

You really only need to look at problems when the cost of the fix is much lower than the cost of the potential loss. That's why planes are safer than cars: the loss of a big plane and its passengers is a very costly event.
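The risk/benefit rule in the last few paragraphs reduces to a one-line expected-value comparison. A toy sketch with made-up dollar figures (the probabilities and costs are illustrative assumptions, not real data):

```python
# Mitigate only when the mitigation is cheaper than the expected loss
# it prevents: fix_cost < p(failure) * cost(failure).
def expected_loss(p_failure, cost_of_failure):
    return p_failure * cost_of_failure

def worth_fixing(fix_cost, p_failure, cost_of_failure):
    return fix_cost < expected_loss(p_failure, cost_of_failure)

# A 1-in-100 outage costing $5M vs. a $20k test harness: worth it.
worth_fixing(20_000, 1e-2, 5_000_000)   # True  (expected loss $50k)
# The same harness against a 1-in-a-million bug: not worth it.
worth_fixing(20_000, 1e-6, 5_000_000)   # False (expected loss $5)
```

This is exactly the "insurance" framing above: the highway where nobody ever dies fails this inequality for almost every marginal safety measure.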


It's not a missing piece, it's in the release: "Azure Table storage Front-Ends, the configuration switch was incorrectly enabled for Azure Blob storage Front-Ends."

Not only was it deployed to everyone, it was deployed, untested, to the wrong place. Otherwise the deploy would have been just as broad but successful.


> Were there enough code reviews? Did automated testing fail here?

I really do love tests and all, but they only get you so far. In fact, you're far more often bitten by things outside your frame of reference, and those are exactly the ones you don't account for when designing a testing pipeline.


Ops engineer here. This is a particularly hard case because the problem involved an interaction of components across the network, and was scale-dependent. These kinds of problems are truly "emergent" in that they're enormously hard to test for. Absent an exact copy of production, with the same workload, I/O characteristics, network latencies, etc., there are always some class of scale/performance-related bugs you just won't catch until the code hits production.

One defense is a "canary" deployment process (they used the term "flighting") to ensure major changes are rolled out slowly enough to detect major performance shifts. Had their deployment process worked correctly, they may have been able to roll back the change without incident.
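A canary/flighting process like the one described can be sketched as a loop over widening stages. Here `deploy` and `error_rate` are hypothetical stand-ins for real deployment tooling and monitoring, not Azure's actual pipeline:

```python
# Roll out in widening stages and abort (and roll back) as soon as the
# canary slice looks unhealthy relative to the pre-deploy baseline.
STAGES = [0.01, 0.05, 0.25, 1.00]   # fraction of the fleet per stage

def rollout(deploy, error_rate, baseline, tolerance=1.5):
    for fraction in STAGES:
        deploy(fraction)                 # push the build to this slice
        if error_rate(fraction) > baseline * tolerance:
            deploy(0.0)                  # roll back everywhere
            return False                 # halted at the canary stage
    return True                          # healthy at every stage
```

The point of the postmortem is that a process like this existed ("flighting") but was bypassed; the tooling changes make the staging mandatory rather than optional.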

A second defense is proactively building "safeties" and "blowoff valves" into your software. Example: if a client notices a huge spike in errors, back off before retrying a connection request, otherwise you may put the system into a positive feedback loop. Ethernet collision detection/avoidance is a great example of a safety mechanism done well.
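The client-side "blowoff valve" described above is commonly implemented as exponential backoff with jitter, so a fleet of clients doesn't retry in lockstep and push a struggling service into that feedback loop. A minimal sketch (not any particular SDK's retry policy):

```python
import random
import time

def backoff_delay(attempt, base=0.1, cap=30.0):
    """Full-jitter exponential backoff; `attempt` is 0-based."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_backoff(fn, attempts=5, base=0.1):
    for attempt in range(attempts):
        try:
            return fn()
        except OSError:
            # Sleep a randomized, exponentially growing delay before retrying,
            # so many failing clients don't hammer the service simultaneously.
            time.sleep(backoff_delay(attempt, base=base))
    raise RuntimeError("service unavailable after retries")
```

The jitter is the part that prevents the positive feedback loop: without it, every client that failed at time T retries at exactly T + delay, recreating the spike.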

Finally, every high-scale domain has its own problems, which experienced engineers know to worry about. In my case, at an analytics provider, one of the hardest problems we face is data retention: how much to store, at what granularity, for how long, and how that interacts with our various plan tiers. OTOH we have significant latitude to be "eventually correct" or "eventually consistent" in a way a bank, stock exchange, or other transactional financial system (e.g. credit approval) can't be. I imagine in other areas like ad serving, video serving, game backend development, etc., there are similar "gotchas", but I don't know what they are.


New model airplanes always fly with a minimum crew for their first test flights. This is classic risk reduction: reduce the consequence of a problem when the probability of that problem occurring is greatest. That's a better analogy than designing them to seat fewer passengers.




