
It's certainly not ideal. But it's not unusual to spend far more on making the runtime redundant than on the dashboards and configuration-change machinery underneath it. That tradeoff works out badly in this case, since it left customers unable to invalidate cached items.

The comments seem to imply that a redundant way to refresh the page cache, even one operating per domain or globally rather than per page, would be an acceptable backup for many.
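One way to sketch that granularity trade (a hedged example, not Cloudflare's actual recovery path: the zone ID, token, and URLs are placeholders, and it assumes the public v4 purge_cache endpoint, which accepts both a per-URL "files" list and a zone-wide "purge_everything" flag):

    import requests

    API = "https://api.cloudflare.com/client/v4"

    def purge(zone_id: str, token: str, urls: list[str]) -> None:
        """Try a fine-grained per-URL purge; fall back to a zone-wide purge."""
        headers = {"Authorization": f"Bearer {token}"}
        endpoint = f"{API}/zones/{zone_id}/purge_cache"

        # Preferred path: invalidate only the affected URLs.
        resp = requests.post(endpoint, headers=headers, json={"files": urls})
        if resp.ok and resp.json().get("success"):
            return

        # Coarse fallback: drop the zone's entire cache. Expensive, since
        # the origin absorbs the refill traffic, but better than serving
        # stale data with no way to invalidate it.
        resp = requests.post(endpoint, headers=headers,
                             json={"purge_everything": True})
        resp.raise_for_status()

The coarse path costs origin traffic, which is exactly the trade being accepted here as "an okay backup".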



I agree that the first priority would be data integrity (which lives in the runtime). But a large part of the experience for a CF customer is the availability of the management APIs and dashboards, so that would be another thing to optimize for.

I'm really surprised that they hosted all those non-vital but still quite critical services in a single DC, or somehow had one DC as a single point of failure. Network issues happen regularly enough that you'd want to protect against them, or at least have mitigations available.


To be fair, you find these latent single points of failure even in the most resilient distributed systems.

Take S3: bucket names are globally unique, which means their source of truth lives in a single location. (Virginia, IIRC.)
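The global namespace is visible from any region. A hedged boto3 sketch (the bucket name and region are made up) showing that creating a bucket anywhere has to be checked against names registered everywhere:

    import boto3
    from botocore.exceptions import ClientError

    # Bucket names share one global namespace, so a create in eu-west-1
    # must be checked against names registered from any other region.
    s3 = boto3.client("s3", region_name="eu-west-1")

    try:
        s3.create_bucket(
            Bucket="some-name-already-taken",  # hypothetical name
            CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
        )
    except ClientError as err:
        # "BucketAlreadyExists": another account holds the name.
        # "BucketAlreadyOwnedByYou": you created it, possibly in a
        # different region. Either way the uniqueness check is global.
        print(err.response["Error"]["Code"])

That single authoritative registry is the kind of latent single point of failure being pointed at here.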

Now... a small thought exercise. If I wanted to take down a Cloudflare datacenter and had access to a few suitably careless remote hands, I'd take out the power supplies to the core routers, and while the external network was out of commission, power down the racks holding their PXE servers. That should keep anything within the DC from recovering on its own.



