Confirmed, and an interesting opportunity to spot who is using Argo Tunnels based on their current downtime.
Tunnels are one of Cloudflare’s best features for developers: instant NAT traversal for home-hosted demos and prototypes. It’s a shame you have to pay for Argo on your entire domain in order to use Argo Tunnels on even one subdomain.
Cloudflare, please offer origin tunnels as a separate service rather than bundling it with Argo routing on the client side.
Yup, I emailed my account rep about Argo Tunnels being down and the status page not really reflecting that issue... which is a much bigger problem than the dashboard being offline or APIs not working, IMHO.
Argo Tunnels are most useful for security: you can serve HTTP from a server that allows no inbound connections. Or you can use them to reliably serve a public site from behind NAT.
Argo routing’s ability to improve page load times depends on how international your user base is, and how poor your origin’s transit quality is. It’s great at improving the latency and reliability of a cheap host.
Argo is kind of an overloaded term... Argo (not tunneling) is better routing from the edge, which can have a measurable impact on performance. Argo (tunneling) can also impact performance in a positive way but it is also a nice way to secure things, provide access to secure/internal sites, provide ssh/rdp access into a datacenter, and other things.
I don't know how big Cloudflare's management layer is, but I'm assuming it's relatively small (maybe dozens of racks at most), hosted in some sort of datacenter (Equinix, Digital Realty, Coresite, etc) that provides the remote hands.
Maybe some piece of core network equipment.
A small colo may have a single pair of routers, switches, or firewalls at its edge. If one had failed for some reason, and the remote hands removed the wrong one, it is possible you could knock the entire colo offline.
There's a bunch of other possible components: Storage platforms, power, maybe something like an HSM storing secrets, or even just a key database server.
Their failover to their backup facility may be impaired by the fact that, well, their management plane is down. They probably rely on their own services. Avoiding chicken-and-egg issues can require careful ahead-of-time planning.
And for over 3 hours? What could they have done accidentally that would cause an outage this long?
The tweet says they're failing over to their backup facility. I would've expected that failover to happen much faster.
Seems like they have two issues going on. First, that remote hands could take down the datacenter at all. Second, that their failover is taking this long to come online.
I also wonder how much Covid impacted the process, if at all.
I'm looking forward to the details after they get things back online.
Three hours for a physical problem like this is very little time.
Once you discover your equipment is not there, you still have to put it back in its place, power it up, reconnect all the cabling, and check networking.
From this kind of mess, you need at least 12-16 hours to get back to a minimally functional infrastructure. It probably takes 2-3 days to have that node working as it did before.
When I was on call, I always joked with my colleagues that the worst incident you could have in a datacenter was someone swapping two cables.
"If things are painful you should do them more often," isn't supposed to be about callouses, but a lot of people take it that way. People with callouses do not experience the activity the way everybody else does. Sounds like a bit of that happened here.
I want most of my mid-level people (and all of the ones bucking for a promotion) to be comfortable running 80-90% of routine maintenance operations and to have pretty decent guesses about what should be done for the next 5-10%. I don't often get my way.
>"During planned maintenance remote hands decommissioned some equipment that they shouldn’t have. We’re failing over to a backup facility and working to get the equipment back online."
Some questions:
When the gear was first powered off, why wasn't it just powered back up as soon as the on-call person's phone started blowing up? Why does this require a "failover" to another datacenter?
Why are remote hands decommissioning Cloudflare's gear in the first place? Isn't this supposed to be a security-focused company? For context, "remote hands" is the term for people who work for the colocation facility, such as Equinix, Telehouse, etc. They are not employees of the tenant (Cloudflare). Remote hands are great resources for things like checking a cable or running new cross connects, but certainly not for decommissioning gear without some form of tenant supervision.
Just FYI -- Most colo facilities are prohibiting customer access during the COVID-19 lockdowns and have gone to 'smart hands' only for health and safety reasons.
Sure, but that would seem reason enough to postpone the scheduled maintenance then, no? I would think trusting "remote hands" to decommission production gear would carry some serious weight in the risk/reward analysis of keeping the maintenance window. At any rate, I would think that those same remote hands should have been able to immediately power that gear back up as soon as they were made aware of the error.
If the risk of the change was minimal, why would they not proceed?
How can you plan for things that are out of your control? CF engineers are people as well. Things like this happen, and there will be lessons to take from it (like how to fail over faster).
What you mean "what if"? Clearly the risk was not minimal if it resulted in a 3 hour outage and affected customers. Someone at Cloudflare should have been able to identify that the services that were colocated at this facility had no redundancy prior to asking a third party to power down gear. Foregoing the maintenance until proper redundancy was in place was not out of their control.
Circulating a MOP (method of procedure) for data center maintenance among all stakeholders is pretty standard. The purpose of the MOP is so that everyone can vet the plan (roll forward and roll back) and identify the risks.
It's confusing to me to read this -- you write like you know what you're talking about, but that can't be the case because if you knew what you were talking about you'd understand how the kind of mistake that happened can happen.
As Matthew said on Twitter, this isn't the kind of mistake that happens twice. But for those of us who do operational work, it's easy to see how this happens once. Bit rot in a cab, a work order from someone who didn't know about the legacy patch panel or assumed too much in their instructions, and you have a catastrophe. From now on, photos for every smart-hands request will probably be part of the prep procedure.
>"It's confusing to me to read this -- you write like you know what you're talking about, but that can't be the case because if you knew what you were talking about you'd understand how the kind of mistake that happened can happen."
That's a pretty odd and flimsy argument that I can't know what I'm talking about if I hold an opinion that differs from your own. I have years of experience with this particular space.
Had a MOP been circulated and the maintenance plan properly vetted, the need for visual confirmation because of the "legacy patch panel" and "bit rot in a cab" would have been identified. Further, you never unplug a cable unless you first verify what the ends of the cable are connected to. As I mentioned, this is "datacenter operations 101" stuff. This is not some new startup; this is a publicly traded company that has been in the game for a decade now.
It's not that your opinion differs, it's that you are presenting a cognitive dissonance. If you have years of experience, then you would understand exactly how this happens, and yet you think it was avoidable, which is a misunderstanding of operational failures.
If this failure was so easy to avoid, it would have been avoided. It was the result of a combination of failures, as all failures in highly available systems are. It was a failure in bit rot, in planning, in execution, in communication, in reluctance to fail over due to a concern about failing back, etc.
That said, like Matthew said, this is the kind of failure that happens once.
I wish cloudflarestatus.com (powered by StatusPage) offered a subscription (like https://status.box.com/, also on StatusPage) so you could get a pro-active notice about outages.
I had to debug customer issues to find out that this was down.
Even if they don't want to offer this to the general public (free customers), they should have another notice mechanism for enterprise customers.
If you find div#subscribe-modal-g7nd3k80rxdb on the page, remove the `hide` and `fade` classes, and change `display: none` to `display: block`, you will be able to use the hidden subscription modal.
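If you'd rather paste something into the console than poke at the element by hand, roughly this does it (the id is the one mentioned above and may well change between deploys):

    // Reveal the hidden StatusPage subscribe modal (paste into the browser console).
    const modal = document.querySelector("#subscribe-modal-g7nd3k80rxdb");
    if (modal instanceof HTMLElement) {
      modal.classList.remove("hide", "fade"); // the classes keeping it hidden
      modal.style.display = "block";          // override the inline display: none
    }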
It's triggering the regular Cloudflare error when trying to access their dashboard.
Error 522: If you're the owner of this website:
Contact your hosting provider letting them know your web server is not completing requests. An Error 522 means that the request was able to connect to your web server, but that the request didn't finish. The most likely cause is that something on your server is hogging resources.
I can't wait to see the postmortem. I wonder if it's a DDoS, a network/hardware issue, or a deployment error.
This has broken publishing in our app, which purges the file from the Cloudflare cache when something is republished. We’re ignoring errors from the Cloudflare API, but that isn’t enough in this case because it isn’t returning an error – it’s just hanging till the request times out.
We’re pushing an emergency config change to skip the cache invalidation, which will stop it timing out but means republished projects won’t update (because the old version will still be cached).
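Longer term, the shape of the fix is probably a hard client-side timeout on the purge call, so a hanging API degrades into a skipped invalidation rather than a stuck publish. A rough sketch against the Cloudflare purge endpoint (zone id, token, and URLs are placeholders; this isn't our actual code):

    // Best-effort purge: give the Cloudflare API a few seconds, then carry on publishing.
    async function purgeFromCache(zoneId: string, apiToken: string, urls: string[]): Promise<void> {
      const controller = new AbortController();
      const timer = setTimeout(() => controller.abort(), 5_000); // fail fast instead of hanging
      try {
        await fetch(`https://api.cloudflare.com/client/v4/zones/${zoneId}/purge_cache`, {
          method: "POST",
          headers: {
            "Authorization": `Bearer ${apiToken}`,
            "Content-Type": "application/json",
          },
          body: JSON.stringify({ files: urls }),
          signal: controller.signal,
        });
      } catch (err) {
        // Swallow timeouts/errors: the publish proceeds, but the old version may stay cached.
        console.warn("Cache purge skipped:", err);
      } finally {
        clearTimeout(timer);
      }
    }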
Godspeed to the Cloudflare engineers who are presumably scrambling to fix this!
The trouble with returning to the user before the cache has been purged is that they think it hasn’t worked, because they still see the old cached version.
You’re absolutely right. But user requests for content are made directly to Cloudflare, not through the app, and it’s common for users to click directly through to the published content after republishing.
Ah, that makes sense. You could store the latest task status per piece of content in your database and have a JavaScript component that polls and shows a message while cache invalidation is incomplete. But given how rarely Cloudflare actually goes down, it's probably not worth it.
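Something like this, with a made-up endpoint and status field just to show the shape:

    // Hypothetical sketch: poll the app for the purge status of a piece of content
    // after a republish, and warn the user while the old cached copy may still be served.
    async function watchPurgeStatus(contentId: string, showBanner: (msg: string) => void): Promise<void> {
      for (let attempt = 0; attempt < 30; attempt++) {
        const res = await fetch(`/api/content/${contentId}/purge-status`); // made-up endpoint
        const { state } = await res.json(); // e.g. "pending" | "done" | "failed"
        if (state === "done") return;
        showBanner("Republished, but the cached copy may still show for a bit.");
        await new Promise((resolve) => setTimeout(resolve, 2000)); // poll every 2s
      }
    }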
Can't clear our cache via the API or manually, so our cached HTML pages are stuck until they expire naturally -- and those TTLs are set somewhat long. Not great for a blog/news site: for example, if we publish a story, our front page won't reflect it.
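For context, the window we're stuck with is basically whatever edge TTL we hand out on the HTML. An illustrative Express-style sketch (not our actual stack) of the header that decides how long the stale page lingers when purging is unavailable:

    import express from "express";

    const app = express();

    app.get("/", (_req, res) => {
      // A shorter s-maxage bounds how long the edge keeps serving the stale page
      // when the purge API is down; ours is set on the long side.
      res.set("Cache-Control", "public, max-age=60, s-maxage=3600");
      res.send("<html>front page</html>"); // placeholder for the real render
    });

    app.listen(3000);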
> "Cloudflare is continuing to investigate a network connectivity issue in the data center which serves API and dashboard functions."
This implies that CF hosts its API and dashboard all in one DC, which I find an _interesting_ observation. One would expect a company like CF to host its critical infrastructure in a redundant fashion.
It's certainly not ideal. But it's not unusual to spend a lot more time and money making the runtime very redundant than on dashboards and configuration-change underpinnings. That doesn't work well in this case, since it kills cache invalidation for customers.
The comments seem to imply that having a redundant way to refresh the page cache, even if it were global/domain versus page, would be an okay backup for many.
I agree that the first priority would be data integrity (which would be the runtime). But a large part of the CF experience for a CF customer is the availability of their management APIs/dashboards, and that would be another part to optimize for.
I'm really surprised that they hosted all those non-vital but still quite critical services in just one DC, or somehow had one DC as a single point of failure. Network issues happen regularly enough that you want to protect against that, or at least have mitigations available.
To be fair, you have these latent single points of failure even in the most resilient distributed systems.
Such as S3. The bucket names are globally unique, which means that their source of truth is in a single location. (Virginia, IIRC.)
Now... a small thought exercise. If I wanted to take down a Cloudflare datacenter and I had access to a few suitably careless remote hands, I'd take out the power supplies to the core routers, and while the external network is out of commission, power down the racks where they have their PXE servers. That should keep anything within the DC from recovering on its own.
Edit: Also noticed that when generating API keys, the dropdown wouldn't list all my accounts for setting permissions. Just assumed it was all related or something.
Either way, it's overall an insanely reliable product/service that I could not live without.
While they are fixing this, could they please roll out a feature that lets me assign users to only specific domains? It's my biggest complaint about Cloudflare; heck, even GoDaddy lets you do that at no cost.
At least we'll get a good blog post out of it in a few days.