Confirmed, and an interesting opportunity to spot who is using Argo Tunnels based on their current downtime.
Tunnels are one of Cloudflare’s best features for developers: instant NAT traversal for home-hosted demos and prototypes. It’s a shame you have to pay for Argo on your entire domain in order to use Argo Tunnels on even one subdomain.
Cloudflare, please offer origin tunnels as a separate service rather than bundling it with Argo routing on the client side.
Yup, I emailed my account rep about Argo Tunnels being down and the status page not really reflecting that issue... which is a much bigger problem than the dashboard being offline or APIs not working, IMHO.
Argo Tunnels are most useful for security: you can serve HTTP from a server that allows no inbound connections. Or you can use them to reliably serve a public site from behind NAT.
Argo routing’s ability to improve page load times depends on how international your user base is, and how poor your origin’s transit quality is. It’s great at improving the latency and reliability of a cheap host.
Argo is kind of an overloaded term... Argo (not tunneling) is better routing from the edge, which can have a measurable impact on performance. Argo (tunneling) can also impact performance in a positive way but it is also a nice way to secure things, provide access to secure/internal sites, provide ssh/rdp access into a datacenter, and other things.
I don't know how big Cloudflare's management layer is, but I'm assuming it's relatively small (maybe dozens of racks at most), hosted in some sort of datacenter (Equinix, Digital Realty, Coresite, etc) that provides the remote hands.
Maybe some piece of core network equipment.
A small colo may have a single pair of routers, switches, or firewalls at its edge. If one had failed for some reason, and the remote hands removed the wrong one, it is possible you could knock the entire colo offline.
There's a bunch of other possible components: Storage platforms, power, maybe something like an HSM storing secrets, or even just a key database server.
Their failover to their backup facility may be impaired by the fact that, well, their management plane is down. They probably rely on their own services. Avoiding chicken-and-egg issues can require careful ahead-of-time planning.
And for over 3 hours? What could they have done accidentally that would cause an outage this long?
The tweet says they're failing over to their backup facility. I would've expected that failover to happen much faster.
Seems like they have two issues going on. First, that remote hands could take down the datacenter at all. Second, that their failover is taking this long to come online.
I also wonder how much Covid impacted the process, if at all.
I'm looking forward to the details after they get things back online.
Three hours for a physical problem like this is very little time.
Once you discover your equipment is not there, you still have to put it back in its place, power it up, reconnect all the cabling, and check networking.
From this kind of mess, you need at least 12-16 hours to get back to a minimally functional infrastructure. It probably takes 2-3 days to have that node working as it did before.
When I was on call, I always joked with my colleagues that the worst incident you could have in a datacenter was someone swapping two cables.
"If things are painful you should do them more often," isn't supposed to be about callouses, but a lot of people take it that way. People with callouses do not experience the activity the way everybody else does. Sounds like a bit of that happened here.
I want most of my mid-level people (and all of the ones bucking for a promotion) to be comfortable running 80-90% of routine maintenance operations and to have pretty decent guesses about what should be done for the next 5-10%. I don't often get my way.
>"During planned maintenance remote hands decommissioned some equipment that they shouldn’t have. We’re failing over to a backup facility and working to get the equipment back online."
Some questions:
When the gear was first powered off, why wasn't it just powered back up as soon as the on-call person's phone started blowing up? Why does this require a "failover" to another datacenter?
Why are remote hands decommissioning Cloudflare's gear in the first place? Isn't this supposed to be a security-focused company? For context, "remote hands" is the term for people who work for the colocation facility, such as Equinix, Telehouse, etc. They are not employees of the tenant (Cloudflare). Remote hands are great resources for things like checking a cable or running new cross connects, but certainly not for decommissioning gear without some form of tenant supervision.
Just FYI -- Most colo facilities are prohibiting customer access during the COVID-19 lockdowns and have gone to 'smart hands' only for health and safety reasons.
Sure, but that would seem reason enough to postpone the scheduled maintenance then, no? I would think trusting "remote hands" to decommission production gear would carry some serious weight in the risk/reward analysis of keeping the maintenance window. At any rate, I would think that those same remote hands should have been able to immediately power that gear back up as soon as they were made aware of the error.
If the risk of the change was minimal, why would they not proceed?
How can you plan for things that are out of your control? CF engineers are people as well. Things like this happen, and there will be lessons to take from it (like how to fail over faster).
What you mean "what if"? Clearly the risk was not minimal if it resulted in a 3 hour outage and affected customers. Someone at Cloudflare should have been able to identify that the services that were colocated at this facility had no redundancy prior to asking a third party to power down gear. Foregoing the maintenance until proper redundancy was in place was not out of their control.
Circulating a MOP (method of procedure) for data center maintenance among all stakeholders is pretty standard. The purpose of the MOP is so that everyone can vet the plan (roll forward and roll back) and identify the risks.
It's confusing to me to read this -- you write like you know what you're talking about, but that can't be the case because if you knew what you were talking about you'd understand how the kind of mistake that happened can happen.
As Matthew said on Twitter, this isn't the kind of mistake that happens twice. But for those of us who do operational work, it's easy to see how this happens once. Bit rot in a cab, a work order from someone who didn't know about the legacy patch panel or assumed too much in their instructions, and you have a catastrophe. From now on, photos for every smart-hands request will probably be part of the prep procedure.
>"It's confusing to me to read this -- you write like you know what you're talking about, but that can't be the case because if you knew what you were talking about you'd understand how the kind of mistake that happened can happen."
That's a pretty odd and flimsy argument that I can't know what I'm talking about if I hold an opinion that differs from your own. I have years of experience with this particular space.
Had a MOP been circulated and the maintenance plan properly vetted, the need for visual confirmation because of the "legacy patch panel" and "bit rot in a cab" would have been identified. Further, you never unplug a cable unless you first verify what the ends of the cable are connected to. As I mentioned, this is "datacenter operations 101" stuff. This is not some new startup; this is a publicly traded company that has been in the game for a decade now.
It's not that your opinion differs, it's that you are presenting a cognitive dissonance. If you have years of experience, then you would understand exactly how this happens, and yet you think it was avoidable, which is a misunderstanding of operational failures.
If this failure was so easy to avoid, it would have been avoided. It was the result of a combination of failures, as all failures in highly available systems are. It was a failure in bit rot, in planning, in execution, in communication, in reluctance to fail over due to a concern about failing back, etc.
That said, like Matthew said, this is the kind of failure that happens once.
I wish cloudflarestatus.com (powered by StatusPage) offered a subscription (like https://status.box.com/, also on StatusPage) so you could get a pro-active notice about outages.
I had to debug customer issues to find out that this was down.
Even if they don't want to offer this to the general public (free customers), they should have another notice mechanism for enterprise customers.
If you find div#subscribe-modal-g7nd3k80rxdb on the page, remove the `hide` and `fade` classes, and change `display: none` to `display: block`, you will be able to use the hidden subscription modal.
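If you'd rather paste something into the console than poke at the element by hand, roughly this does it (the id is the one mentioned above and may well change between deploys):

    // Reveal the hidden StatusPage subscribe modal (paste into the browser console).
    const modal = document.querySelector("#subscribe-modal-g7nd3k80rxdb");
    if (modal instanceof HTMLElement) {
      modal.classList.remove("hide", "fade"); // the classes keeping it hidden
      modal.style.display = "block";          // override the inline display: none
    }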
It's triggering the regular Cloudflare error when trying to access their dashboard.
Error 522: If you're the owner of this website:
Contact your hosting provider letting them know your web server is not completing requests. An Error 522 means that the request was able to connect to your web server, but that the request didn't finish. The most likely cause is that something on your server is hogging resources.
I can't wait to see the postmortem. I wonder if it's a DDoS, a network/hardware issue, or a deployment error.
This has broken publishing in our app, which purges the file from the Cloudflare cache when something is republished. We’re ignoring errors from the Cloudflare API, but that isn’t enough in this case because it isn’t returning an error – it’s just hanging till the request times out.
We’re pushing an emergency config change to skip the cache invalidation, which will stop it timing out but means republished projects won’t update (because the old version will still be cached).
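Longer term, the shape of the fix is probably a hard client-side timeout on the purge call, so a hanging API degrades into a skipped invalidation rather than a stuck publish. A rough sketch against the Cloudflare purge endpoint (zone id, token, and URLs are placeholders; this isn't our actual code):

    // Best-effort purge: give the Cloudflare API a few seconds, then carry on publishing.
    async function purgeFromCache(zoneId: string, apiToken: string, urls: string[]): Promise<void> {
      const controller = new AbortController();
      const timer = setTimeout(() => controller.abort(), 5_000); // fail fast instead of hanging
      try {
        await fetch(`https://api.cloudflare.com/client/v4/zones/${zoneId}/purge_cache`, {
          method: "POST",
          headers: {
            "Authorization": `Bearer ${apiToken}`,
            "Content-Type": "application/json",
          },
          body: JSON.stringify({ files: urls }),
          signal: controller.signal,
        });
      } catch (err) {
        // Swallow timeouts/errors: the publish proceeds, but the old version may stay cached.
        console.warn("Cache purge skipped:", err);
      } finally {
        clearTimeout(timer);
      }
    }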
Godspeed to the Cloudflare engineers who are presumably scrambling to fix this!
The trouble with returning to the user before the cache has been purged is that they think it hasn’t worked, because they still see the old cached version.
You’re absolutely right. But user requests for content are made directly to Cloudflare, not through the app, and it’s common for users to click directly through to the published content after republishing.
Ah, that makes sense. You could store the latest task status per piece of content in your database and have a JavaScript component that polls and shows a message while cache invalidation is incomplete. But given how rarely Cloudflare actually goes down, it's probably not worth it.
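Something like this, with a made-up endpoint and status field just to show the shape:

    // Hypothetical sketch: poll the app for the purge status of a piece of content
    // after a republish, and warn the user while the old cached copy may still be served.
    async function watchPurgeStatus(contentId: string, showBanner: (msg: string) => void): Promise<void> {
      for (let attempt = 0; attempt < 30; attempt++) {
        const res = await fetch(`/api/content/${contentId}/purge-status`); // made-up endpoint
        const { state } = await res.json(); // e.g. "pending" | "done" | "failed"
        if (state === "done") return;
        showBanner("Republished, but the cached copy may still show for a bit.");
        await new Promise((resolve) => setTimeout(resolve, 2000)); // poll every 2s
      }
    }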
Can't clear our cache via the API or manually, so our cached HTML pages are stuck until they expire naturally -- and those TTLs are set somewhat long. Not great for a blog/news site: for example, if we publish a story, our front page won't reflect it.
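For context, the window we're stuck with is basically whatever edge TTL we hand out on the HTML. An illustrative Express-style sketch (not our actual stack) of the header that decides how long the stale page lingers when purging is unavailable:

    import express from "express";

    const app = express();

    app.get("/", (_req, res) => {
      // A shorter s-maxage bounds how long the edge keeps serving the stale page
      // when the purge API is down; ours is set on the long side.
      res.set("Cache-Control", "public, max-age=60, s-maxage=3600");
      res.send("<html>front page</html>"); // placeholder for the real render
    });

    app.listen(3000);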
> "Cloudflare is continuing to investigate a network connectivity issue in the data center which serves API and dashboard functions."
This implies that CF hosts its API and dashboard all in one DC, which I find an _interesting_ observation. One would expect a company like CF to host its critical infrastructure in a redundant fashion.
It's certainly not ideal. But it's not unusual to spend a lot more time and money making the runtime very redundant than on dashboards and configuration-change underpinnings. That doesn't work well in this case, since it kills cache invalidation for customers.
The comments seem to imply that having a redundant way to refresh the page cache, even if it were global/domain versus page, would be an okay backup for many.
I agree that the first priority would be data integrity (which would be the runtime). But a large part of the CF experience for a CF customer is the availability of their management APIs/dashboards, and that would be another part to optimize for.
I'm really surprised that they hosted all those non-vital but still quite critical services in just one DC, or somehow had one DC as a single point of failure. Network issues happen regularly enough that you want to protect against that, or at least have mitigations available.
To be fair, you have these latent single points of failure even in the most resilient distributed systems.
Such as S3. The bucket names are globally unique, which means that their source of truth is in a single location. (Virginia, IIRC.)
Now... a small thought exercise. If I wanted to take down a Cloudflare datacenter and I had access to a few suitably careless remote hands, I'd take out the power supplies to the core routers, and while the external network is out of commission, power down the racks where they have their PXE servers. That should keep anything within the DC from recovering on its own.
Edit: Also noticed that when generating API keys, the dropdown wouldn't list all my accounts for setting permissions. Just assumed it was all related or something.
Either way, it's overall an insanely reliable product/service that I could not live without.
While they are fixing this, could they please roll out a feature that lets me assign users to only specific domains? It's my biggest complaint about Cloudflare; heck, even GoDaddy lets you do that at no cost.
At least we'll get a good blog post out of it in a few days.