Ask HN: Are people considering moving off of Fly.io?
102 points by mind-blight on Feb 10, 2023 | hide | past | favorite | 45 comments
We're using fly at my work. It's had multiple outages in the last month that have taken down our production servers. There has been no proactive communication and very little insight besides "We've identified the issue and are attempting a fix."

We're now 24 hours into an outage that started with everything being taken offline, and is now causing intermittent 502 errors. Their status page (https://status.flyio.net/) still shows 99.99% uptime 24 hours into an outage.

Besides the outages, the service is great. But that's a big caveat. We're pretty frustrated and are considering leaving.

Is anyone else in the same situation, and if so what's keeping you/what are you leaving for?



I'm the one who created this incident on our status page. I've been overly cautious in resolving this incident, but at this point I think it's causing more harm than good to keep it unresolved on there.

I think it might've prevented users from posting on our forums or sending in an email (premium support). I can imagine users looking at the status page and mistakenly thinking their problems were related to the current incident.

I've interpreted "Monitoring" as essentially meaning: "this is fixed, but we're keeping a close eye on the situation". We do not yet have a formal process for incidents such as this one (but we are working on that).

If our users are having issues, that's a problem. Looking at our own metrics, the community forum, and our premium support inbox, I don't believe this to be the case.

Perhaps we should've done a better job at explaining the exact symptoms our users might be experiencing from this particular incident.


I really appreciate the context. We have an SPA with the frontend deployed on vercel and a GraphQL backend hosted on fly. The outage yesterday manifested as 502 errors being delivered to users on the frontend. We had another outage alert at 08:00 PST this morning that lasted about 5-10 minutes. It seemed like the same issue, so we didn't report another incident.

I really like fly, and I think you all are building a great product, but it's looking likely that we're going to migrate off of it. The biggest driver of that has been communication and issues with the status page. Specifically,

- When an incident occurs, we're often among the first to report it on the forum. Over the last month, the status page has lagged pretty significantly behind the incidents. This makes it feel like we're discovering the issue before fly (I don't know if that's true, but that's the perception). Given that our automated tools are alerting us, it's disconcerting to feel like we're keeping a closer eye on our box's health than our cloud provider is (again, this is perception based on communication lag, not necessarily reality).

- We have had multiple outages over the last month. In the middle of an outage, while there is an incident banner displayed at the top of the page, all systems show green with 99.98% or 99.99% uptime. That makes us not trust the numbers on the status page. This reinforces the above perception that fly's systems aren't being accurately monitored. Even now, the status page shows 100% uptime for all systems yesterday and today, which is not true.

- We emailed yesterday about our frustrations and concerns - specifically talking about the disconnect between fly's status page and the multiple outages. We explicitly called out the two points above, and how the communication up to this point has been "We've implemented a fix and are monitoring it". We asked for more details about what occurred, and what was being done to mitigate it in the future. The response was pretty boilerplate: "We're sorry you're frustrated. Here are some credits. We've implemented a fix and are monitoring it. Please let us know if you are still encountering issues."

The incidents were a problem, but the disconnect between what was communicated and what actually occurred, across multiple channels, is what's driving us to leave. Here's what likely would have convinced us to stay:

- Over-communicate during the incident. I'd prefer to see more status updates rather than fewer.

- Have clear, proactive incident notifications. Even with automated monitoring, things will slip through the cracks, but everything over the last month has felt reactive.

- Make sure the status page clearly reflects reality. If the system is down and everything shows green, then I'm 1) frustrated, and 2) wondering what else is slipping through the cracks.

- Publish retro docs or incident reports after an incident. Specifically, report what changes are being made to prevent similar outages going forward.

- Train support staff to communicate directly with developers. Boilerplate emails that focus on empathizing rather than informing are generally frustrating, especially when they don't actually answer the questions being asked. I get that it's not reasonable to expect a support person to have an in-depth technical conversation, but this is where public incident reports (or live incident pages) can be really helpful.

I think you all are making a great product, but the issues with alerting, monitoring, and communication are too impactful for our production application. I'm confident you'll figure it out, but it's unlikely that we're going to wait.


> I think it's causing more harm than good to keep it unresolved on there.

Sorry, what? You have an open incident that you think should be shown as resolved as not doing so "causes more harm than good"?

Right, so lying to your customers about the state of an incident is better than just telling them the truth?


When you place a fix into production, often it's the case you hope it resolves all issues and doesn't create new ones.

However, you don't know if it resolved everything because you are only working with the symptoms given by one user.

If another user has a similar but not identical problem, they won't post about it while the incident still shows as unresolved. They don't know their case is different and isn't being worked on.


> When you place a fix into production, often it's the case you hope it resolves all issues and doesn't create new ones.

I hope not. Relying on "hope" when fixing prod is not a recipe for success in my book. It should ideally be possible to recreate the problem in a lesser environment, or at least get a level of comfort that the fix will work based more on fact than "hope" before applying it.

Even then, if you are relegated to the level of hope and prayer when trying to handle an incident, it still doesn't mean you should close it unless you are *certain* it's fixed.

You can mark it as mitigated or fix applied, monitoring for xx period before marking as resolved or similar, surely.


I wholly agree. From what I see, OP also agrees, since they'll now be using stricter criteria that let them close incidents earlier and only reopen when it's proven there are other issues.


People should learn that using an intermediary other than AWS or Google Cloud for convenience is risky. It all depends on your tolerance for risk vs. screwing around, but if you want to go cheap, you should run your own instrumentation on top of bare Linux instances from commodity vendors that can be cycled out easily, and use multiple vendors so outages at one are easily remedied.

Heroku is another example. Can’t trust your business to shaky foundations. The moment they started to have frequent outages your company should have been migrating ASAP.

As a side note, I would never use nor invest in brand-new databases. Database tech needs to soak for 10+ years before I trust that the software is stable and the organization behind it will exist long-term. A startup using a shiny new database is evidence of weak engineering leadership. Similarly, Terraform / CloudFormation is easy enough that needing something other than AWS tooling itself makes less sense from a cost vs. convenience perspective.


I'm not interested in becoming a cloud practitioner or getting an AWS/GCP/Azure certificate just to host my web app on AWS.

All I want to do is just git push my code and my app is distributed worldwide in the closest regions, all fast with no deploy scripts or convoluted formation tools.

Heroku and maybe Fly.io are the closest to this goal of all the solutions I've tried.

But anything that gets in my way of this goal is friction.


The other way to protect yourself is to design the app to be easily portable between cloud solutions.

A static site can easily move from fly to heroku to vercel to digital ocean. Something written in NodeJS can be moved around those just as easily.

Most apps will be naturally portable unless you do something very specific to your provider.

The idea here is not to be "multi cloud" but to say "if this provider goes to shit, give me 2-4 hours, and I can deploy it somewhere else, change the DNS, and then go sip a pina colada".


For small potatoes stuff, yes. For critical business apps that make money and have real customers and jobs on the line, IMO too risky. It isn’t that hard to get to “push code and it deploys” directly with AWS nowadays.


You don't need any AWS certs to deploy a few lambdas and a couple of DynamoDB tables. Come on.

Your codebase, a few shell scripts, a yaml file for your CI/CD config, and another for your serverless framework definition and you're done -- couldn't be easier.
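For what it's worth, the setup described above maps to a Serverless Framework definition of roughly this shape. A minimal sketch of one Lambda behind an HTTP endpoint plus a DynamoDB table (service, handler, and table names here are illustrative, not anything from this thread):

```yaml
# serverless.yml -- illustrative sketch; all names are made up
service: my-api

provider:
  name: aws
  runtime: nodejs18.x
  region: us-east-1

functions:
  hello:
    handler: handler.hello        # exports.hello in handler.js
    events:
      - httpApi:
          path: /hello
          method: get

resources:
  Resources:
    ItemsTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: items
        BillingMode: PAY_PER_REQUEST
        AttributeDefinitions:
          - AttributeName: id
            AttributeType: S
        KeySchema:
          - AttributeName: id
            KeyType: HASH
```

`npx serverless deploy` then packages the code and provisions everything via CloudFormation.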


If it’s that small, you could just host from home with a static IP or mapper.

If you’re building a commercial product where you’re being paid to provide a service, you’re obviously going to want something robust.


100% agreed. But I'd suggest you just go for a serverless offering from one of the big cloud players.

If your app runs on Heroku or Fly.io, you really don't need servers or VPCs or k8s clusters.

Where I work, we've gone all-in on AWS Serverless and couldn't be happier. We have 1 dev-ops/infra person supporting 3 feature teams, all of which release apps built on this stack multiple times a day. The infrastructure overhead is so low that our dev-ops guy also has time to optimise CI/CD for speed, tighten up security, and so on.

When we started, we had 2 devs who just did everything from a single repo, a gitlab account, and the serverless framework.

I really don't understand how adding yet another layer on top of AWS / GCP / Azure is materially going to change the developer or user experience. It just adds cost.


> I really don't understand how adding yet another layer on top of AWS / GCP / Azure is materially going to change the developer or user experience. It just adds cost.

Actually, AWS's PaaS products (like Lambda) are themselves a layer atop AWS IaaS (like EC2), a layer you just said you immensely favour and that reduced your costs (1 dev-ops for 3 teams). Besides, Snowflake and Databricks are two examples among many of non-AWS but AWS-dependent billion-dollar software shops that work just fine for the largest of enterprise businesses.

That said, I don't think Fly is a layer on top of AWS (from the looks of it, they rely on servers from NetActuate, Equinix, and others: https://news.ycombinator.com/item?id=29162706). They couldn't have the pricing they do if they were.


AWS regularly lies about their status on their status page.

Instead, if you have an issue you think is an AWS specific issue and if you spend enough money with AWS, you have a TAM. You reach out to your TAM who can give you the real status under AWS NDA.


I would love to use AWS but the usage based pricing has scared me away. There have been too many stories of running up bills in the tens of thousands from coding errors. If AWS had a way to make sure that didn’t happen I would be all over it.


We're likely going to move off of them. Last year we were using their WireGuard "peering" feature to connect to our RDS DB (as recommended by their blog)[0].

This feature had a multi-hour outage, and when we wrote in for support, we were told "[t]he Wireguard peers are intended to get you development access to your network. We didn't really build them to handle inter service communication that affects uptime. The gateways we run wireguard peers on are not redundant."

We stopped using the feature (using Tailscale instead), but in my opinion, that directly contradicts the spirit of their blog and docs, and it really left a bad taste in our mouth. We're probably going to move to Render or something similar soon.

[0]: https://fly.io/blog/ipv6-wireguard-peering/#wireguard-peerin...


"[t]he Wireguard peers are intended to get you development access to your network. We didn't really build them to handle inter service communication that affects uptime"

Huh, such a strange response. It shouldn't matter what my use case is (development vs. service communication); if it's running, it should be up.

Also, development access to your network seems like a weird thing to be OK with losing. Even if you just used it for development, wouldn't you want it to be accessible?


I used it a year ago and had to move off: just too many errors, a few seemingly lost deployments, and in general needing to reconnect or turn things off and on to get them to work. It definitely felt very beta.

The final straw, though, was really testing the DB. I had a $40/mo dedicated server, and I spun up their recommended few-node cluster for Postgres. Query response time was something like 5x faster for the dedicated server vs. their similarly priced setup. I tried upgrading to the top of the line: still much slower, and at that point many multiples more expensive.

It wasn't just that, though; the entire app was sluggish, whereas locally or with a dedicated box it felt incredibly snappy. I'd have had to be spending something like ~$2k/mo to get their top-of-the-line nodes across every service, and I'd still have to accept half the speed across my entire app. The edge isn't very useful if it's not powerful!

Disclosure: I work at Vercel, and I do like what Fly.io does generally; I had these opinions well before working at Vercel was even a consideration. I think a lot of serverless/edge-type hosts are hiding their true cost behind cheap, low-powered nodes. If the most powerful nodes available are still less powerful than a very mid-tier dedicated box, there goes your entire app's performance.


To be fair, this sounds like an apples to oranges comparison.

Given any "cloud" instance (VM), it's pretty much guaranteed it will be considerably slower than a dedicated server of the same size.

Also, it seems you are comparing a local database (single dedicated server) with a clustered database ("few-node cluster").

That said, I totally get the cost-benefit argument and this is also why I started using dedicated servers for my own projects.


The point is that it's actually impossible to get good performance: even when I scale the nodes up to their highest, the raw CPU is a lot lower, and that really matters. Also, the cluster has nothing to do with it, as this was testing reads, and clusters should perform basically the same since all they add is a load balancer across the nodes. Not to mention I tried every combo: just one server vs. two or three. Nothing made a difference. The diff was huge at the time, something like 3x.


For many (if not most) use cases, the latency is more important than the compute.


Definitely not true for my very common use case of Postgres + Hasura + Node API. The DB especially is way faster with more compute. You're talking ~300+ms gains in perf just in the DB layer, with the graph and node layers also getting significant bumps. I was seeing a single dedicated host in Virginia be way, way faster (like 80% faster reply times) than the top-of-the-line cluster in Phoenix (which at the time was very close to me).

If you are serving a page that needs to do a few joins across a non-trivial dataset that's all it requires to immediately tilt the equation completely back to raw CPU being the dramatic deciding factor in speed.


We are not. As a matter of fact we just renewed our annual contract with Heroku.

As disappointing as it is that Heroku is basically stalling, the fact is that it was light-years ahead of the competition in terms of developer ergonomics. Even to this day, it's still super convenient and reliable enough for us.

If anyone wants us to switch to their service, being as good as Heroku or slightly better isn't enough. They need to be _much better_ to justify the costs of a switch.


Agreed, but the security incidents last year were enough to get us to switch. Our customers started asking about it w/r/t SOC2.

The developer ergonomics of Heroku we didn't like were around running lots of background tasks and long-running jobs. Heroku's way of doing that has a 24-hour time limit, and you face a lot of hardware limitations. So a lot of data processing was a no-go.

If you have a 12-factor app that's not in a highly regulated industry and doesn't require a ton of background data processing, or you have invested in database architecture where you can run those tasks totally offline, I still think Heroku is currently the best option.


What features would make something "much better"?


I don't really have a good answer for that. I'm happy with what we have at Heroku. My pain points would be:

1. The annoying 30-second request timeout (which other platforms don't have)

2. Steep pricing (though I'm not really aware how it compares with Fly)

3. Lack of HTTP/2 support (which I think is coming to Heroku soon)

4. The feeling of stagnation around Heroku

And I'm not sure if resolving all of these would be enough of an incentive for me to make a move.


We considered moving onto Fly since we were transitioning away from Heroku. We ultimately decided on just AWS for our core products and digital ocean app engine for smaller experiments.

Fly's overall experience wasn't as smooth as Heroku, from the dashboards to the weird errors for technical things that should work but didn't. The logging and error handling wasn't as informative as it should be. In essence we agreed with the value proposition of "give us PaaS magic with more control over the infrastructure than Heroku" but it wasn't sufficiently magical. The whole low-latency cdn-like distribution angle wasn't really relevant to our use-case.


I found your post by searching “fly.io” to see if there was anyone else reporting problems with their hosted Postgres. I seemingly can’t make migrations and all I can find is a community post that’s slowly growing in responses where it was initially reported four days ago :/

I guess I’ll try out Render?


I’m on the fence. I don’t mind outages and as a relatively new service, there’s some expectation that there will be outages but the frequency and similarity of the outages is a little disconcerting. I’ve not yet moved off but I am reconsidering my choice to use them for production services when there are a variety of alternatives — Google Cloud Run is very reliable.

The unique aspect of their service (ability to containerise an application on your behalf) is not the important part for me so the only thing keeping me on Fly at the moment is inertia — which is a shame, I want to love Fly.


Had way too many errors trying to set things up, so I just switched to render.com, which I love.


Will you email your app details to support (cc me too, if you want)? If your app is 502ing, it's unrelated to yesterday's outage.


I considered Fly for our Rails app currently hosted on Dokku, but even 3 months ago there were grumblings that it was flaky and not quite suitable yet for production. So now we are considering Northflank, Render (if they get a London region) or Digital Ocean Kubernetes Service.


You get what you pay for :-/.

Everyone else seems to be more expensive.


I like them. But the outages have been tough.


Yes. There's a lot to like, but they're not evolving quickly enough. There are so many rough edges.


Ironically, there's a new unresolved incident now, so I was initially not sure which one they were referring to…


I was planning to use fly.io for my next project!

What are people using these days to deploy a Node app (Fastify/SQLite backend + vanilla JS front end)? The last time I deployed anything, it was a Rails app to an EC2 instance via Capistrano, but that was eons ago.


I use fly.io for a side project and I have about 5 servers running there. I still think it's viable for side projects and my assumption is that the outages right now are growing pains that will get better with time.

I can totally understand why people using Fly in production would be frustrated and would consider moving. My product is free though, and I have fewer than 50 users a day. To me, Fly's low price and ease of use are worth the instability, since nothing I'm doing is mission critical.


disclaimer: I'm one of the founders

We're building Klotho[0] for many of the reasons mentioned here (happy to answer questions). We transform plain code into cloud-native code. The majority of the complexity is moved into the Klotho compiler, and what devs handle is the simplest bundle that's easy to deploy and operate on public clouds using standard tools.

[0] https://github.com/KlothoPlatform/klotho


This submission broke the site rules by drastically editorializing the title. From https://news.ycombinator.com/newsguidelines.html: "Please use the original title, unless it is misleading or linkbait; don't editorialize."

Also, if you want to make an Ask HN, those are supposed to be text posts.

Normally I'd bury this altogether but because this is a YC startup and we moderate less in such cases*, I'm going to moderate it less in this case. Please don't do this in the future though.

* https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...


Apologies, I meant to post this as an actual `Ask HN` to get the community's thoughts. I accidentally left the link in (which I was originally planning on posting with a different title).

I can't edit the post, but I'd be happy to have the link dropped and https://news.ycombinator.com/item?id=34742946#34742947 turned into the "Ask HN" text if that would make it better. That's what I was trying to do in the first place...


Ok, I've done that now. Thanks for the explanation!


Thank you!


[deleted]


Just in case any folks are thinking about switching and wondering where else to go, there have generally been two alternatives:

- Another PaaS: people love Render, Railway, Vercel. Heroku is still best-in-class even if not free. Replit has a PaaS built in now too that is getting very real.

- Going to the raw cloud, e.g. AWS or GCP. As much as folks say that Terraform or Pulumi or CDK has made this easy, it's just really not the same thing to get a great developer experience without a ton of work.

There's a new class of tools emerging that represents a 3rd way. withcoherence.com (I'm a cofounder) gives you the preview environments, built-in pipelines, and friendly UX that Vercel has set the standard with, while operating against your GCP or AWS account. Lock-in, uptime, service diversity, compliance, and pricing are all better on AWS/GCP than a PaaS. Coherence even adds a built-in Cloud IDE, giving you a gitpod or github codespace alternative with zero additional config or integration work.

Most of the "PaaS in your own cloud" category is a pile of Kubernetes abstraction. Coherence is something different: a real alternative for teams that are used to a great workflow but who don't want to invest the time to glue together open source and IaaS, or who aren't a fit for enterprise-grade CNCF-based tooling.

If anyone wants to check it out more or has any questions, happy to answer them or to help with migrations - just hit up hn@withcoherence.com!



