Ask HN: Are people considering moving off of Fly.io?
102 points by mind-blight on Feb 10, 2023 | hide | past | favorite | 45 comments
We're using fly at my work. It's had multiple outages in the last month that have taken down our production servers. There has been no proactive communication and very little insight besides "We've identified the issue and are attempting a fix."

We're now 24 hours into an outage that started with everything being taken offline, and is now causing intermittent 502 errors. Their status page (https://status.flyio.net/) still shows 99.99% uptime 24 hours into an outage.

Besides the outages, the service is great. But that's a big caveat. We're pretty frustrated and are considering leaving.

Is anyone else in the same situation, and if so what's keeping you/what are you leaving for?



I'm the one who created this incident on our status page. I've been overly cautious in resolving this incident, but at this point I think it's causing more harm than good to keep it unresolved on there.

I think it might've prevented users from posting on our forums or sending in an email (premium support). I can imagine users looking at the status page and mistakenly thinking their problems were related to the current incident.

I've interpreted "Monitoring" as essentially meaning: "this is fixed, but we're keeping a close eye on the situation". We do not yet have a formal process for incidents such as this one (but we are working on that).

If our users are having issues, that's a problem. Looking at our own metrics, the community forum, and our premium support inbox, I don't believe this to be the case.

Perhaps we should've done a better job at explaining the exact symptoms our users might be experiencing from this particular incident.


I really appreciate the context. We have an SPA with the frontend deployed on vercel and a GraphQL backend hosted on fly. The outage yesterday manifested as 502 errors being delivered to users on the frontend. We had another outage alert at 08:00 PST this morning that lasted about 5-10 minutes. It seemed like the same issue, so we didn't report another incident.

I really like fly, and I think you all are building a great product, but it's looking likely that we're going to migrate off of it. The biggest driver of that has been communication and issues with the status page. Specifically,

- When an incident occurs, we're often among the first to report it on the forum. Over the last month, the status page has lagged pretty significantly behind the incidents. This makes it feel like we're discovering the issue before fly (I don't know if that's true, but that's the perception). Given that our automated tools are alerting us, it's disconcerting to feel like we're keeping a closer eye on our box's health than our cloud provider is (again, this is perception based on communication lag, not necessarily reality).

- We have had multiple outages over the last month. In the middle of an outage, while there is an incident banner displayed at the top of the page, all systems show green with 99.98% or 99.99% uptime. That makes us not trust the numbers on the status page. This reinforces the above perception that fly's systems aren't being accurately monitored. Even now, the status page shows 100% uptime for all systems yesterday and today, which is not true.

- We emailed yesterday about our frustrations and concerns - specifically talking about the disconnect between fly's status page and the multiple outages. We explicitly called out the two points above, and how the communication up to this point has been "We've implemented a fix and are monitoring it". We asked for more details about what occurred, and what was being done to mitigate it in the future. The response was pretty boilerplate: "We're sorry you're frustrated. Here are some credits. We've implemented a fix and are monitoring it. Please let us know if you are still encountering issues."

The incidents were a problem, but the disconnect between what was communicated and what actually occurred, across multiple channels, is what's driving us to leave. Here's what likely would have convinced us to stay:

- Over-communicate during the incident. I'd prefer to see more status updates rather than fewer.

- Have clear, proactive incident notifications. Even with automated monitoring, things will slip through the cracks, but everything over the last month has felt reactive.

- Make sure the status page clearly reflects reality. If the system is down and everything shows green, then I'm 1) frustrated, and 2) wondering what else is slipping through the cracks.

- Publish retro docs or incident reports after an incident. Specifically, report what changes are being made to prevent similar outages going forward.

- Train support staff to communicate directly with developers. Boilerplate emails that focus on empathizing rather than informing are generally frustrating, especially when they don't actually answer the questions being asked. I get that it's not reasonable to expect a support person to have an in-depth technical conversation, but this is where public incident reports (or live incident pages) can be really helpful.

I think you all are making a great product, but the issues with alerting, monitoring, and communication are too impactful for our production application. I'm confident you'll figure it out, but it's unlikely that we're going to wait.


> I think it's causing more harm than good to keep it unresolved on there.

Sorry, what? You have an open incident that you think should be shown as resolved as not doing so "causes more harm than good"?

Right, so lying to your customers about the state of an incident is better than just telling them the truth?


When you place a fix into production, often it's the case you hope it resolves all issues and doesn't create new ones.

However, you don't know if it resolved everything because you are only working with the symptoms given by one user.

If another user has a similar but not identical problem, they won't post about it while the incident still shows as unresolved. They don't know their case is different and isn't being worked on.


> When you place a fix into production, often it's the case you hope it resolves all issues and doesn't create new ones.

I hope not. Relying on "hope" when fixing prod is not a recipe for success in my book. It should ideally be possible to recreate the problem in a lesser environment, or at least get a level of comfort that the fix will work based more on fact than "hope" before applying it.

Even then, if you are relegated to the level of hope and prayer when trying to handle an incident, it still doesn't mean you should close it unless you are *certain* it's fixed.

You can mark it as mitigated or fix applied, monitoring for xx period before marking as resolved or similar, surely.


I wholly agree. From what I see, OP also agrees, since they'll now be using stricter criteria that let them close incidents earlier and only reopen when it's proven there are other issues.


People should learn that using an intermediary other than AWS or Google Cloud for convenience is risky. It all depends on your tolerance for risk vs. screwing around, but if you want to go cheap, you should run your own instrumentation on top of bare Linux instances from commodity vendors that can be cycled out easily, and use multiple vendors so outages at one are easily remedied.

Heroku is another example. Can’t trust your business to shaky foundations. The moment they started to have frequent outages your company should have been migrating ASAP.

As a side note, I would never use nor invest in brand-new databases. Database tech needs to soak for 10+ years before I trust that the software is stable and the organization behind it will exist long-term. A startup using a shiny new database is evidence of weak engineering leadership. Similarly, Terraform / CloudFormation is easy enough that needing something other than AWS tooling itself makes less sense from a cost vs. convenience perspective.


I'm not interested in becoming a cloud practitioner or getting an AWS/GCP/Azure certificate just to host my web app on AWS.

All I want to do is just git push my code and my app is distributed worldwide in the closest regions, all fast with no deploy scripts or convoluted formation tools.

Heroku and maybe Fly.io are the closest to this goal of all the solutions I've tried.

But anything that gets in my way of this goal is friction.


The other way to protect yourself is to design the app to be easily portable between cloud solutions.

A static site can easily move from fly to heroku to vercel to digital ocean. Something written in NodeJS can be moved around those just as easily.

Most apps will be naturally portable unless you do something very specific to your provider.

The idea here is not to be "multi cloud" but to say "if this provider goes to shit, give me 2-4 hours, and I can deploy it somewhere else, change the DNS, and then go sip a pina colada".


For small potatoes stuff, yes. For critical business apps that make money and have real customers and jobs on the line, IMO too risky. It isn’t that hard to get to “push code and it deploys” directly with AWS nowadays.


You don't need any AWS certs to deploy a few lambdas and a couple of DynamoDB tables. Come on.

Your codebase, a few shell scripts, a yaml file for your CI/CD config, and another for your serverless framework definition and you're done -- couldn't be easier.
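For what it's worth, the setup described above maps to a Serverless Framework definition of roughly this shape. A minimal sketch of one Lambda behind an HTTP endpoint plus a DynamoDB table (service, handler, and table names here are illustrative, not anything from this thread):

```yaml
# serverless.yml -- illustrative sketch; all names are made up
service: my-api

provider:
  name: aws
  runtime: nodejs18.x
  region: us-east-1

functions:
  hello:
    handler: handler.hello        # exports.hello in handler.js
    events:
      - httpApi:
          path: /hello
          method: get

resources:
  Resources:
    ItemsTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: items
        BillingMode: PAY_PER_REQUEST
        AttributeDefinitions:
          - AttributeName: id
            AttributeType: S
        KeySchema:
          - AttributeName: id
            KeyType: HASH
```

`npx serverless deploy` then packages the code and provisions everything via CloudFormation.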


If it’s that small, you could just host from home with a static IP or mapper.

If you’re building a commercial product where you’re being paid to provide a service, you’re obviously going to want something robust.


100% agreed. But I'd suggest you just go for a serverless offering from one of the big cloud players.

If your app runs on Heroku or Fly.io, you really don't need servers or VPCs or k8s clusters.

Where I work, we've gone all-in on AWS Serverless and couldn't be happier. We have 1 dev-ops/infra person supporting 3 feature teams, all of which release apps built on this stack multiple times a day. The infrastructure overhead is so low that our dev-ops guy also has time to optimise CI/CD for speed, tighten up security, and so on.

When we started, we had 2 devs who just did everything from a single repo, a gitlab account, and the serverless framework.

I really don't understand how adding yet another layer on top of AWS / GCP / Azure is materially going to change the developer or user experience. It just adds cost.


> I really don't understand how adding yet another layer on top of AWS / GCP / Azure is materially going to change the developer or user experience. It just adds cost.

Actually, AWS's PaaS products (like Lambda) are themselves a layer atop AWS IaaS (like EC2), a layer you just said you immensely favour and that reduced your costs (1 dev-ops for 3 teams). Besides, Snowflake and Databricks are two examples among many of non-AWS but AWS-dependent billion-dollar software shops that work just fine for the largest of enterprise businesses.

That said, I don't think Fly is a layer on top of AWS (from the looks of it, they rely on servers from NetActuate, Equinix, and others: https://news.ycombinator.com/item?id=29162706). They couldn't have the pricing they do if they were.


AWS regularly lies about their status on their status page.

Instead, if you have an issue you think is an AWS specific issue and if you spend enough money with AWS, you have a TAM. You reach out to your TAM who can give you the real status under AWS NDA.


I would love to use AWS but the usage based pricing has scared me away. There have been too many stories of running up bills in the tens of thousands from coding errors. If AWS had a way to make sure that didn’t happen I would be all over it.


We're likely going to move off of them. Last year we were using their WireGuard "peering" feature to connect to our RDS DB (as recommended by their blog)[0].

This feature had a multi-hour outage, and when we wrote in for support, we were told "[t]he Wireguard peers are intended to get you development access to your network. We didn't really build them to handle inter service communication that affects uptime. The gateways we run wireguard peers on are not redundant."

We stopped using the feature (using Tailscale instead), but in my opinion, that directly contradicts the spirit of their blog and docs, and it really left a bad taste in our mouth. We're probably going to move to Render or something similar soon.

[0]: https://fly.io/blog/ipv6-wireguard-peering/#wireguard-peerin...


"[t]he Wireguard peers are intended to get you development access to your network. We didn't really build them to handle inter service communication that affects uptime"

Huh, such a strange response. It shouldn't matter what my use case is (development vs. service communication); if it's running, it should be up.

Also, development access to your network seems like a weird thing to be OK with losing. Even if you just used it for development, wouldn't you want it to be accessible?


I used it a year ago and had to move off: just too many errors, a few seemingly lost deployments, and in general needing to reconnect or turn things off and on to get them to work. It definitely felt very beta.

The final straw, though, was really testing the DB. I had a $40/mo dedicated server, and I spun up their recommended few-node cluster for Postgres. Query response time was something like 5x faster for the dedicated server vs. their similarly priced setup. I tried upgrading to the top of the line: still much slower, and at that point many multiples more expensive.

It wasn't just that, though; the entire app was sluggish, whereas locally or with a dedicated box it felt incredibly snappy. I'd have had to be spending something like ~$2k/mo to get their top-of-the-line nodes across every service, and I'd still have to accept half the speed across my entire app. The edge isn't very useful if it's not powerful!

Disclosure: I work at Vercel, and I do like what Fly.io does generally; I had these opinions well before working at Vercel was even a consideration. I think a lot of serverless/edge-type hosts are hiding their true cost behind cheap, low-powered nodes. If the most powerful nodes available are still less powerful than a very mid-tier dedicated box, there goes your entire app's performance.


To be fair, this sounds like an apples to oranges comparison.

Given any "cloud" instance (VM), it's pretty much guaranteed it will be considerably slower than a dedicated server of the same size.

Also, it seems you are comparing a local database (single dedicated server) with a clustered database ("few-node cluster").

That said, I totally get the cost-benefit argument and this is also why I started using dedicated servers for my own projects.


The point is that it's actually impossible to get good performance: even when I scale the nodes up to their highest, the raw CPU is a lot lower, and that really matters. Also, the cluster has nothing to do with it, as this was testing reads, and clusters should perform basically the same since all they add is a load balancer across the nodes. Not to mention I tried every combo: just one server vs. two or three. Nothing made a difference. The diff was huge at the time, something like 3x.


For many (if not most) use cases, the latency is more important than the compute.


Definitely not true for my very common use case of Postgres + Hasura + Node API. The DB especially is way faster with more compute. You're talking ~300+ms gains in perf just in the DB layer, with the graph and node layers also getting significant bumps. I was seeing a single dedicated host in Virginia be way, way faster (like 80% faster reply times) than the top-of-the-line cluster in Phoenix (which at the time was very close to me).

If you are serving a page that needs to do a few joins across a non-trivial dataset that's all it requires to immediately tilt the equation completely back to raw CPU being the dramatic deciding factor in speed.


We are not. As a matter of fact we just renewed our annual contract with Heroku.

As disappointing as it is that Heroku is basically stalling, the fact is that it was light-years ahead of the competition in terms of developer ergonomics. Even to this day, it's still super convenient and reliable enough for us.

If anyone wants us to switch to their service, being as good as Heroku or slightly better isn't enough. They need to be _much better_ to justify the costs of a switch.


Agreed, but the security incidents last year were enough to get us to switch. Our customers started asking about it w/r/t SOC2.

The developer ergonomics of Heroku we didn't like were around running lots of background tasks and long-running jobs. Heroku's way of doing that has a 24-hour time limit, and you face a lot of hardware limitations. So a lot of data processing was a no-go.

If you have a 12-factor app that's not in a highly regulated industry and doesn't require a ton of background data processing, or you have invested in database architecture where you can run those tasks totally offline, I still think Heroku is currently the best option.


What features would make something "much better"?


I don't really have a good answer for that. I'm happy with what we have at Heroku. My pain points would be:

1. The annoying 30-second request timeout (which other platforms don't have)

2. Steep pricing (though I'm not really aware how it compares with Fly)

3. Lack of HTTP/2 support (which I think is coming to Heroku soon)

4. The feeling of stagnation around Heroku

And I'm not sure if resolving all of these would be enough of an incentive for me to make a move.


We considered moving onto Fly since we were transitioning away from Heroku. We ultimately decided on just AWS for our core products and digital ocean app engine for smaller experiments.

Fly's overall experience wasn't as smooth as Heroku, from the dashboards to the weird errors for technical things that should work but didn't. The logging and error handling wasn't as informative as it should be. In essence we agreed with the value proposition of "give us PaaS magic with more control over the infrastructure than Heroku" but it wasn't sufficiently magical. The whole low-latency cdn-like distribution angle wasn't really relevant to our use-case.


I found your post by searching “fly.io” to see if there was anyone else reporting problems with their hosted Postgres. I seemingly can’t make migrations and all I can find is a community post that’s slowly growing in responses where it was initially reported four days ago :/

I guess I’ll try out Render?


I’m on the fence. I don’t mind outages and as a relatively new service, there’s some expectation that there will be outages but the frequency and similarity of the outages is a little disconcerting. I’ve not yet moved off but I am reconsidering my choice to use them for production services when there are a variety of alternatives — Google Cloud Run is very reliable.

The unique aspect of their service (ability to containerise an application on your behalf) is not the important part for me so the only thing keeping me on Fly at the moment is inertia — which is a shame, I want to love Fly.


Had way too many errors trying to set things up, so I just switched to render.com, which I love.


Will you email your app details to support (cc me too, if you want)? If your app is 502ing, it's unrelated to yesterday's outage.


I considered Fly for our Rails app currently hosted on Dokku, but even 3 months ago there were grumblings that it was flaky and not quite suitable yet for production. So now we are considering Northflank, Render (if they get a London region) or Digital Ocean Kubernetes Service.


You get what you pay for :-/.

Everyone else seems to be more expensive.


I like them. But the outages have been tough.


Yes. There's a lot to like, but they're not evolving quickly enough. There are so many rough edges.


Ironically, there's a new unresolved incident now, so I was initially not sure which one they were referring to…


I was planning to use fly.io for my next project!

What are people using these days to deploy a Node app (Fastify/SQLite backend + vanilla JS front end)? The last time I deployed anything, it was a Rails app to an EC2 instance via Capistrano, but that was eons ago.


I use fly.io for a side project and I have about 5 servers running there. I still think it's viable for side projects and my assumption is that the outages right now are growing pains that will get better with time.

I can totally understand why people using Fly in production would be frustrated and would consider moving. My product is free though, and I have fewer than 50 users a day. To me, Fly's low price and ease of use are worth the instability, since nothing I'm doing is mission critical.


disclaimer: I'm one of the founders

We're building Klotho[0] for many of the reasons mentioned here (happy to answer questions). We transform plain code into cloud-native code. The majority of the complexity is moved into the Klotho compiler, and what devs handle is the simplest bundle that's easy to deploy and operate on public clouds using standard tools.

[0] https://github.com/KlothoPlatform/klotho


This submission broke the site rules by drastically editorializing the title. From https://news.ycombinator.com/newsguidelines.html: "Please use the original title, unless it is misleading or linkbait; don't editorialize."

Also, if you want to make an Ask HN, those are supposed to be text posts.

Normally I'd bury this altogether but because this is a YC startup and we moderate less in such cases*, I'm going to moderate it less in this case. Please don't do this in the future though.

* https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...


Apologies, I meant to post this as an actual `Ask HN` to get the community's thoughts. I accidentally left the link in (which I was originally planning on posting with a different title).

I can't edit the post, but I'd be happy to have the link dropped and https://news.ycombinator.com/item?id=34742946#34742947 turned into the "Ask HN" text if that would make it better. That's what I was trying to do in the first place...


Ok, I've done that now. Thanks for the explanation!


Thank you!


[deleted]


Just in case any folks are thinking about switching and wondering where else to go, there have generally been two alternatives:

- Another PaaS: people love Render, Railway, Vercel. Heroku is still best-in-class even if not free. Replit has a PaaS built in now too that is getting very real.

- Going to the raw cloud, e.g. AWS or GCP. As much as folks say that Terraform or Pulumi or CDK has made this easy, it's just really not the same thing to get a great developer experience without a ton of work.

There's a new class of tools emerging that represents a 3rd way. withcoherence.com (I'm a cofounder) gives you the preview environments, built-in pipelines, and friendly UX that Vercel has set the standard with, while operating against your GCP or AWS account. Lock-in, uptime, service diversity, compliance, and pricing are all better on AWS/GCP than a PaaS. Coherence even adds a built-in Cloud IDE, giving you a gitpod or github codespace alternative with zero additional config or integration work.

Most of the "PaaS in your own cloud" category is a pile of Kubernetes abstraction. Coherence is something different: a real alternative for teams that are used to a great workflow but who don't want to invest the time to glue together open source and IaaS, or who aren't a fit for enterprise-grade CNCF-based tooling.

If anyone wants to check it out more or has any questions, happy to answer them or to help with migrations - just hit up hn@withcoherence.com!



