You could cut your MongoDB costs by 100% by not using it ;)
> without sacrificing performance or reliability.
You're using a single server in a single datacenter. MongoDB Atlas is deployed to VMs on 2-3 AZs. You don't have close to the same reliability. (I'm also curious why their M40 instance costs $1000, when the Pricing Calculator (https://www.mongodb.com/pricing) says M40 is $760/month? Was it the extra storage?)
> We're building Prosopo to be resilient to outages, such as the recent massive AWS outage, so we use many different cloud providers
This means you're going to have multiple outages, AND incur more cross-internet costs. How does going to Hetzner make you more resilient to outages? You have one server in one datacenter. Intelligent, robust design at one provider (like AWS) is way more resilient, and intra-zone transfer is cheaper than going out to the cloud ($0.02/GB vs $0.08/GB). You do not have a centralized or single point of failure design with AWS. They're not dummies; plenty of their services are operated independently per region. But they do expect you to use their infrastructure intelligently to avoid creating a single point of failure. (For example, during the AWS outage, my company was in us-east-1, and we never had any issues, because we didn't depend on calling AWS APIs to continue operating. Things already running continue to run.)
I get it; these "we cut bare costs by moving away from the cloud" posts are catnip for HN. But they usually don't make sense. There's only a few circumstances where you really have to transfer out a lot of traffic, or need very large storage, where cloud pricing is just too much of a premium. The whole point of using the cloud is to use it as a competitive advantage. Giving yourself an extra role (sysadmin) in addition to your day job (developer, data scientist, etc) and more maintenance tasks (installing, upgrading, patching, troubleshooting, getting on-call, etc) with lower reliability and fewer services, isn't an advantage.
> Intelligent, robust design at one provider (like AWS) is way more resilient, and intra-zone transfer is cheaper than going out to the cloud ($0.02/GB vs $0.08/GB).
If traffic cost is relevant (which it is for a lot of use cases), Hetzner's price of $1.20/TB ($0.0012 / GB) for internet traffic [1] is an order of magnitude less than what AWS charges between AWS locations in the same metro. If you host only at providers with reasonable bandwidth charges, most likely all of your bandwidth will be billed at less than what AWS charges for inter-zone traffic. That's obscene. As far as I can tell, clouds are balancing their budgets on the back of traffic charges, but nothing else feels under cost either.
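To put rough numbers on it, here's the same comparison as plain arithmetic, using only the per-GB rates quoted above; the 50 TB/month volume is purely an illustrative assumption, not anyone's real bill:

```python
# Rough monthly egress-cost arithmetic using the per-GB rates cited above.
# The 50 TB/month volume is hypothetical.
rates_per_gb = {
    "Hetzner internet": 0.0012,   # ~$1.20/TB
    "AWS inter-AZ": 0.02,         # cross-AZ within a region
    "AWS internet egress": 0.08,  # typical first-tier egress rate
}

monthly_gb = 50 * 1000  # 50 TB/month, hypothetical

for label, rate in rates_per_gb.items():
    print(f"{label}: ${monthly_gb * rate:,.2f}/month")
# Hetzner internet: $60.00/month
# AWS inter-AZ: $1,000.00/month
# AWS internet egress: $4,000.00/month
```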
> For example, during the AWS outage, my company was in us-east-1, and we never had any issues, because we didn't depend on calling AWS APIs to continue operating. Things already running continue to run.
This doesn't always work out. During the GCP outage, my service was running fine, but other similar services were having trouble, so we attracted more usage, which we would have scaled up for, except that the GCP outage prevented that. Cloud makes it very expensive to run scaled beyond current needs, promising instead that scale-out will be available just in time...
At some point our cross-AZ traffic for Elasticsearch replication at AWS was more expensive than what we'd pay to host the whole cluster replicated across multiple baremetal Hetzner servers.
Could we have done better with more sensible configs? Was it silly to cluster ES cross-AZ? Maybe. Point is that if you don't police every single detail of your platform at AWS/GCP and the like, their made-up charges will bleed your startup and grease their stock price.
Turns out cross-AZ is recommended for ES. Perhaps our data team was rewriting the indices too often, but that was an internal requirement, so I think the data schema could have been more efficient, appending deltas instead of reindexing everything. None of that will inflate your bill significantly at Hetzner; of course it will at AWS, as that's how they incentivise clients to optimize and reduce their impact. And that's how you cut your runway by 3-6 months in compute-heavy startups.
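For what it's worth, the "append deltas" approach is just partial bulk indexing instead of a full rebuild. A minimal sketch with the official elasticsearch Python client (the endpoint, index name, and changed-document feed are hypothetical):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def apply_deltas(changed_docs, index="products"):
    """Index only the documents that changed since the last run,
    instead of rebuilding the whole index (and re-replicating it)."""
    actions = (
        {
            "_op_type": "index",   # upsert-by-id semantics
            "_index": index,
            "_id": doc["id"],
            "_source": doc,
        }
        for doc in changed_docs
    )
    ok, errors = helpers.bulk(es, actions, raise_on_error=False)
    return ok, errors
```

Cross-AZ (or cross-node) replication traffic then stays roughly proportional to the delta rather than the whole corpus, which is where the bill difference comes from.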
> you're going to have multiple outages
us: 0, aws: 1. Looking good so far ;)
> AND incur more cross-internet costs
Hetzner has no bandwidth/traffic limit on the machine (only a speed cap), so we can go nuts.
I understand your point wrt the cloud, but I spend as much time debugging/building a cloud deployment (Atlas :eyes:) as I do a self-hosted solution. AWS gives you all the tools to build a super reliable data store, but many people just chuck something on us-east-1 and go. There's your single point of failure.
Given we're constructing a many-node decentralised system, self-hosted actually makes more sense for us because we've already had to become familiar enough to create a many-node system for our primary product.
When/if we have a situation where we need high data availability I would strongly consider the cloud, but in the situations where you can deal with a bit of downtime you're massively saving over cloud offerings.
We'll post a 6-month and 1-year follow-up to update the scoreboard above
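For anyone curious what this looks like from the application side: a replica set spanning providers is still just one connection string listing every member. A minimal pymongo sketch with placeholder hostnames and names (not our actual topology):

```python
from pymongo import MongoClient, ReadPreference

# Illustrative only: placeholder hostnames for nodes at different providers.
# The driver knows every member, so it keeps working as long as a majority
# of members can elect a primary.
client = MongoClient(
    "mongodb://db1.hetzner.example.com:27017,"
    "db2.ovh.example.com:27017,"
    "db3.other.example.com:27017/"
    "?replicaSet=rs0&tls=true",
    serverSelectionTimeoutMS=5000,
)

coll = client.get_database("captcha").get_collection(
    "solutions", read_preference=ReadPreference.SECONDARY_PREFERRED
)
```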
> many people just chuck something on us-east-1 and go
Even dropping something on a single EC2 node in us-east-1 (or at Google Cloud) is going to be more reliable over time than a single dedicated machine elsewhere.
This is because they run with a layer that will e.g. live migrate your running apps in case of hardware failures.
The failure modes of dedicated are quite different than those of the modern hyperscaler clouds.
It's not an apples-to-apples comparison, because EC2 and Google Cloud have ephemeral disk - persistent disk is an add-on, which is implemented with a complex and frequently changing distributed storage system
On the other hand, a Hetzner machine I just rented came with Linux software RAID enabled (md devices in the kernel)
---
I'm not aware of any comparisons, but I'd like to see some
It's not straightforward, and it's not obvious the cloud is more reliable
The cloud introduces many other single points of failure, by virtue of being more complex
e.g. human administration failure, with the Unisuper incident
Of course, dedicated hardware could have a similar type of failure, but I think the simplicity means there is less variety in the errors.
e.g. A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable - Leslie Lamport
I just wish there was a way to underscore this more and more. Complex systems fail in complex ways. Sadly, for many programmers, the thrill or ego boost that comes with solving/managing complex problems lets us believe complex is better than simple.
One side effect of devops over the last 10-15 years, as dev and ops converged, is that infrastructure complexity exploded: the old-school pessimistic sysadmin culture of simplicity and stability gave way to a much more optimistic dev culture. Better tooling also enabled increased complexity, in a self-fulfilling feedback loop where more complexity demanded better tooling.
Anecdotal, but a year ago we lost the whole RAID array in a rented Hetzner server to some hardware failure.
In a way, I think it doesn't matter what you use as long as you diversify enough (and have lots of backups), as everything can fail, and often the probability of failure doesn't even matter that much as any failure can be one too many.
Let's host it all with 2 companies instead and see how it goes.
Anyway, random things you will encounter:
Azure doesn't work because Front Door has issues (again, and again).
A web app in Azure just randomly stops working; it's not live migrated by any means, and restarts don't work. Okay, let's change the SKU, change it back, oops, it's on a different bare-metal cluster and now it works again. Sure, there'll be some setup (read: upsell) that'll prevent such failures from reaching customers, but there is simply no magic to any of this.
Really wish people would stop dreaming up reasons that hyperscalers are somehow magical places where issues don't happen and everything is perfect if you just increase the complexity a little bit more the next time around.
Hardware failures on server hardware at the scale of 1 machine are far less common than us-east-1 downtime
The typical failure mode of AWS is much better. Half the internet is down, so you just point at that and wait for everything to come back, and your instances just keep running. If you have one server you have to do the troubleshooting and recovery work. But you need to run more than one machine to get more nines of reliability.
> Hardware failures on server hardware at the scale of 1 machine are far less common than us-east-1 downtime
A couple pieces of gentle pushback here:
- If you choose a hyperscaler, use their (often one-click) geographic redundancy & failover; see the sketch after this list.
- All of the hyperscalers have more than one AZ. Specifically, there's no reason for any AWS customer to locate all/any* of their resources in us-east-1. (I actively recommend against this.)
* - Except for the small number of services only available in us-east-1, obviously.
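On the first point, here's a boto3 sketch of the "checkbox" kind of redundancy I mean, using RDS Multi-AZ as one example (AZ-level failover rather than cross-region, and all identifiers and credentials below are placeholders):

```python
import boto3

rds = boto3.client("rds", region_name="us-east-2")

# A Multi-AZ RDS instance keeps a synchronous standby in another AZ and
# fails over automatically. Identifiers/credentials here are placeholders.
rds.create_db_instance(
    DBInstanceIdentifier="app-db",
    Engine="postgres",
    DBInstanceClass="db.m6g.large",
    AllocatedStorage=100,
    MasterUsername="appadmin",
    MasterUserPassword="change-me",  # use a secrets manager in practice
    MultiAZ=True,                    # the one flag doing the work here
)
```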
Hetzner also offers more than one datacenter, which you should obviously use if you want geographic redundancy. But the comment I was replying to was saying "Even dropping something on a single EC2 node in us-east-1", and for a single EC2 node in us-east-1, none of the things you are mentioning are possible without violating the premise.
Thanks for sharing the story and committing to a 6-month and 1-year follow-up. We will definitely be interested to hear how it goes over time.
In the meantime, I am curious where the time was spent debugging and building Atlas deployments? It certainly isn't the cheapest option, but it has been quite a '1 click' solution for us.
I'm curious about the resilience bit. Are you planning on some sort of active-active setup with Mongo? I found it difficult on AWS to even do active-passive (I guess that was DocDB), since programmatically changing the primary write node instance was kind of a pain when failing over to a new region.
Going into any depth with mongo mostly taught me to just stick with postgres.
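For what it's worth, with a plain MongoDB replica set (rather than DocumentDB) the primary change is mostly handled by elections, and forcing one programmatically is a single admin command. A hedged pymongo sketch with placeholder hostnames:

```python
from pymongo import MongoClient
from pymongo.errors import AutoReconnect

# Placeholder hosts; connect to the replica set, not a single node.
client = MongoClient("mongodb://node-a:27017,node-b:27017/?replicaSet=rs0")

try:
    # Ask the current primary to step down for 120s so a secondary is elected.
    # Drivers that know the whole replica set re-discover the new primary
    # without any connection-string change.
    client.admin.command("replSetStepDown", 120)
except AutoReconnect:
    # Older server versions drop the connection when the primary steps down;
    # the driver reconnects and finds the new primary on its own.
    pass
```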
> You're using a single server in a single datacenter.
This is a common problem with “bare metal saved us $000/mo” articles. Bare metal is cheaper than cloud by any measure, but the comparisons given tend to be misleadingly exaggerated as they don't compare like-for-like in terms of redundancy and support, and after considering those factors it can be a much closer result (sometimes down as far as familiarity and personal preference being more significant).
Of course unless you are paying extra for multi-region redundancy things like the recent us-east-1 outage will kill you, and that single point of failure might not really matter if there are several others throughout your systems anyway, as is sometimes the case.
I think the problem is that the multi-az redundancy in AWS setups has saved me exactly zero times. The problem is nearly always some application issue.
If I'm storing data on a NAS, and I keep backups on a tape, a simple hardware failure that causes zero downtime on S3 might take what, hours to recover? Days?
If my database server dies and I need to boot a new one, how long will that take? If I'm on RDS, maybe five minutes. If it's bare metal and I need to install software and load my data into it, perhaps an hour or more.
Being able to recover from failure isn't a premature optimization. "The site is down and customers are angry" is an inevitability. If you can't handle failure modes in a timely manner, you aren't handling failure modes. That's not an optimization, that's table stakes.
It's not about five nines, it's about four nines or even three nines.
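To make the RDS recovery point a bit more concrete: replacing a dead database instance is roughly one API call against an existing snapshot plus a wait (identifiers below are placeholders, and actual restore time varies with instance and data size):

```python
import boto3

rds = boto3.client("rds", region_name="us-east-2")

# Placeholder identifiers. The point is that "boot a new database server"
# collapses to one API call and a wait, versus reinstalling software and
# reloading data by hand on bare metal.
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="app-db-restored",
    DBSnapshotIdentifier="rds:app-db-2025-01-01-00-00",
    DBInstanceClass="db.m6g.large",
)

waiter = rds.get_waiter("db_instance_available")
waiter.wait(DBInstanceIdentifier="app-db-restored")
```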
Backups are point-in-time snapshots of data, often created daily and sometimes stored on tape.
Their primary use case is giving admins the ability to, e.g., restore partial data via export and similar. They can theoretically also be used to recover from a full data loss, but that's beyond rare; almost no company has had that issue.
This is generally not what's used in high-availability contexts. Usually, companies have at least one read-only replica DB that only needs to be "activated" in case of crashes or other disasters.
With that setup you're already able to hit 5 nines, especially in the context of B2E companies, which usually exclude scheduled downtime from the SLA anyway.
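For reference, the nines translate to allowed downtime roughly like this:

```python
# Allowed downtime per year for a given number of nines (365.25-day year).
MINUTES_PER_YEAR = 365.25 * 24 * 60

for nines in (3, 4, 5):
    availability = 1 - 10 ** -nines
    downtime_min = MINUTES_PER_YEAR * (1 - availability)
    print(f"{nines} nines ({availability:.5f}): ~{downtime_min:.1f} min/year")

# 3 nines (0.99900): ~526.0 min/year (~8.8 hours)
# 4 nines (0.99990): ~52.6 min/year
# 5 nines (0.99999): ~5.3 min/year
```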
> With that setup you're already able to hit 5 nines
This is "five nines every year except that one year we had two freak hardware failures at the same time and the site was hard down for eighteen hours".
"Almost no company has this problem" well I must be one incredibly unlucky guy, because I've seen incidents of this shape at almost every company I've worked at.
You have to look at all the factors; a simple server in a simple datacenter can be very, very stable. When we were all doing bare-metal servers back in the day, server uptimes measured in years weren't that rare.
This is true. Also some things are just fine, in fact sometimes better (better performing at the scale they actually need and easier to maintain, deploy, and monitor), as a single monolith instead of a pile of microservices. But when comparing bare metal to cloud it would be nice for people to acknowledge what their solution doesn't give, even if the acknowledgement comes with the caveat “but we don't care about that anyway because <blah>”.
And it isn't just about 9s of uptime, it is all the admin that goes with DR if something more terrible than a network outage does happen, and other infrastructure conveniences. For instance: I sometimes balk at the performance we get out of AzureSQL given what we pay for it, and in my own time you are safe to bet I'll use something else on bare metal, but while DayJob are paying the hosting costs I love the platform dealing with managing backup regimes, that I can do copies or PiT restores for issue reproduction and such at the click of a button (plus a bit of a wait), that I can spin up a fresh DB & populate it without worrying overly about space issues, etc.
I'm a big fan of managing your own bare metal. I just find a lot of other fans of bare metal to be more than a bit disingenuous when extolling its virtues, including cost-effectiveness.
It doesn't have to be one server in a single datacenter, though. It adds some complexity, but you could have a backup server ready to go at a different cheap provider (Hetzner and OVH, for example) and still save a lot.
> we never had any issues, because we didn't depend on calling AWS APIs to continue operating. Things already running continue to run.
I think it was just luck of the draw that the failure happened in this way and not some other way. Even if APIs falling over but EC2 instances remaining up is a slightly more likely failure mode, it means you can't run autoscaling, can't depend on spot instances which in an outage you can lose and can't replace.
> it means you can't run autoscaling, can't depend on spot instances which in an outage you can lose and can't replace
Yes, this is part of designing for reliability. If you use spot or autoscaling, you can't assume you will have high availability in those components. They're optimizations, like a cache. A cache can disappear, and this can have a destabilizing effect on your architecture if you don't plan for it.
This lack of planning is pretty common, unfortunately. Whether it's in a software component or system architecture, people often use a thing without understanding the implications of it. Then when AWS API calls become unavailable, half the internet falls over... because nobody planned for "what happens when the control plane disappears". (This is actually a critical safety consideration in other systems)
Sure, you can only use EC2, not use autoscaling or spot and instead just provision to your highest capacity needs, and not use any other AWS service that relies on dynamo as a dependency.
We still take some steps to mitigate control plane issues in what I consider a reasonable AWS setup (attempt to lock ASGs to prevent scale-down) but I place the control plane disappearing on the same level as the entire region going dark, and just run multi-region.
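The "lock the ASGs" mitigation looks roughly like this with boto3 (group name is a placeholder); note it only helps if it's applied before the control plane becomes unreachable:

```python
import boto3

asg = boto3.client("autoscaling", region_name="us-east-2")

GROUP = "web-asg"  # placeholder name

# Stop the group from terminating instances on its own...
asg.suspend_processes(
    AutoScalingGroupName=GROUP,
    ScalingProcesses=["Terminate", "ReplaceUnhealthy"],
)

# ...and mark the current instances as protected from scale-in.
instance_ids = [
    i["InstanceId"]
    for i in asg.describe_auto_scaling_groups(AutoScalingGroupNames=[GROUP])
    ["AutoScalingGroups"][0]["Instances"]
]
asg.set_instance_protection(
    InstanceIds=instance_ids,
    AutoScalingGroupName=GROUP,
    ProtectedFromScaleIn=True,
)
```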
I think you underestimate how reduction in complexity can increase reliability. Becoming a sysadmin for a single inexpensive server instance carries almost the same operational burden as operating an unavoidably very complicated cluster using a cloud provider.
Nowhere near the same. Admining a few servers is far easier than a mix of AWS cloud services, especially when they are either metal as a service or plain VMs.
Not if you are using Atlas. It's as simple as it can be, with way more functionality than you could ever admin yourself.
As others have said, unless the scale of the data is the issue, if you're switching because of cost, perhaps you should be revisiting your business model instead.
Not if you don't have hot replicated user data etc., assuming that matters, which it will unless you outsource auth, and if you do that you're back at square one.
It doesn't have to be only one server in one datacenter though.
It's more work, but you can have replicas ready to go at other Hetzner DCs (they offer bare metal at 3 locations in 2 different countries) or at other cheaper providers like OVH. Two or three $160 servers is still cheaper than what they're paying right now.
These types of posts make for excellent karma farming, but this one does present all the issues you've mentioned. Heck, Scaleway has managed Mongo for a bit more money and with redundancy and multi-AZ to boot. Were they trying to go as cheap as possible?
> For example, during the AWS outage, my company was in us-east-1, and we never had any issues, because we didn't depend on calling AWS APIs to continue operating. Things already running continue to run.
Naïve. If the network infrastructure is down, your computer goes down with it; it just happens that the functionality that went down wasn't functionality you relied on. By that logic, you could avoid relying on any functions at all by turning the server off.
I don't buy it. It really depends on your service, but I don't believe the reliability story. All large providers have had outages and I do host services on a single server that didn't have an outage in a few years.
Depends on the service and its complexity. More complexity means more outages. In most instances a focus on easy recoverability is more productive than preemptive "reliability". As I have said, depends on your service.
And prices get premium very fast if you have either a lot of traffic, or low traffic but large file interchange. And you have more work to do if you use the cloud, because it uses non-standard interfaces. Today a well-maintained server is a few clicks away. Even for managed servers you have maintenance and configuration. Plus, your provider probably changes the service quite often: I had to keep accommodating Beanstalk while my application was otherwise just running on its own, free of maintenance needs.
I've not actually seen an AZ go down in isolation, so whilst I agree it's technically a less "robust" deployment, in practice it's not that much of a difference.
> these "we cut bare costs by moving away from the cloud" posts are catnip for HN. But they usually don't make sense.
We moved away from Atlas because they couldn't cope with the data growth that we had (4TB is the max per DB). Turns out that it's a fuck load cheaper even hosting on Amazon (as in 50%). We haven't moved to Hetzner because that would be more effort than we really want to expend, but it's totally doable, with not that much extra work.
> more maintenance tasks (installing, upgrading, patching, troubleshooting, getting on-call, etc) with lower reliability and fewer services, isn't an advantage.
Depends, right? Firstly, it's not that much of an overhead, and if it saves you significant cash, then it extends your runway.
> I've not actually seen an AZ go down in isolation
Counterpoint: I have. Maybe not completely down, but degraded, or out of capacity on an instance type, or some other silly issue that caused an AZ drain. It happens.
While I agree, I remember we once had cross-region replication for some product but when AWS was down the service was down anyway because of some dependency. Things were working fine during our DR exercises, but when the actual failure arrived, cross-region turned out useless.
At FastComments we have Mongo deployed across three continents and four regions on dedicated servers with full disk encryption, across two major providers just in case. It was set up by one person. Replication lag is usually under 300ms.
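For the curious, checking that replication lag from the driver side is only a few lines; a sketch assuming pymongo and placeholder hostnames:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://primary.example.com:27017/?replicaSet=rs0")

# Replication lag per member from replSetGetStatus: the gap between the
# primary's last applied op time and each secondary's. Hostnames above
# are placeholders.
status = client.admin.command("replSetGetStatus")
primary_optime = next(
    m["optimeDate"] for m in status["members"] if m["stateStr"] == "PRIMARY"
)
for m in status["members"]:
    if m["stateStr"] == "SECONDARY":
        lag = (primary_optime - m["optimeDate"]).total_seconds()
        print(f"{m['name']}: {lag * 1000:.0f} ms behind primary")
```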
Usually AWS is pretty good at hiding all the reliability and robustness work that goes into making a customer's managed service. Customers are not made aware of what it takes.
An interesting experiment would be doing the equivalent at the scale of the median SaaS company.
Set up MongoDB (or any database) so that you have geographically distributed nodes with replication plus whatever else, and maintain the same SLA as one of the big hyperscalers. Blog about how long it took to set up, how hard it is to maintain, and what the ongoing costs are.
My hunch is that a setup at the scale of the median SaaS company is way simpler and more cost-effective than you'd think.
How often do the AZs actually matter? I feel like there's a major global outage on every cloud provider of choice at least every other year, yet I don't remember any outage where only a single AZ went down (I'm on AWS).
Fighting said outages is often made harder by the fact that the providers themselves just don't admit to anything being wrong: everything's green on the dashboard, yet 4 out of 5 requests are timing out.
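Which is why it pays to run your own dumb external probe rather than trusting the status page; a minimal sketch (the endpoint and sample count are made up):

```python
import requests  # third-party: pip install requests

PROBE_URL = "https://api.example.com/healthz"  # placeholder endpoint

def error_rate(samples: int = 20, timeout_s: float = 2.0) -> float:
    """Fraction of probe requests that time out or return a 5xx."""
    failures = 0
    for _ in range(samples):
        try:
            resp = requests.get(PROBE_URL, timeout=timeout_s)
            if resp.status_code >= 500:
                failures += 1
        except requests.RequestException:
            failures += 1
    return failures / samples

if __name__ == "__main__":
    print(f"probe error rate: {error_rate():.0%}")  # alert on this, not the dashboard
```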