Have a meeting today with our AWS account team about how we’re no longer going to be “All in on AWS” as we diversify workloads away. The decision was mostly about the pace of innovation on core services slowing and AWS being too far behind on AI services, so we’re buying those from elsewhere.
The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud. Should be a fun meeting!
Once you've had an outage on AWS, Cloudflare, Google Cloud, or Akismet, what are you going to do? Host in house? None of them seem to be immune from an outage at some point. Get your refund and carry on. It's less work for the same outcome.
Yeah, just double++ the cost to have a clone of all your systems. Worth it if you need to guarantee uptime. Although it also doubles your exposure to potential data breaches.
++double: spoken as "triple" -> team says that double++ was a joke, we can obviously only double the cost -> embarrassingly you quickly agree -> team laughs -> team approves doubling -> you double the cost -> team goes out for beers -> everyone is happy
double++: spoken as "double" -> team quickly agrees and signs off -> you consequently triple the cost per C precedence rules -> manager goes ballistic -> you blithely recount the history of C operator precedence in a long monotone -> job returns EINVAL -> beers = 0
Shouldn't be double in the long term. Think of the second cloud as a cold standby. Depends on the system. Periodic replication of the data layer (object storage/database), plus CI/CD configured to build services and VMs on multiple clouds (rough sketch below). Run automated end-to-end tests weekly or monthly, and scaled tests semi-annually.
This is all very, very hand-wavey. And if one says "golly gee, all our config is too cloud-specific to do multi-cloud", then you've figured out why cloud blows: there is no inherent reason not to have API standards for mature cloud services like serverless functions, VMs, and networks.
Edit to add: I know how grossly simplified this is, and that most places have massively complex systems.
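To make the "periodic replication of the data layer" part slightly less hand-wavey, here's a minimal sketch of a cron-able job that copies objects from an S3 bucket into a GCS bucket acting as the cold standby. The bucket names are placeholders, it assumes boto3 and google-cloud-storage are installed and credentialed for both clouds, and a real setup would also handle deletes, versioning, and egress costs:

```python
# Minimal sketch: periodically copy new objects from an S3 bucket to a GCS
# bucket so the second cloud can act as a cold standby for the data layer.
# Bucket names and the schedule are placeholders.
import boto3
from google.cloud import storage

S3_BUCKET = "prod-data"           # hypothetical source bucket
GCS_BUCKET = "prod-data-standby"  # hypothetical standby bucket

def replicate_once():
    s3 = boto3.client("s3")
    gcs_bucket = storage.Client().bucket(GCS_BUCKET)

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=S3_BUCKET):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            blob = gcs_bucket.blob(key)
            # Skip objects the standby already has (cheap idempotency check).
            if blob.exists():
                continue
            body = s3.get_object(Bucket=S3_BUCKET, Key=key)["Body"].read()
            blob.upload_from_string(body)

if __name__ == "__main__":
    replicate_once()  # run from cron/CI on whatever cadence you test against
```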
And data egress fees just to get the clone set up, right? This doesn’t seem feasible as a macrostrategy. Maybe for a small number of critical services.
If you use something like CockroachDB you can run a multi-master cluster and use regional-by-row tables to locate data close to users. It'll fail over to other regions if needed.
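For anyone who hasn't seen it, a minimal sketch of what that looks like with CockroachDB's multi-region SQL (v21.1+), driven here through psycopg2. The database, table, and region names are placeholders; the actual region strings come from your cluster's locality settings:

```python
# Minimal sketch of CockroachDB multi-region SQL: survive a region failure
# and pin each row near its users. Names are placeholders.
import psycopg2

conn = psycopg2.connect("postgresql://root@localhost:26257/app?sslmode=disable")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute('ALTER DATABASE app SET PRIMARY REGION "us-east1"')
    cur.execute('ALTER DATABASE app ADD REGION "us-west1"')
    cur.execute('ALTER DATABASE app ADD REGION "europe-west1"')
    # Tolerate the loss of an entire region, not just a single node/AZ.
    cur.execute("ALTER DATABASE app SURVIVE REGION FAILURE")
    # Each row is homed in the region stored in its hidden crdb_region column.
    cur.execute("ALTER TABLE users SET LOCALITY REGIONAL BY ROW")
```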
Why not host in house? If you have an application with stable resource needs, it can often be the cheaper and more stable option. At a certain scale, you can buy the servers, hire a sysadmin, and still spend less money than relying on AWS.
If you have an app that experiences 1000x demand spikes at unpredictable times then sure, go with the cloud. But there are a lot of companies that would be better off if they seriously considered their options before choosing the cloud for everything.
I totally agree with you. Where I work, we self-host almost everything. Exceptions are we use a CDN for one area where we want lower latency, and we use BigQuery when we need to parse a few billion datapoints into something usable.
It's amazing how few problems we have. Honestly, I don't think we have to worry about configuration issues as often as people who rely on the cloud.
Not GP, but my company also self-hosts. We rent rackspace in a colo. We used to keep my team's research server in the back closet before we went full-remote.
This. When Andy Jassy got challenged by analysts on the last earnings call about why AWS has fallen so far behind on innovation in some areas, his answer was a hand-wavy response that diverted attention: AWS is durable, stable, and reliable, and customers care more about that. Oops.
The culture changed. When I first worked there, I was encouraged to take calculated risks. When I did my second tour of duty, people were deathly afraid of bringing down services. It has been a while since my second tour of duty, but I don't think it's back to "Amazon is a place where builders can build".
Somewhat inevitable for any company as they get larger. Easy to move fast and break things when you have 1 user and no revenue. Very different story when much of US commerce runs on you.
For folks who came of age in the late 00's, seeing companies once thought of as disruptors and innovators become the old stalwarts post-pandemic/ZIRP has been quite an experience.
Maybe those who have been around longer have seen this before, but it's the first time for me.
If you bring something down in a real way, you can forget about someone trusting you with a big project in the future. You basically need to switch orgs.
Nah, I used to work for defense contractors, and worked with ex-military people, so...
Anyway, I actually loved my first time at AWS. Which is why I went back. My second stint wasn't too bad, but I probably wouldn't go back, unless they offered me a lot more than what I get paid, but that is unlikely.
I listened to the earnings call. I believe the question was mostly focused on why AWS has been so behind on AI. Jassy did flub the question quite badly and rambled on for a while. The press has mentioned the botched answer in a few articles recently.
They have been pushing me and my company extremely hard to vet their various AI-related offerings. When we decide to look into whatever service it is, we come away underwhelmed. It seems like their biggest selling point so far is “we’ll give it to you free for several months”. Not great.
In fairness, that's been my experience with everyone except OpenAI and Anthropic where I only occasionally come out underwhelmed
Really I think AWS does a fairly poor job bringing new services to market and it takes a while for them to mature. They excel much more in the stability of their core/old services--especially the "serverless" variety like S3, SQS, Lambda, EC2-ish, RDS-ish (well, today notwithstanding)
I honestly feel bad for the folks at AWS whose job it is to sell this slop. I get AWS is in panic mode trying to catch up, but it’s just awful and frankly becoming quite exhausting and annoying for customers.
The comp might be decent but most folks I know that are still there say they’re pretty miserable and the environment is becoming toxic. A bit more pay only goes so far.
Everything except us-east-1 is generally pretty reliable. At $work we have a lot of stuff that's only in eu-west-1 (yes, not best practice) and we haven't had any issues, touch wood.
My impression is that `us-east-1` has the worst reliability track record of any region. We've always run our stuff in `us-west-2` and there has never been an outage that took us down in that region. By contrast, a few things that we had in `us-east-1` have gone down repeatedly.
It’s the “original” AWS region. It has the most legacy baggage, the most customer demand (at least in the USA), and it’s also the region that hosts the management layer of most “global” services. Its availability has also been dogshit, but because companies only care about costs today and not harms tomorrow, they usually hire or contract out to talent that similarly only cares about the bottom line today and throws stuff into us-east-1 rather than figure out AZs and regions.
The best advice I can give to any org in AWS is to get out of us-east-1. If you use a service whose management layer is based there, make sure you have break-glass processes in place or, better yet, diversify to other services entirely to reduce/eliminate single points of failure.
Former AWS employee here. There's a number of reasons but it mostly boils down to:
It's both the oldest and largest (most ec2 hosts, most objects in s3, etc) AWS region, and due to those things it's the region most likely to encounter an edge case in prod.
> The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud. Should be a fun meeting!
This isn't and never was true. I've done setups in the past where monitoring happened "multi cloud", spread across several clouds plus multiple dedicated servers. It was broad enough that you could actually see where things broke.
Was quite some time ago so I don't have the data, but AWS never came out on top.
It largely matched what netcraft.com put out. Not sure if they still do that and release those numbers to the public.
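Roughly the kind of thing I mean, as a minimal sketch: the same probe run from VMs on a couple of clouds plus a dedicated box, with results shipped somewhere outside any single provider. The target URLs and hostnames here are placeholders:

```python
# Minimal sketch of a multi-vantage uptime probe: run this from several
# providers and compare, so you can tell which side actually broke.
import socket
import time
import urllib.request

TARGETS = ["https://example.com/health", "https://api.example.com/health"]
VANTAGE = socket.gethostname()  # e.g. aws-probe-1, hetzner-probe-1

def probe(url, timeout=5):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, time.monotonic() - start
    except Exception as exc:
        return exc.__class__.__name__, time.monotonic() - start

if __name__ == "__main__":
    for url in TARGETS:
        status, elapsed = probe(url)
        # In a real setup you'd ship this to a store outside any single cloud.
        print(f"{time.time():.0f} {VANTAGE} {url} {status} {elapsed:.3f}s")
```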
AWS has been in long-term decline; most of the platform is just in keeping-the-lights-on mode. It's also why they are behind on AI: a lot of would-be innovative employees get crushed under red tape and performance management.
This is the real problem. Even if you don't run anything in AWS directly, something you integrate with will. And when us-east-1 is down, it doesn't matter if those services are in other regions. AWS's own internal services rely heavily on us-east-1, and most third-party services live in us-east-1.
It really is a single point of failure for the majority of the Internet.
This becomes the reason to run in us-east-1 if you're going to be single region. When it's down nobody is surprised that your service is affected. If you're all-in on some other region and it goes down you look like you don't know what you're doing.
> Even if you don't run anything in AWS directly, something you integrate with will.
Why would a third party be in your product's critical path? It's like the old business school thing about "don't build your business on the back of another".
It's easy to say this, but in the real world, most of the critical path is heavily dependent on third-party integrations. User auth, storage, logging, etc. Even if you're somewhat resilient to failures (i.e. you can live without logging and your app doesn't hard-fail; see the sketch below), an outage is still potentially going to cripple your service. And even if your entire app is resilient and doesn't fail, there are still bound to be tons of integrations that will limit functionality or make the app appear broken in some way to users.
The reason third-party things are in the critical path is because most of the time, they are still more reliable than self-hosting everything; because they're cheaper than anything you can engineer in-house; because no app is an island.
It's been decades since I worked on something that was completely isolated from external integrations. We do the best we can with redundancy, fault tolerance, auto-recovery, and balance that with cost and engineering time.
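To illustrate the "you can live without logging" kind of resilience, here's a minimal sketch of fencing off a non-essential third-party call with a short timeout and a no-op fallback. The analytics endpoint and function names are hypothetical:

```python
# Minimal sketch: give non-essential third-party calls a short timeout and a
# silent fallback so an upstream outage degrades the app instead of killing it.
import json
import logging
import urllib.request

ANALYTICS_URL = "https://analytics.example.com/events"  # hypothetical endpoint

def track_event(event: dict) -> None:
    """Best-effort: never let the analytics vendor break the request path."""
    try:
        req = urllib.request.Request(
            ANALYTICS_URL,
            data=json.dumps(event).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=2)  # fail fast rather than pile up
    except Exception:
        # Degrade silently (or count failures for alerting); the user flow
        # continues either way.
        logging.getLogger(__name__).warning("analytics unavailable, dropping event")

def handle_request(user_id: str) -> str:
    track_event({"type": "page_view", "user": user_id})
    return "ok"  # core functionality is unaffected by the vendor outage
```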
If you think this is bad, take a look at the uptime of complicated systems that are 100% self-hosted. Without a Fortune 500 level IT staff, you can't beat AWS's uptime.
Clearly these are non-trivial trade-offs, but I think using third parties is not an either-or question. Depending on the app and the type of third-party service, you may be able to make design choices that allow your systems to survive a third-party outage for a while.
E.g., a hospital could keep recent patient data on-site and sync it up with the central cloud service as and when that service becomes available. Not all systems need to be linked in real time. Sometimes it makes sense to create buffers.
But the downside is that syncing things asynchronously creates complexity that can itself be the cause of outages or, worse, data corruption.
I guess it's a decision that can only be made on a case by case basis.
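As a rough illustration of the buffering idea in this subthread, here's a minimal sketch of a local SQLite outbox that holds records until the central service is reachable again. The upload hook and table layout are placeholders, and a real system would need idempotency keys and conflict handling to avoid the corruption risk mentioned above:

```python
# Minimal sketch: write records to a local SQLite outbox first, then flush to
# the central cloud service whenever it's reachable.
import json
import sqlite3

db = sqlite3.connect("local_buffer.db")
db.execute("CREATE TABLE IF NOT EXISTS outbox (id INTEGER PRIMARY KEY, payload TEXT)")

def record_locally(record: dict) -> None:
    db.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(record),))
    db.commit()

def flush_to_cloud(upload) -> None:
    """Call with an `upload(dict) -> bool` hook; rows are removed only on success."""
    rows = db.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
    for row_id, payload in rows:
        if not upload(json.loads(payload)):
            break  # cloud still down; try again on the next sync pass
        db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
        db.commit()
```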
Not necessarily in our critical path, but today CircleCI was affected greatly, which also affected our capacity to deploy. Luckily it was a Monday morning, so we didn't even have to deploy a hotfix.
Good luck naming a large company, bank, even utility that doesn't have some kind of dependency like this somewhere, even if they have mostly on-prem services.
The only ones I can really think of are the cloud providers themselves. I was at Microsoft, and absolutely everything was in-house (often to our detriment).
I think you missed the "critical path" part. Why would your product stop functioning if your admins can't log in with IAM / VPN in, do you really need hands-on maintenance constantly? Why would your product stop functioning if Office is down, are you managing your ops in Excel or something?
"Some kind of dependency" is fine and unavoidable, but well-architected systems don't have hard downtime just because someone somewhere you have no control over fucked up.
Since 2020, for some reason, a lot of companies have a fully remote workforce. If the VPN or auth goes down and workers can't log in, that's a problem. Think banks, call center work, customer service.
Glad that you're taking the first step toward resiliency. At times, big outages like these are necessary to give a good reason why the company should go multi-cloud. When things are working without problems, no one cares to listen to the squeaky wheel.
I would be interested in a follow up in 2-3 years as to whether you've had fewer issues with a multi-cloud setup than just AWS. My suspicion is that will not be the case.
Still no serverless inference for models or inference pipelines that aren't available on Bedrock, still no auto-scaling GPU workers. We started bothering them in 2022... crickets.