Have a meeting today with our AWS account team about how we’re no longer going to be “All in on AWS” as we diversify workloads away. The decision was mostly about the pace of innovation on core services slowing and AWS being too far behind on AI services, so we’re buying those from elsewhere.
The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud. Should be a fun meeting!
Once you've had an outage on AWS, Cloudflare, Google Cloud, or Akismet, what are you going to do? Host in house? None of them seem to be immune from an outage at some point. Get your refund and carry on. It's less work for the same outcome.
Yeah, just double++ the cost to have a clone of all your systems. Worth it if you need to guarantee uptime. Although it also doubles your exposure to potential data breaches.
++double: spoken as "triple" -> team says that double++ was a joke, we can obviously only double the cost -> embarrassingly you quickly agree -> team laughs -> team approves doubling -> you double the cost -> team goes out for beers -> everyone is happy
double++: spoken as "double" -> team quickly agrees and signs off -> you consequently triple the cost per C precedence rules -> manager goes ballistic -> you blithely recount the history of C operator precedence in a long monotone -> job returns EINVAL -> beers = 0
Shouldn't be double in the long term. Think of the second cloud as a cold standby. Depends on the system. Periodic replication of the data layer (object storage/database), plus CI/CD configured to build services and VMs on multiple clouds (rough sketch below). Run automated end-to-end tests weekly or monthly, and scaled tests semi-annually.
This is all very, very hand-wavey. And if one says "golly gee, all our config is too cloud-specific to do multi-cloud", then you've figured out why cloud blows: there is no inherent reason not to have API standards for mature cloud services like serverless functions, VMs, and networks.
Edit to add: I know how grossly simplified this is, and that most places have massively complex systems.
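To make the "periodic replication of the data layer" part slightly less hand-wavey, here's a minimal sketch of a cron-able job that copies objects from an S3 bucket into a GCS bucket acting as the cold standby. The bucket names are placeholders, it assumes boto3 and google-cloud-storage are installed and credentialed for both clouds, and a real setup would also handle deletes, versioning, and egress costs:

```python
# Minimal sketch: periodically copy new objects from an S3 bucket to a GCS
# bucket so the second cloud can act as a cold standby for the data layer.
# Bucket names and the schedule are placeholders.
import boto3
from google.cloud import storage

S3_BUCKET = "prod-data"           # hypothetical source bucket
GCS_BUCKET = "prod-data-standby"  # hypothetical standby bucket

def replicate_once():
    s3 = boto3.client("s3")
    gcs_bucket = storage.Client().bucket(GCS_BUCKET)

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=S3_BUCKET):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            blob = gcs_bucket.blob(key)
            # Skip objects the standby already has (cheap idempotency check).
            if blob.exists():
                continue
            body = s3.get_object(Bucket=S3_BUCKET, Key=key)["Body"].read()
            blob.upload_from_string(body)

if __name__ == "__main__":
    replicate_once()  # run from cron/CI on whatever cadence you test against
```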
And data egress fees just to get the clone set up, right? This doesn’t seem feasible as a macrostrategy. Maybe for a small number of critical services.
If you use something like CockroachDB you can run a multi-master cluster and use regional-by-row tables to locate data close to users. It'll fail over to other regions if needed.
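For anyone who hasn't seen it, a minimal sketch of what that looks like with CockroachDB's multi-region SQL (v21.1+), driven here through psycopg2. The database, table, and region names are placeholders; the actual region strings come from your cluster's locality settings:

```python
# Minimal sketch of CockroachDB multi-region SQL: survive a region failure
# and pin each row near its users. Names are placeholders.
import psycopg2

conn = psycopg2.connect("postgresql://root@localhost:26257/app?sslmode=disable")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute('ALTER DATABASE app SET PRIMARY REGION "us-east1"')
    cur.execute('ALTER DATABASE app ADD REGION "us-west1"')
    cur.execute('ALTER DATABASE app ADD REGION "europe-west1"')
    # Tolerate the loss of an entire region, not just a single node/AZ.
    cur.execute("ALTER DATABASE app SURVIVE REGION FAILURE")
    # Each row is homed in the region stored in its hidden crdb_region column.
    cur.execute("ALTER TABLE users SET LOCALITY REGIONAL BY ROW")
```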
Why not host in house? If you have an application with stable resource needs, it can often be the cheaper and more stable option. At a certain scale, you can buy the servers, hire a sysadmin, and still spend less money than relying on AWS.
If you have an app that experiences 1000x demand spikes at unpredictable times then sure, go with the cloud. But there are a lot of companies that would be better off if they seriously considered their options before choosing the cloud for everything.
I totally agree with you. Where I work, we self-host almost everything. Exceptions are we use a CDN for one area where we want lower latency, and we use BigQuery when we need to parse a few billion datapoints into something usable.
It's amazing how few problems we have. Honestly, I don't think we have to worry about configuration issues as often as people who rely on the cloud.
Not GP, but my company also self-hosts. We rent rackspace in a colo. We used to keep my team's research server in the back closet before we went full-remote.
This. When Andy Jassy got challenged by analysts on the last earnings call about why AWS has fallen so far behind on innovation in some areas, his answer was a hand-wavy response that diverted attention: AWS is durable, stable, and reliable, and customers care more about that. Oops.
The culture changed. When I first worked there, I was encouraged to take calculated risks. When I did my second tour of duty, people were deathly afraid of bringing down services. It has been a while since my second tour of duty, but I don't think it's back to "Amazon is a place where builders can build".
Somewhat inevitable for any company as they get larger. Easy to move fast and break things when you have 1 user and no revenue. Very different story when much of US commerce runs on you.
For folks who came of age in the late 00's, seeing companies once thought of as disruptors and innovators become the old stalwarts post-pandemic/ZIRP has been quite an experience.
Maybe those who have been around longer have seen this before, but it's the first time for me.
If you bring something down in a real way, you can forget about someone trusting you with a big project in the future. You basically need to switch orgs.
Nah, I used to work for defense contractors, and worked with ex-military people, so...
Anyway, I actually loved my first time at AWS. Which is why I went back. My second stint wasn't too bad, but I probably wouldn't go back, unless they offered me a lot more than what I get paid, but that is unlikely.
I listened to the earnings call. I believe the question was mostly focused on why AWS has been so behind on AI. Jassy did flub the question quite badly and rambled on for a while. The press has mentioned the botched answer in a few articles recently.
They have been pushing me and my company extremely hard to vet their various AI-related offerings. When we decide to look into whatever service it is, we come away underwhelmed. It seems like their biggest selling point so far is “we’ll give it to you free for several months”. Not great.
In fairness, that's been my experience with everyone except OpenAI and Anthropic where I only occasionally come out underwhelmed
Really I think AWS does a fairly poor job bringing new services to market and it takes a while for them to mature. They excel much more in the stability of their core/old services--especially the "serverless" variety like S3, SQS, Lambda, EC2-ish, RDS-ish (well, today notwithstanding)
I honestly feel bad for the folks at AWS whose job it is to sell this slop. I get AWS is in panic mode trying to catch up, but it’s just awful and frankly becoming quite exhausting and annoying for customers.
The comp might be decent but most folks I know that are still there say they’re pretty miserable and the environment is becoming toxic. A bit more pay only goes so far.
Everything except us-east-1 is generally pretty reliable. At $work we have a lot of stuff that's only in eu-west-1 (yes, not best practice) and we haven't had any issues, touch wood.
My impression is that `us-east-1` has the worst reliability track record of any region. We've always run our stuff in `us-west-2` and there has never been an outage that took us down in that region. By contrast, a few things that we had in `us-east-1` have gone down repeatedly.
It’s the “original” AWS region. It has the most legacy baggage, the most customer demand (at least in the USA), and it’s also the region that hosts the management layer of most “global” services. Its availability has also been dogshit, but because companies only care about costs today and not harms tomorrow, they usually hire or contract out to talent that similarly only cares about the bottom line today and throws stuff into us-east-1 rather than figure out AZs and regions.
The best advice I can give to any org in AWS is to get out of us-east-1. If you use a service whose management layer is based there, make sure you have break-glass processes in place or, better yet, diversify to other services entirely to reduce/eliminate single points of failure.
Former AWS employee here. There's a number of reasons but it mostly boils down to:
It's both the oldest and largest (most ec2 hosts, most objects in s3, etc) AWS region, and due to those things it's the region most likely to encounter an edge case in prod.
> The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud. Should be a fun meeting!
This isn't and never was true. I've done setups in the past where monitoring happened "multi cloud", spread across several clouds plus multiple dedicated servers. It was broad enough that you could actually see where things broke.
Was quite some time ago so I don't have the data, but AWS never came out on top.
It largely matched what netcraft.com put out. Not sure if they still do that and release those numbers to the public.
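Roughly the kind of thing I mean, as a minimal sketch: the same probe run from VMs on a couple of clouds plus a dedicated box, with results shipped somewhere outside any single provider. The target URLs and hostnames here are placeholders:

```python
# Minimal sketch of a multi-vantage uptime probe: run this from several
# providers and compare, so you can tell which side actually broke.
import socket
import time
import urllib.request

TARGETS = ["https://example.com/health", "https://api.example.com/health"]
VANTAGE = socket.gethostname()  # e.g. aws-probe-1, hetzner-probe-1

def probe(url, timeout=5):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, time.monotonic() - start
    except Exception as exc:
        return exc.__class__.__name__, time.monotonic() - start

if __name__ == "__main__":
    for url in TARGETS:
        status, elapsed = probe(url)
        # In a real setup you'd ship this to a store outside any single cloud.
        print(f"{time.time():.0f} {VANTAGE} {url} {status} {elapsed:.3f}s")
```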
AWS has been in long-term decline; most of the platform is just in keeping-the-lights-on mode. It's also why they are behind on AI: a lot of would-be innovative employees get crushed under red tape and performance management.
This is the real problem. Even if you don't run anything in AWS directly, something you integrate with will. And when us-east-1 is down, it doesn't matter if those services are in other regions. AWS's own internal services rely heavily on us-east-1, and most third-party services live in us-east-1.
It really is a single point of failure for the majority of the Internet.
This becomes the reason to run in us-east-1 if you're going to be single region. When it's down nobody is surprised that your service is affected. If you're all-in on some other region and it goes down you look like you don't know what you're doing.
> Even if you don't run anything in AWS directly, something you integrate with will.
Why would a third party be in your product's critical path? It's like the old business school thing about "don't build your business on the back of another".
It's easy to say this, but in the real world, most of the critical path is heavily dependent on third-party integrations. User auth, storage, logging, etc. Even if you're somewhat resilient to failures (i.e. you can live without logging and your app doesn't hard-fail; see the sketch below), an outage is still potentially going to cripple your service. And even if your entire app is resilient and doesn't fail, there are still bound to be tons of integrations that will limit functionality or make the app appear broken in some way to users.
The reason third-party things are in the critical path is because most of the time, they are still more reliable than self-hosting everything; because they're cheaper than anything you can engineer in-house; because no app is an island.
It's been decades since I worked on something that was completely isolated from external integrations. We do the best we can with redundancy, fault tolerance, auto-recovery, and balance that with cost and engineering time.
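To illustrate the "you can live without logging" kind of resilience, here's a minimal sketch of fencing off a non-essential third-party call with a short timeout and a no-op fallback. The analytics endpoint and function names are hypothetical:

```python
# Minimal sketch: give non-essential third-party calls a short timeout and a
# silent fallback so an upstream outage degrades the app instead of killing it.
import json
import logging
import urllib.request

ANALYTICS_URL = "https://analytics.example.com/events"  # hypothetical endpoint

def track_event(event: dict) -> None:
    """Best-effort: never let the analytics vendor break the request path."""
    try:
        req = urllib.request.Request(
            ANALYTICS_URL,
            data=json.dumps(event).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=2)  # fail fast rather than pile up
    except Exception:
        # Degrade silently (or count failures for alerting); the user flow
        # continues either way.
        logging.getLogger(__name__).warning("analytics unavailable, dropping event")

def handle_request(user_id: str) -> str:
    track_event({"type": "page_view", "user": user_id})
    return "ok"  # core functionality is unaffected by the vendor outage
```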
If you think this is bad, take a look at the uptime of complicated systems that are 100% self-hosted. Without a Fortune 500 level IT staff, you can't beat AWS's uptime.
Clearly these are non-trivial trade-offs, but I think using third parties is not an either-or question. Depending on the app and the type of third-party service, you may be able to make design choices that allow your systems to survive a third-party outage for a while.
E.g., a hospital could keep recent patient data on-site and sync it up with the central cloud service as and when that service becomes available. Not all systems need to be linked in real time. Sometimes it makes sense to create buffers.
But the downside is that syncing things asynchronously creates complexity that can itself be the cause of outages or, worse, data corruption.
I guess it's a decision that can only be made on a case by case basis.
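As a rough illustration of the buffering idea in this subthread, here's a minimal sketch of a local SQLite outbox that holds records until the central service is reachable again. The upload hook and table layout are placeholders, and a real system would need idempotency keys and conflict handling to avoid the corruption risk mentioned above:

```python
# Minimal sketch: write records to a local SQLite outbox first, then flush to
# the central cloud service whenever it's reachable.
import json
import sqlite3

db = sqlite3.connect("local_buffer.db")
db.execute("CREATE TABLE IF NOT EXISTS outbox (id INTEGER PRIMARY KEY, payload TEXT)")

def record_locally(record: dict) -> None:
    db.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(record),))
    db.commit()

def flush_to_cloud(upload) -> None:
    """Call with an `upload(dict) -> bool` hook; rows are removed only on success."""
    rows = db.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
    for row_id, payload in rows:
        if not upload(json.loads(payload)):
            break  # cloud still down; try again on the next sync pass
        db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
        db.commit()
```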
Not necessarily in our critical path, but today CircleCI was affected greatly, which also affected our capacity to deploy. Luckily it was a Monday morning, so we didn't even have to deploy a hotfix.
Good luck naming a large company, bank, even utility that doesn't have some kind of dependency like this somewhere, even if they have mostly on-prem services.
The only ones I can really think of are the cloud providers themselves. I was at Microsoft, and absolutely everything was in-house (often to our detriment).
I think you missed the "critical path" part. Why would your product stop functioning if your admins can't log in with IAM / VPN in, do you really need hands-on maintenance constantly? Why would your product stop functioning if Office is down, are you managing your ops in Excel or something?
"Some kind of dependency" is fine and unavoidable, but well-architected systems don't have hard downtime just because someone somewhere you have no control over fucked up.
Since 2020, for some reason, a lot of companies have a fully remote workforce. If the VPN or auth goes down and workers can't log in, that's a problem. Think banks, call center work, customer service.
Glad that you're taking the first step toward resiliency. At times, big outages like these are necessary to give a good reason why the company should go multi-cloud. When things are working without problems, no one cares to listen to the squeaky wheel.
I would be interested in a follow up in 2-3 years as to whether you've had fewer issues with a multi-cloud setup than just AWS. My suspicion is that will not be the case.
Still no serverless inference for models or inference pipelines that aren't available on Bedrock, still no auto-scaling GPU workers. We started bothering them in 2022... crickets.