Seems like major issues are still ongoing. If anything it seems worse than it did ~4 hours ago. For reference I'm a data engineer and it's Redshift and Airflow (AWS managed) that is FUBAR for me.


It has been quite a while, wondering how many 9s are dropped.

365 days * 24 * 0.0001 is roughly 8 hours, so it has already lost its 99.99% status.


9s don’t have to drop if you increase the time period! “We still guarantee the same 9s just over 3450 years now”.


At a company where I worked, the tool measuring downtime was on the same server, so even when the server was down it still showed 100% uptime.

If the server didn't work, the tool to measure it didn't work either! Genius


This happened to AWS too.

February 28, 2017. S3 went down and took down a good portion of AWS and the Internet in general. For almost the entire time that it was down, the AWS status page showed green because the up/down metrics were hosted on... you guessed it... S3.

https://aws.amazon.com/message/41926/



Five times is no longer a couple. You can use stronger words there.


It happened a murder of times.


Ha! Shall I bookmark this for the eventual wiki page?


https://www.youtube.com/watch?v=HxP4wi4DhA0

Maybe they should start using real software instead of mathematicians' toy langs


Have we ever figured out what “red” means? I understand they’ve only ever gone to yellow.


If it goes red, we aren't alive to see it


I'm sure we need to go to Blackwatch Plaid first.



Published in the same week of October ...9 years ago ...Spooky...


I used to work at a company where the SLA was measured as the percentage of successful requests on the server. If the load balancer (or DNS or anything else network) was dropping everything on the floor, you'd have no 500s and 100% SLA compliance.


Similar to hosting your support ticketing system on the same infra. "What problem? Nobody's complaining"


I’ve been a customer of at least four separate products where this was true.

I can’t explain why Saucelabs was the most grating one, but it was. I think it’s because they routinely experienced 100% down for 1% of customers, and we were in that one percent about twice a year. <long string of swears omitted>


I spent enough time ~15 years back to find an external monitoring service that did not run on AWS and looked like a sustainable business instead of a VC-fueled acquisition target - for our belts-n-braces secondary monitoring tool, since it's not smart to trust CloudWatch to be able to send notifications when it's AWS's shit that's down.

Sadly, while I still use that tool a couple of jobs/companies later, I no longer recommend it because it migrated to AWS a few years back.

(For now, my out-of-AWS monitoring tool is a bunch of cron jobs running on a collection of inexpensive VPSes and my and other devs' home machines.)
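For the curious, a minimal sketch of the kind of check each cron entry runs - the URL and the "notify" command are placeholders, not my actual setup:

    # check_up.py - run from cron on a non-AWS box, e.g. */5 * * * *
    # URL and notify command below are placeholders.
    import subprocess
    import urllib.request

    URL = "https://example.com/healthz"  # hypothetical health endpoint
    TIMEOUT = 10  # seconds

    def is_up(url):
        try:
            with urllib.request.urlopen(url, timeout=TIMEOUT) as resp:
                return resp.status == 200
        except Exception:
            return False

    if not is_up(URL):
        # "notify" stands in for whatever out-of-band alert channel you trust
        # (SMS gateway, SMTP on the same box, a phone-call API).
        subprocess.run(["./notify", f"DOWN: {URL}"], check=False)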


Nagios is still a thing and you can host it wherever you like.


Interestingly, the reason I originally looked for and started using it was an unapproved "shadow IT" response to an in-house Nagios setup that was configured and managed so badly it had _way_ more downtime than any of the services I'd get shouted at about if customers noticed them down before we did...

(No disrespect to Nagios, I'm sure a competently managed installation is capable of being way better than what I had to put up with.)


If it's not on the dashboard, it didn't happen


Common SLA windows are hour, day, week, month, quarter, and year. They're out of SLA for all of those now.

When your SLA only holds within a joke of an SLA window, you know you goofed.

"Five nines, but you didn't say which nines. 89.9999...", etc.


These are typically calculated system-wide, so if you include all regions, technically only a fraction of customers are impacted.


Customers in all regions were affected…


Indirectly yes but not directly.

Our only impact was some Atlassian tools.


I shoot for 9 fives of availability.


5555.55555% Really stupendous availableness!!!


I see what you did there, mister :P


I prefer shooting for eight eights.


You mean nine fives.


You added a zero. There are ~8760 hours per year, so 8 hours is ~1 in 1000, 99.9%.


An outage like this does not happen every year. The last big outage happened in December 2021, roughly 3 years 10 months = 46 months ago.

The duration of the outage in relation to that uptime is (8 h / 33602 h) * 100% = 0.024%, so the uptime is 99.976%, slightly worse than 99.99%, but clearly better than 99.90%.

They used to be five nines, and people used to say that it wasn't worth their while to prepare for an outage. With less than four nines, the perception might shift, but likely not enough to induce a mass migration to outage-resistant designs.
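Same arithmetic as a quick sanity check (assuming ~730.5 hours per month and a single 8-hour outage in the window):

    # Uptime since the December 2021 outage, assuming one 8-hour outage
    # in roughly 46 months of ~730.5 hours each.
    hours_in_window = 46 * 730.5      # ~33,603 h
    outage_hours = 8
    uptime = 1 - outage_hours / hours_in_window
    print(f"{uptime:.4%}")            # ~99.9762% - between three and four nines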


Won’t the end result be people keeping more servers warm in other AWS regions which means Amazon profits from their own fuckups?


There was a pretty big outage in 2023


Oh you are right!


I'm sure they'll find some way to weasel out of this.


For DynamoDB, I'm not sure, but I think it's covered. https://aws.amazon.com/dynamodb/sla/. "An "Error" is any Request that returns a 500 or 503 error code, as described in DynamoDB". There were tons of 5XX errors. In addition, this calculation uses the percentage of successful requests, so even partial degradation counts against the SLA.

From reading the EC2 SLA I don't think this is covered. https://aws.amazon.com/compute/sla/

The reason is the SLA says "For the Instance-Level SLA, your Single EC2 Instance has no external connectivity.". Instances that were already created kept working, so this isn't covered. The SLA doesn't cover creation of new instances.
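For what it's worth, this is roughly how a success-rate SLA gets evaluated - the request counts below are made up, and the actual credit tiers are whatever the SLA terms say, not numbers from AWS:

    # Availability as the fraction of requests that did not return 5xx.
    # Request counts are hypothetical; credit tiers belong in your contract.
    def monthly_availability(total_requests, error_responses):
        return 1 - error_responses / total_requests

    avail = monthly_availability(total_requests=1_000_000_000,
                                 error_responses=3_000_000)
    print(f"{avail:.4%}")  # 99.7000% - partial degradation still drags this down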


It's not down time, it's degradation. No outage, just degradation of a fraction[0] of the resources.

[0] Fraction is ~ 1


This 100% seems to be what they're saying. I have not been able to get a single Airflow task to run in the last 7 hours. The ability to query Redshift only recently came back. Despite this, all their messaging is that the downtime was limited to some brief period early this morning and that things have been "coming back online". Total lie, it's been completely down for the entire business day here on the east coast.


We continue to see early signs of progress!


It doesn't count. It's not downtime, it's an unscheduled maintenance event.


Check the terms of your contract. The public terms often only offer partial service credit refunds, if you ask for it, via a support request.


If you aren’t making $10 for every dollar you pay Amazon you need to look at your business model.

The refund they give you isn’t going to dent lost revenue.


Where were you guys the other day when someone was calling me crazy for trying to make this same sort of argument?


I haven't done any RFP responses for a while but this question always used to make me furious. Our competitors (some of whom had had major incidents in the past) claimed 99.99% availability or more, knowing they would never have to prove it, and knowing they were actually 100% until the day they weren't.

We were more honest, and it probably cost us at least once in not getting business.


An SLA is a commitment, and an RFP is a business document, not a technical one. As an MSP, you don’t think in terms of “what’s our performance”, you think of “what’s the business value”.

If you as a customer ask for 5 9s per month, with service credit of 10% of at-risk fees for missing on a deal where my GM is 30%, I can just amortise that cost and bake it into my fee.


it's a matter of perspective... 9.9999% is real easy


Only if you remember to spend your unavailability budget


It's a single region?

I don't think anyone would quote availability as availability across every region they're in?

While this is their most important region, there are a lot of clients that are probably unaffected if they're not in us-east-1.

They COULD be affected even if they don't have anything there, because of the AWS services relying on it. I'm just saying that most customers that are multi-region should have taken their east region out of rotation and be humming along.


It’s THE region. All of AWS operates out of it. All other regions bow before it. Even the government is there.


"The Cloud" is just a computer that you don't own that's located in Reston, VA.


Facts.


The Rot Starts at the Head.


AWS GovCloud East is actually located in Ohio IIRC. Haven't had any issues with GovCloud West today; I'm pretty sure they're logically separated from the commercial cloud.


> All of AWS operates out of it.

I don't think this is true anymore. In the early days, bad enough outages in us-east-1 would bring down everything because some metadata / control plane stuff was there; I remember getting affected while in other regions, but it's been many years since that has happened.

Today, for example, no issues. I just avoid us-east-1 and everyone else should too. It's their worst region by far in terms of reliability because they launch all the new stuff there and are always messing it up.


A secondary problem is that a lot of the internal tools are still on US East, so likely the response work is also being impacted by the outage. Been a while since there was a true Sev1 LSE (Large Scale Event).


What the heck? Most internal tools were in Oregon when I worked in BT pre 2021.


The primary ticketing system was up and down apparently, so tcorp/SIM must still have critical components there.


tell me it isn't true while telling me there isn't an outage across AWS because us-east-1 is down...


I help run quite a big operation in a different region and had zero issues. And this has happened many times before.


If that were true, you’d be seeing the same issues we are in us-west-1 as well. Cheers.


Global services such as STS have regional endpoints, but is it really that common to hit a specific endpoint rather than use the default?


The regions are independent, so you measure availability for each on its own.


Except if they aren't quite as independent as people thought


Well that’s the default pattern anyway. When I worked in cloud there were always some services that needed cross-regional dependencies for some reason or other and this was always supposed to be called out as extra risk, and usually was. But as things change in a complex system, it’s possible for long-held assumptions about independence to change and cause subtle circular dependencies that are hard to break out of. Elsewhere in this thread I saw someone mentioning being migrated to auth that had global dependencies against their will, and I groaned knowingly. Sometimes management does not accept “this is delicate and we need to think carefully” in the midst of a mandate.

I do not envy anyone working on this problem today.


But it is only a partial outage, so it doesn't count. If you retry a million times, everything still works /s


I'm wondering why your and other companies haven't just evicted themselves from us-east-1. It's the worst region for outages and it's not even close.

Our company decided years ago to use any region other than us-east-1.

Of course, that doesn't help with services that are 'global', which usually means us-east-1.


Several reasons, really:

1. The main one: it's the cheapest region, so when people select where to run their services they pick it because "why pay more?"

2. It's the default. Many tutorials and articles online show it in the examples, many deployment and other devops tools use it as a default value.

3. Related to n.2. AI models generate cloud configs and code examples with it unless asked otherwise.

4. Its location makes it Europe-friendly, too. If you have a small service and you'd like to capture a European and North American audience from a single location, us-east-1 is a very good choice.

5. Many Amazon features are available in that region first and then spread out to other locations.

6. It's also a region where other cloud providers and hosting companies offer their services. Often there's space available in a data center not far from AWS-running racks. In hybrid cloud scenarios where you want to connect bits of your infrastructure running on AWS and on some physical hardware by a set of dedicated fiber optic lines us-east-1 is the place to do it.

7. Yes, for AWS deployments it's an experimental location that has higher risks of downtime compared to other regions, but in practice when a sizable part of us-east-1 is down other AWS services across the world tend to go down, too (along with half of the internet). So, is it really that risky to run over there, relatively speaking?

It's the world's default hosting location, and today's outages show it.


> it's the cheapest region

In every SKU I've ever looked at / priced out, all of the AWS NA regions have ~equal pricing. What's cheaper specifically in us-east-1?

> Europe-friendly

Why not us-east-2?

> Many Amazon features are available in that region first and then spread out to other locations.

Well, yeah, that's why it breaks. Using not-us-east-1 is like using an LTS OS release: you don't get the newest hotness, but it's much more stable as a "build it and leave it alone" target.

> It's also a region where other cloud providers and hosting companies offer their services. Often there's space available in a data center not far from AWS-running racks.

This is a better argument, but in practice, it's very niche — 2-5ms of speed-of-light delay doesn't matter to anyone but HFT folks; anyone else can be in a DC one state away with a pre-arranged tier1-bypassing direct interconnect, and do fine. (This is why OVH is listed on https://www.cloudinfrastructuremap.com/ despite being a smaller provider: their DCs have such interconnects.)

For that matter, if you want "low-latency to North America and Europe, and high-throughput lowish-latency peering to many other providers" — why not Montreal [ca-central-1]? Quebec might sound "too far north", but from the fiber-path perspective of anywhere else in NA or Europe, it's essentially interchangeable with Virginia.


Lots of stuff is priced differently.

Just go to the EC2 pricing page and change from us-east-1 to us-west-1

https://aws.amazon.com/ec2/pricing/on-demand/


us-west-1 is the one outlier. us-east-1, us-east-2, and us-west-2 are all priced the same.


There are many other AWS regions than the ones you listed, and many different prices.


This seems like a flaw Amazon needs to fix.

Incentivize the best behaviors.

Or is there a perspective I don't see?


How is it a flaw!? Building datacenters in different regions comes with very different costs, and different costs to run. Power doesn't cost exactly the same in different regions. Local construction services are not priced exactly the same everywhere. Insurance, staff salaries, etc, etc... it all adds up, and it's not the same everywhere. It only makes sense that it would cost different amounts to run the services in different regions. Not sure how you're missing these easy-to-realize facts of life.


I think the cost of a day like Monday due to over relying on a single location outweighs that


What happened on Monday has nothing to do with why services cost different prices in different regions.


No, but it does reflect the dangers of incentivizing everyone to use a single region.

Most people (myself included) only choose it because it's the cheapest. If multiple regions were the same price then there'd be less impact if one goes down.


The problems with us-east-1 have been apparent for a long time, many years. When I started using us-east-1 long ago and saw the problems there, I moved everything to us-west-1 and stopped having those problems. EC2 instances were completely unreliable in us-east-1 (we were running hundreds to thousands at a time), not so in us-west-1. The error rates we were seeing in us-east-1 were awful.

A negligible cost difference shouldn't matter when your apps are unstable due to the region being problematic.


> A negligible cost difference shouldn't matter when your apps are unstable due to the region being problematic.

agreed, but a sizable cohort of people don't have the foresight or incentives to think past their nose, and just click the cheapest option.

So it's on Amazon to incentivize what's best.


People's lack of curiosity, enough to not even explore the other options, is not Amazon's problem.


> 5. Many Amazon features are available in that region first and then spread out to other locations.

This is the biggest one, isn't it? I thought Route 53 isn't even available in any other region.


Some AWS services are only available in us-east-1. Also a lot of people have not built their infra to be portable and the occasional outage isn't worth the cost and effort of moving out.


> the occasional outage isn't worth the cost and effort of moving out.

And looked at from the perspective of an individual company, as a customer of AWS, the occasional outage is usually an acceptable part of doing business.

However, today we’ve seen a failure that has wiped out a huge number of companies used by hundreds of millions - maybe billions - of people, and obviously a huge number of companies globally all at the same time. AWS has something like 30% of the infra market so you can imagine, and most people reading this will to some extent have experienced, the scale of disruption.

And the reality is that whilst bigger companies, like Zoom, are getting a lot of the attention here, we have no idea what other critical and/or life and death services might have been impacted. As an example that many of us would be familiar with, how many houses have been successfully burgled today because Ring has been down for around 8 out of the last 15 hours (at least as I measure it)?

I don’t think that’s OK, and I question the wisdom of companies choosing AWS as their default infra and hosting provider. It simply doesn’t seem to be very responsible to be in the same pond as so many others.

Were I a legislator I would now be casting a somewhat baleful eye at AWS as a potentially dangerous monopoly, and see what I might be able to do to force organisations to choose from amongst a much larger pool of potential infra providers and platforms, and I would be doing that because these kinds of incidents will only become more serious as time goes on.


You're suffering from survivorship bias. You know that old adage about the bullet holes in the planes, where someone pointed out that you should reinforce the parts without bullet holes, because those are the planes that came back.

It's the same thing here. Do you think other providers are better? If people moved to other providers, things would still go down, more likely than not it would be more downtime in aggregate, just spread out so you wouldn't notice as much.

At least this way, everyone knows why it's down, our industry has developed best practices for dealing with these kinds of outages, and AWS can apply their expertise to keeping all their customers running as long as possible.


> If people moved to other providers, things would still go down, more likely than not it would be more downtime in aggregate, just spread out so you wouldn't notice as much.

That is the point, though: Correlated outages are worse than uncorrelated outages. If one payment provider has an outage, chose another card or another store and you can still buy your goods. If all are down, no one can shop anything[1]. If a small region has a power blackout, all surrounding regions can provide emergency support. If the whole country has a blackout, all emergency responders are bound locally.

[1] Except with cash – might be worth keeping a stash handy for such purposes.


Yeah, exactly this. I don’t know why the person who responded to me is talking about survivorship bias… and I suppose I don’t really care because there’s a bigger point.

The internet was originally intended to be decentralised. That decentralisation begets resilience.

That’s exactly the opposite of what we saw with this outage. AWS has give or take 30% of the infra market, including many nationally or globally well known companies… which meant the outage caused huge global disruption of services that many, many people and organisations use on a day to day basis.

Choosing AWS, squinted at through a somewhat particular pair of operational and financial spectacles, can often make sense. Certainly it’s a default cloud option in many orgs, and always in contention to be considered by everyone else.

But my contention is that at a higher level than individual orgs - at a societal level - that does not make sense. And it’s just not OK for government and business to be disrupted on a global scale because one provider had a problem. Hence my comment on legislators.

It is super weird to me that, apparently, that’s an unorthodox and unreasonable viewpoint.

But you’ve described it very elegantly: 99.99% (or pick the number of 9s you want) uptime with uncorrelated outages is way better than that same uptime with correlated, and particularly heavily correlated, outages.


That’s a pretty bold claim. Where’s your data to back it up?

More importantly you appear to have misunderstood the scenario I’m trying to avoid, which is the precise situation we’ve seen in the past 24 hours where a very large proportion of internet services go down all at the same time precisely because they’re all using the same provider.

And then finally the usual outcome of increased competition is to improve the quality of products and services.

I am very aware of the WWII bomber story, because it’s very heavily cited in corporate circles nowadays, but I don’t see that it has anything to do with what I was talking about.

AWS is chosen because it’s an acceptable default that’s unlikely to be heavily challenged either by corporate leadership or by those on the production side because it’s good CV fodder. It’s the “nobody gets fired for buying IBM” of the early mid-21st century. That doesn’t make it the best choice though: just the easiest.

And viewed at a level above the individual organisation - or, perhaps from the view of users who were faced with failures across multiple or many products and services from diverse companies and organisations - as with today (yesterday!) we can see it’s not the best choice.


This is an assumption.

Reality is, though, that you shouldn't put all your eggs in the same basket. And it was indeed the case before the cloud. One service going down would have never had this cascade effect.

I am not even saying "build your own DC", but we barely have resiliency if we all rely on the same DC. That's just dumb.


From the standpoint of nearly every individual company, it's still better to go with a well-known high-9s service like AWS than smaller competitors though. The fact that it means your outages will happen at the same time as many others is almost like a bonus to that decision — your customers probably won't fault you for an outage if everyone else is down too.

That homogeneity is a systemic risk that we all bear, of course. It feels like systemic risks often arise that way, as an emergent result from many individual decisions each choosing a path that truly is in their own best interests.


Yeah, but this is exactly not what the internet is supposed to be. It’s supposed to be decentralised. It’s supposed to be resilient.

And at this point I’m looking at the problem and thinking, “how do we do that other than by legislating?”

Because left to their own devices a concerningly large number of people across many, many organisations simply follow the herd.

In the midst of a degrading global security situation I would have thought it would be obvious why that’s a bad idea.


Services like SES Inbound are only available in 2x US regions. AWS isn't great about making all services available in all regions :/


We're on Azure and they are worse in every aspect: bad deployment of services, and status pages that are more about PR than engineering.

At this point, is there any cloud provider that doesn't have these problems? (GCP is a non-starter because a false-positive YouTube TOS violation get you locked out of GCP[1]).

[1]: https://9to5google.com/2021/02/26/stadia-port-of-terraria-ca...


Don't worry there was a global GCP outage a few months ago


Global auth is and has been a terrible idea.


[flagged]


That’s an incredibly long comment that does nothing to explain why a YouTube ToS violation should lead to someone’s GCP services being cut off.

Also, Steve Jobs already wrote your comment better. You should have just stolen it. “You’re holding it wrong”.


[flagged]


Are you warned about the risks in an active war zone? Yes.

Does Google warn you about this when you sign up? No.

And PayPal having the same problem in no way identifies Google. It just means that PayPal has the same problem and they are also incompetent (and they also demonstrate their incompetence in many other ways).


s/in no way identifies Google/in no way indemnifies Google/

Sorry


> Sorry

No, thank you.


> It just means that PayPal has the same problem and they are also incompetent

Do you consider regular brick-and-mortar savings banks to be incompetent when they freeze someone's personal account for receiving business amounts of money into it? Because they all do, every last one. Because, again, they expect you to open a business account if you're going to do business; and they look at anything resembling "business transactions" happening in a personal account through the lens of fraud rather than the lens of "I just didn't realize I should open a business account."

And nobody thinks this is odd, or out-of-the-ordinary.

Do you consider municipal governments to be incompetent when they tell people that they have to get their single-family dwelling rezoned as mixed-use, before they can conduct business out of it? Or for assuming that anyone who is conducting business (having a constant stream of visitors at all hours) out of a residentially-zoned property, is likely engaging in some kind of illegal business (drug sales, prostitution, etc) rather than just being a cafe who didn't realize you can't run a cafe on residential zoning?

If so, I don't think many people would agree with you. (Most would argue that municipal governments suppress real, good businesses by not issuing the required rezoning permits, but that's a separate issue.)

There being an automatic level of hair-trigger suspicion against you on the part of powerful bureaucracies — unless and until you proactively provide those bureaucracies enough information about yourself and your activities for the bureaucracies to form a mental model of your motivations that makes your actions predictable to them — is just part of living in a society.

Heck, it's just a part of dealing with people who don't know you. Anthropologists suggest that the whole reason we developed greeting gestures like shaking hands (esp. the full version where you pull each-other in and use your other arms to pat one-another on the back) is to force both parties to prove to the other that they're not holding a readied weapon behind their backs.

---

> Are you warned about the risks in an active war one? Yes. Does Google warn you about this when you sign up? No.

As a neutral third party to a conflict, do you expect the parties in the conflict to warn you about the risks upon attempting to step into the war zone? Do you expect them to put up the equivalent of police tape saying "war zone past this point, do not cross"?

This is not what happens. There is no such tape. The first warning you get from the belligerents themselves of getting near either side's trenches in an active war zone, is running face-first into the guarded outpost/checkpoint put there to prevent flanking/supply-chain attacks. And at that point, you're already in the "having to talk yourself out of being shot" point in the flowchart.

It has always been the expectation that civilian settlements outside of the conflict zone will act of their own volition to inform you of the danger, and stop you from going anywhere near the front lines of the conflict. By word-of-mouth; by media reporting in newspapers and on the radio; by municipal governments putting up barriers preventing civilians from even heading down roads that would lead to the war zone. Heck, if a conflict just started "up the road", and you're going that way while everyone's headed back the other way, you'll almost always eventually be flagged to pull over by some kind stranger who realizes you might not know, and so wants to warn you that the only thing you'll get by going that way is shot.

---

Of course, this is all just a metaphor; the "war" between infrastructure companies and malicious actors is not the same kind of hot war with two legible "sides." (To be pedantic, it's more like the "war" between an incumbent state and a constant stream of unaffiliated domestic terrorists, such as happens during the ongoing only-partially-successful suppression of a populist revolution.)

But the metaphor holds: just like it's not a military's job to teach you that military forces will suspect that you're a spy if you approach a war zone in plainclothes; and just like it's not a bank's job to teach you that banks will suspect that you're a money launderer if you start regularly receiving $100k deposits into your personal account; and just like it's not a city government's job to teach you that they'll suspect you're running a bordello out of your home if you have people visiting your residentially-zoned property 24hrs a day... it's not Google's job to teach you that the world is full of people that try to abuse Internet infrastructure to illegal ends for profit; and that they'll suspect you're one of those people, if you just show up with your personal Google account and start doing some of the things those people do.

Rather, in all of these cases, it is the job of the people who teach you about life — parents, teachers, business mentors, etc — to explain to you the dangers of living in society. Knowing to not use your personal account for business, is as much a component of "web safety" as knowing to not give out details of your personal identity is. It's "Internet literacy", just like understanding that all news has some kind of bias due to its source is "media literacy."


You may not be aware of this, but Paypal is unregulated. They can, and have, overreached. This is very different from a bank who has regulations to follow, some of which protect the consumer from the whims of the bank.


I appreciate this long comment.

I am in the middle of convincing the company I just joined to consider building on GCP instead of AWS (at the very least, not to default to AWS).


If you can't figure out how to use a different Google account for YouTube from the GCP billing account, I don't know what to say. Google's in the wrong here, but spanner's good shit! (If you can afford it. and you actually need it. you probably don't.)


The problem isn't specifically getting locked out of GCP (though it is likely to happen for those out of the loop on what happened). It is that Google themselves can't figure out that a social media ban shouldn't affect your business continuity (and access to email or what-have-you).

It is an extremely fundamental level of incompetence at Google. One should "figure out" the viability of placing all of one's eggs in the basket of such an incompetent partner. They screwed the authentication issue up and, this is no slippery slope argument, that means they could be screwing other things up (such as being able to contact a human for support, which is what the Terraria developer also had issues with).


One of those still isn’t us-east-1 though and email isn’t latency-bound.


Except for OTP codes when doing 2fa in auth


100ms isn’t going to make a difference to email-based OTP.

Also, who’s using email-based OTP?


Same calculation everyone makes but that doesn’t stop them from whining about AWS being less than perfect.


We have discussions coming up to evict ourselves from AWS entirely. Didn't seem like there was much of an appetite for it before this but now things might have changed. We're still small enough of a company to where the task isn't as daunting as it might otherwise be.


So did a previous company I worked at. All our stuff was in west-2... then east-1 went down, and some global backend services that AWS depended on also went down and affected west-2.

I'm not sure a lot of companies are really looking at the costs of multi-region resiliency and hot failovers vs being down for 6 hours every year or so and writing that check.


Yep. Many, many companies are fine saying “we’re going to be no more available than AWS is.”


Customers are generally a lot more understanding if half the internet goes down at the same time as you.


Yes, and that's a major reason so many just use us-east-1.


Is there some reason why "global" services aren't replicated across regions?

I would think a lot of clients would want that.


> Is there some reason why "global" services aren't replicated across regions?

On AWS's side, I think us-east-1 is legacy infrastructure because it was the first region, and things have to be made replicable.

For others on AWS who aren't AWS themselves: because AWS outbound data transfer is exorbitantly expensive. I'm building on AWS, and AWS's outbound data transfer costs are a primary design consideration for potential distribution/replication of services.


It is absolutely crazy how much AWS charges for data. Internet access in general has become much cheaper, and Hetzner gives unlimited traffic. I don't recall AWS ever decreasing prices for outbound data transfer.


I think there's two reasons: one, it makes them gobs of money. Two, it discourages customers from building architectures which integrate non-AWS services, because you have to pay the data transfer tax. This locks everyone in.

And yes, AWS' rates are highway robbery. If you assume $1500/mo for a 10 Gbps port from a transit provider, you're looking at $0.0005/GB with a saturated link. At a 25% utilization factor, still only $0.002/GB. AWS is almost 50 times that. And I guarantee AWS gets a far better rate for transit than list price, so their profit margin must be through the roof.
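The back-of-envelope behind those per-GB numbers, assuming a 30-day month:

    # Cost per GB for a flat-rate 10 Gbps transit port at $1500/month.
    port_gbps = 10
    monthly_cost = 1500
    seconds_per_month = 30 * 24 * 3600          # 2,592,000 s
    max_gb = port_gbps / 8 * seconds_per_month  # ~3.24 million GB if saturated
    print(monthly_cost / max_gb)                # ~$0.00046/GB saturated
    print(monthly_cost / (max_gb * 0.25))       # ~$0.0019/GB at 25% utilization
    # vs. roughly $0.09/GB list price for AWS outbound transfer.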


> I think there's two reasons: one, it makes them gobs of money. Two, it discourages customers from building architectures which integrate non-AWS services, because you have to pay the data transfer tax. This locks everyone in.

Which makes sense, but even their rates for traffic between AWS regions are still exorbitant. $0.10/GB for transfer to the rest of the Internet somewhat discourages integration of non-Amazon services (though you can still easily integrate with any service where most of your bandwidth is inbound to AWS), but their rates for bandwidth between regions are still in the $0.01-0.02/GB range, which discourages replication and cross-region services.

If their inter-region bandwidth pricing was substantially lower, it'd be much easier to build replicated, highly available services atop AWS. As it is, the current pricing encourages keeping everything within a region, which works for some kinds of services but not others.


Even their transfer rates between AZs _in the same region_ are expensive, given they presumably own the fiber?

This aligns with their “you should be in multiple AZs” sales strategy, because self-hosted and third-party services can’t replicate data between AZs without expensive bandwidth costs, while their own managed services (ElastiCache, RDS, etc) can offer replication between zones for free.


Hetzner is "unlimited fair use" for 1Gbps dedicated servers, which means their average cost is low enough to not be worth metering, but if you saturate your 1Gbps for a month they will force you to move to metered. Also 10Gbps is always metered. Metered traffic is about $1.50 per TB outbound - 60 times cheaper than AWS - and completely free within one of their networks, including between different European DCs.

In general it seems like Europe has the most internet of anywhere - other places generally pay to connect to Europe, Europe doesn't pay to connect to them.


"Is there some reason why "global" services aren't replicated across regions?"

us-east-1 is so the government can slurp up all the data. /tin-foil hat


Data residency laws may be a factor in some global/regional architectures.


So provide a way to check/uncheck which zones you want replication to. Most people aren't going to need more than a couple of alternatives, and they'll know which ones will work for them legally.


My guess is that for IAM it has to do with consistency and security. You don't want regions disagreeing on what operations are authorized. I'm sure the data store could be distributed, but there might be some bad latency tradeoffs.

The other concerns could have to do with the impact of failover to the backup regions.


Regions disagree on what operations are authorized. :-) IAM uses eventual consistency. As it should...

"Changes that I make are not always immediately visible": - "...As a service that is accessed through computers in data centers around the world, IAM uses a distributed computing model called eventual consistency. Any changes that you make in IAM (or other AWS services), including attribute-based access control (ABAC) tags, take time to become visible from all possible endpoints. Some delay results from the time it takes to send data from server to server, replication zone to replication zone, and Region to Region. IAM also uses caching to improve performance, but in some cases this can add time. The change might not be visible until the previously cached data times out...

...You must design your global applications to account for these potential delays. Ensure that they work as expected, even when a change made in one location is not instantly visible at another. Such changes include creating or updating users, groups, roles, or policies. We recommend that you do not include such IAM changes in the critical, high availability code paths of your application. Instead, make IAM changes in a separate initialization or setup routine that you run less frequently. Also, be sure to verify that the changes have been propagated before production workflows depend on them..."

https://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoo...
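If you want to follow that "verify before depending on it" advice in code, here's a rough sketch using boto3's built-in IAM waiter. The role name and trust policy are placeholders, and the waiter only confirms GetRole succeeds from this endpoint, not that every region has seen the change yet:

    # Create an IAM role, then wait for it to be visible before using it.
    # Role name and trust policy are placeholders for illustration.
    import json
    import boto3

    iam = boto3.client("iam")

    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow",
                       "Principal": {"Service": "ec2.amazonaws.com"},
                       "Action": "sts:AssumeRole"}],
    }

    iam.create_role(RoleName="example-role",
                    AssumeRolePolicyDocument=json.dumps(trust_policy))

    # Built-in waiter polls GetRole until the role is visible to this endpoint;
    # cross-region propagation can still lag, so callers often retry on top.
    iam.get_waiter("role_exists").wait(RoleName="example-role")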


Global replication is hard, and if they weren't designed with that in mind it's probably a whole lot of work.


I thought part of the point of using AWS was that such things were pretty much turnkey?


Mostly AWS relies on each region being its own isolated copy of each service. It gets tricky when you have globalized services like IAM. AWS tries to keep those to a minimum.


One advantage to being in the biggest region: when it goes down the headlines all blame AWS, not you. Sure you’re down too, but absolutely everybody knows why and few think it’s your fault.


For us, we had some minor impacts but most stuff was stable. Our bigger issue was 3rd party SaaS also hosted on us-east-1 (Snowflake and CircleCI) which broke CI and our data pipeline


This was a major issue, but it wasn't a total failure of the region.

Our stuff is all in us-east-1, ops was a total shitshow today (mostly because many 3rd party services besides aws were down/slow), but our prod service was largely "ok", a total of <5% of customers were significantly impacted because existing instances got to keep running.

I think we got a bit lucky, but no actual SLAs were violated. I tagged the postmortem as Low impact despite the stress this caused internally.

We definitely learnt something here about both our software and our 3rd party dependencies.


cheapest + has the most capacity


You have to remember that health status dashboards at most (all?) cloud providers require VP approval to switch status. This stuff is not your startup's automated status dashboard. It's politics, contracts, money.


Which makes them a flat out lie since it ceases to be a dashboard if it’s not live. It’s just a status page.


Downdetector had 5,755 reports of AWS problems at 12:52 AM Pacific (3:52 AM Eastern).

That number had dropped to 1,190 by 4:22 AM Pacific (7:22 AM Eastern).

However, that number is back up with a vengeance: 9,230 reports as of 9:32 AM Pacific (12:32 PM Eastern).

Part of that could be explained by more people making reports as the U.S. west coast awoke. But I also have a feeling that they aren't yet on top of the problem.


Where do they source those reports from? Always wondered if it was just analysis of how many people are looking at the page, or if humans somewhere are actually submitting reports.


It turns out that a bunch of people checking if "XYZ is down" is a pretty good heuristic for it actually being down. It's pretty clever I think.


It's both. They count a hit from Google as a report of that site being down. They also count the actual reports people make.


So if my browser auto-completes their domain name and I accept that (causing me to navigate directly to their site and then I click AWS) it's not a report; but if my browser doesn't or I don't accept it (because I appended "AWS" after their site name) causing me to perform a Google search and then follow the result to the AWS page on their site, it's a report? That seems too arbitrary... they should just count the fact that I went to their AWS page regardless of how I got to it.


I don't know the exact details, but I know that hits to their website do count as reports, even if you don't click "report". I assume they weight it differently based on how you got there (direct might actually be more heavily weighted, at least it would be if I was in charge).


Down detector agrees: https://downdetector.com/status/amazon/

Amazon says service is now just "degraded" and recovering, but searching for products on Amazon.com still does not work for me. https://health.aws.amazon.com/health/status


Search, Seller Central, Amazon Advertising not working properly for me. Attempting to access from New York.

When this is fixed, I am very interested in seeing recorded spend for Sunday and Monday.


Amazon Ads is down indeed https://status.ads.amazon.com/


This looks like one of their worst outages in 15 years and us-east-1 still shows as degraded, but I had no outages, as I don't use us-east-1. Are you seeing issues in other regions?

https://health.aws.amazon.com/health/status?path=open-issues

The closest to their identification of a root cause seems to be this one:

"Oct 20 8:43 AM PDT We have narrowed down the source of the network connectivity issues that impacted AWS Services. The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers. We are throttling requests for new EC2 instance launches to aid recovery and actively working on mitigations."


I wonder how many people discovered their autoscaling settings went batshit when services went offline, either scaling way down or way up, or went metastable and started fishtailing.


Lambda create-function control plane operations are still failing with InternalError for us - other services have recovered (Lambda, SNS, SQS, EFS, EBS, and CloudFront). Cloud availability is the subject of my CS grad research, I wrote a quick post summarizing the event timeline and blast radius as I've observed it from testing in multiple AWS test accounts: https://www.linkedin.com/pulse/analyzing-aws-us-east-1-outag...


Definitely seems to be getting worse, outside of AWS itself, more websites seem to be having sporadic or serious issues. Concerning considering how long the outage has been going.


That's probably why Reddit has been down too


Dangerous curiosity ask: is the number of folks off for Diwali a factor or not?

I.e. lots of folks who weren't expected to work today, and/or having to round them up to work the problem.


Northern Virginia's Fairfax County public schools have the day off for Diwali, so that's not an unreasonable question.

In my experience, the teams at AWS are pretty diverse, reflecting the diversity in the area. Even if a lot of the Indian employees are taking the day off, there should be plenty of other employees to back them up. A culturally diverse employee base should mitigate against this sort of problem.

If it does turn out that the outage was prolonged due to one or two key engineers being unreachable for the holiday, that's an indictment of AWS for allowing these single points of failure to occur, not for hiring Indians.


It's even worse if it was caused by American engineers who weren't on holiday


Seems like a lot of people are missing that this post was made around midnight PST, and thus it would be more reasonable to ping people at lunch in IST before waking up people in EST or PST.


More info is claiming the problem started around 9:15 the previous day, but brewed for a while. But that’s still after breakfast in IST.


Sometimes I miss my phone buzzing when doing yard work. Diwali has to be worse for that.


Seeing as how this is us-east-1, probably not a lot.


I believe the implication is that a lot of critical AWS engineers are of Indian descent and are off celebrating today.


junon's implication may be that AWS engineers of Indian descent would tend to be located on the West Coast.


Northern Virginia has a very large Indian community.

All the schools in the area have days off for Indian Holidays since so many would be out of school otherwise.


This broke in the middle of the day IST did it not? Why would you start waking up people in VA if it’s 3 in the morning there if you don’t have to?


I bet you haven't gotten an email back from AWS support during twilight hours before.

There are 153k Amazon employees based in India according to LinkedIn.


Missing my point entirely.


Then I missed it too, because I let my Indian coworkers handle production issues after 9 or 10pm unless the problem sounds an awful lot like the feature toggle I flipped on in production is setting servers on fire.

My main beef with that team was that we worked on too many stories in parallel, so information on brand new work was siloed. Everyone caught up after a bit, but coverage of stuff we had just demoed or hadn't demoed yet was spotty.

If I was up at 1 am it was because I had insomnia and figured out exactly what the problem was, and it was faster to fix it than to explain. Or because I woke up really early and the problem still wasn't fixed.


Worst of all: the Ring alarm siren is unstoppable because the app is down and the keypad was removed by my parents and put "somewhere in the basement".


Is it hard wired? If so, and if the alarm module doesn’t have an internal battery, can you go to the breaker box and turn off the circuit it’s on? You should be able to switch off each breaker in turn until it stops if you don’t know which circuit it’s on.

If it doesn’t stop, that means it has a battery backup. But you can still make life more bearable. Switch off all your breakers (you probably have a master breaker for this), then open up the alarm box and either pull the battery or - if it’s non-removable - take the box off the wall, put it in a sealed container, and put the sealed container somewhere… else. Somewhere you can’t hear it or can barely hear it until the battery runs down.

Meanwhile you can turn the power back on but make sure you’ve taped the bare ends of the alarm power cable, or otherwise electrically insulated them, until you’re able to reinstall it.


I'll keep it in mind, thx. I was lucky to find the keypad in the "this is the place where we put electronic shit" in the basement.


Nice. Well, whatever, I’m glad you managed to stop it from driving you up the wall.


I have a Ring alarm. It has a battery backup and is powered by AC adaptor, so no need to turn off entire circuits (but no easy silence). All the sensors I have are wireless (not sure if they offer wired).

I would honestly do your box option. Stuff it in there with some pillows and leave it in the shed for a while.


Yeah, we’ve got a bunch of Ring stuff but not the interior alarm so I wasn’t sure how it worked. I suspected it might have a battery backup and, in that case, desperate times -> desperate measures.


Yeah. We had a brief window where everything resolved and worked and now we're running into really mysterious flakey networking issues where pods in our EKS clusters timeout talking to the k8s API.


Yeah, networking issues cleared up for a few hours but now seem to be as bad as before.


The problems now seem mostly related to starting new instances. Our capacity is slowly decaying as existing services spin down and new EC2 workloads fail to start.
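If you're retrying launches against a throttled control plane, a backoff-with-jitter loop along these lines is the usual shape - the AMI ID and instance type here are placeholders:

    # Retry instance launches with exponential backoff and jitter while the
    # control plane is throttling. AMI ID and instance type are placeholders.
    import random
    import time

    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def launch_with_backoff(max_attempts=8):
        for attempt in range(max_attempts):
            try:
                return ec2.run_instances(ImageId="ami-0123456789abcdef0",
                                         InstanceType="t3.micro",
                                         MinCount=1, MaxCount=1)
            except ClientError as err:
                code = err.response["Error"]["Code"]
                if code not in ("RequestLimitExceeded",
                                "InsufficientInstanceCapacity"):
                    raise
                time.sleep(min(300, 2 ** attempt + random.uniform(0, 1)))
        raise RuntimeError("gave up launching instance")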


Basic services at my worksite have been offline for almost 8 hours now (things were just glitchy for about 4 hours before that). This is nuts.


Have not gotten a data pipeline to run to success since 9AM this morning when there was a brief window of functioning systems. Been incredibly frustrating seeing AWS tell the press that things are "effectively back to normal". They absolutely are not! It's still a full outage as far as we are concerned.


Yep, confirmed worse - DynamoDB now returning "ServiceUnavailableException"


ServiceUnavailableException hello java :)


Here as well…


Agree… still seeing major issues. Briefly looked like it was getting better but things falling apart again.


I noticed the same thing and it seems to have gotten much worse around 8:55 a.m. Pacific Time.

By the way, Twilio is also down, so all those login SMS verification codes aren’t being delivered right now.


SEV-0 for my company this morning. We can't connect to RDS anymore.


Yeah, we were fine until about 10:30 Eastern and have been completely down since then. Heroku customer.


Andy Jassy is the Tim Cook of Amazon

Rest and vest CEOs


Don’t insult Tim Cook like that.

He got a lot of impossible shit done as COO.

They do need a more product minded person though. If Jobs was still around we’d have smart jewelry by now. And the Apple Watch would be thin af.


In addition to those, Sagemaker also fails for me with an internal auth error specifically in Virginia. Fun times. Hope they recover by tomorrow.


Agreed, every time the internal list of impacted services gets shorter, it starts growing again with the next update.

A lot of these are second-order dependencies like Astronomer, Atlassian, Confluent, Snowflake, Datadog, etc... the joys of using hosted solutions for everything.


Before my old company spun off, we didn’t know the old ops team had put on-prem production and our Atlassian instances in the same NAS.

When the NAS shit the bed, we lost half of production and all our run books. And we didn’t have autoscaling yet. Wouldn’t for another 2 years.

Our group is a bunch of people that has no problem getting angry and raising voices. The whole team was so volcanically angry that it got real quiet for several days. Like everyone knew if anyone unclenched that there would be assault charges.


The problem now is: what’s anyone going to do? Leave?

I remember a meme years ago about Nestle. It was something like: GO ON, BOYCOTT US - I BET YOU CAN’T - WE MAKE EVERYTHING.

Same meme would work for Aws today.


> Same meme would work for Aws today.

Not really, there are enough alternatives.


How many of them just run on AWS underneath, though?

And it’s not like there aren’t other brands of chocolate either…


It’s amazing how much you can avoid them by eating food that still looks like what it started as though. They own a lot of processed food.


First time I see "fubar". Is that a common expression in the industry? Just curious (English is not my native language)


It is an old US military term that means “F*ked Up Beyond All Recognition”


FUBAR being a bit worse than SNAFU: "situation normal: all fucked up" which is the usual state of us-east-1


My favorite is JANFU: Joint Army-Navy Fuck-Up.


But you probably have seen the standard example variable names "foo" and "bar" which (together at least) come from `fubar`


Which are in fact unrelated.


Unclear. ‘Foo’ has a life and origin of its own and is well attested in MIT culture going back to the 1930s for sure, but it seems pretty likely that its counterpart ‘bar’ appears in connection with it as a comical allusion to FUBAR.


Foobar == "Fucked up beyond all recognition "

Even the acronym is fucked.

My favorite by a large margin...


Interestingly, it was "Fouled Up Beyond All Recognition" when it first appeared in print back towards the end of World War 2.

https://en.wikipedia.org/wiki/List_of_military_slang_terms#F...

Not to be confused with "Foobar" which apparently originated at MIT: https://en.wikipedia.org/wiki/Foobar

TIL, an interesting footnote about "foo" there:

'During the United States v. Microsoft Corp. trial, evidence was presented that Microsoft had tried to use the Web Services Interoperability organization (WS-I) as a means to stifle competition, including e-mails in which top executives including Bill Gates and Steve Ballmer referred to the WS-I using the codename "foo".[13]'


What people would print and what soldiers would say in the 1940s were likely somewhat divergent.


100%


It used to be quite common but has fallen out of usage.


"FUBAR" comes up in the movie Saving Private Ryan. It's not a plot point, but it's used to illustrate the disconnect between one of the soldiers dragged from a rear position to the front line, and the combat veterans in his squad. If you haven't seen the movie, you should. The opening 20 minutes contains one of the most terrifying and intense combat sequences ever put to film.


Honestly not sure if this is a joke I'm not in on.

There are documented uses of FUBAR back into the '40s.


What do you mean? The movie storyline takes place in 44 at the Battle of Normandy.


I must've misread. I thought you said that it comes from the movie rather than comes up in the movie.


FUBAR: Fucked Up Beyond All Recognition

Somewhat common. Comes from the US military in WW2.


Yes, although it's military in origin.



