
One main problem we observed was that big parts of their IAM / auth setup were overloaded or down, which led to all kinds of cascading problems. It sounds as if Dynamo was reported to be a root cause, so is IAM dependent on Dynamo internally?

Of course, such a large control-plane system has all kinds of complex dependency chains. Auth/IAM seems like such a potential (global) SPOF that you'd like to reduce its dependencies to an absolute minimum. On the other hand, it's also the place that needs really good scalability, consistency, etc., so you'd probably like to use the battle-tested DB infrastructure you already have in place. Does that mean you end up with a complex cyclic dependency that needs complex bootstrapping when it goes down? Or how is that handled?


There was a very large outage back in ~2017 that was caused by DynamoDB going down. Because EC2 stored its list of servers in DynamoDB, EC2 went down too. Because DynamoDB ran its compute on EC2, it was suddenly no longer able to spin up new instances to recover.

It took several days to manually spin up DynamoDB/EC2 instances so that both services could recover slowly together. Since then, there has been a big push to remove dependencies between the “tier one” systems (S3, DynamoDB, EC2, etc.) so that one system couldn’t bring down another. Of course, it’s never foolproof.


I don't remember an event like that, but I'm rather certain the scenario you described couldn't have happened in 2017.

The very large 2017 AWS outage originated in S3. Maybe you're thinking of a different event?

https://share.google/HBaV4ZMpxPEpnDvU9


Sorry, the 2015 one. I misremembered the year.

https://aws.amazon.com/message/5467D2/

I imagine this was impossible in 2017 because of actions taken after the 2015 incident.


Definitely impossible in 2015.

If you're talking about this part:

> Initially, we were unable to add capacity to the metadata service because it was under such high load, preventing us from successfully making the requisite administrative requests.

It isn't about spinning up EC2 instances or provisioning hardware. It is about logically adding the capacity to the system. The metadata service is a storage service, so adding capacity necessitates data movement. There are a lot of things that need to happen to add capacity while maintaining data correctness and availability (mind you, at this point it was still trying to fulfill all requests).


I’m referring to the impact on other services.


When I worked at AWS several years ago, IAM was not dependent on Dynamo. It might have changed, but I highly doubt this. Maybe some kind of network issue with high-traffic services?

> Auth/IAM seems like such a potentially (global) SPOF that you'd like to reduce dependencies to an absolute minimum.

IAM is replicated, so each region has its own read-only IAM cache. AWS SigV4 is also designed to be regionalized; if you ever wondered why the signature key derivation has so many steps, that's exactly why ( https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_s... ).
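
For anyone curious, the derivation chain described in those docs looks roughly like this (a minimal Python sketch; the helper names and the example values in the comments are mine):

    import hashlib, hmac

    def _hmac_sha256(key, msg):
        return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

    def derive_signing_key(secret_key, date_stamp, region, service):
        # Each step narrows the key's scope: date -> region -> service -> "aws4_request".
        # A derived signing key is therefore only valid for that one day, region and
        # service, which is what keeps signing and verification regionalized.
        k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)  # e.g. "20251020"
        k_region = _hmac_sha256(k_date, region)                                   # e.g. "us-east-1"
        k_service = _hmac_sha256(k_region, service)                               # e.g. "dynamodb"
        return _hmac_sha256(k_service, "aws4_request")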


Many AWS customers have bad retry policies that will overload other systems as part of their retries. DynamoDB being down will cause them to overload IAM.


Which is interesting, because per their health dashboard:

> We recommend customers continue to retry any failed requests.


They should continue to retry but with exponential backoff and jitter. Not in a busy loop!
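
Something along the lines of this sketch, i.e. capped exponential backoff with full jitter (all constants here are made up):

    import random, time

    def call_with_retries(do_request, max_attempts=8, base=0.2, cap=20.0):
        # Sleep a random amount between 0 and min(cap, base * 2**attempt), so that
        # retries from many clients spread out instead of arriving in synchronized
        # waves that re-overload the recovering service.
        for attempt in range(max_attempts):
            try:
                return do_request()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))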


If the reliability of your system depends upon the competence of your customers then it isn't very reliable.


Have you ever built a service designed to operate at planetary scale? One that's built of hundreds or thousands of smaller service components?

There's no such thing as infinite scalability. Even the most elastic services are not infinitely elastic. When resources are short, you either have to rely on your customers to retry nicely, or you have to shed load during overload scenarios to protect goodput (which will deny service to some). For a high demand service, overload is most likely during the first few hours after recovery.

See e.g., https://d1.awsstatic.com/builderslibrary/pdfs/Resilience-les...


Probably a stupid question (I am not a network/infra engineer): can you not simply rate limit requests (by IP or some other method)?

Yes, your customers may well implement stupidly aggressive retries, but that shouldn't break your stuff; they should just start getting 429s?


Load shedding effectively does that. 503 is the correct error code here to indicate temporary failure; 429 means you've exhausted a quota.
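
A minimal sketch of what that looks like server-side (the in-flight limit and the response shape are illustrative): cap the number of requests in flight and fail fast with a 503 once the cap is hit, instead of letting a queue build up.

    import threading

    MAX_IN_FLIGHT = 200                      # illustrative limit
    _slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

    def handle(request, do_work):
        # Shed load instead of queueing: if no slot is free, answer immediately
        # with 503 so well-behaved clients back off, protecting goodput for the
        # requests we do accept.
        if not _slots.acquire(blocking=False):
            return 503, {"Retry-After": "1"}, b"overloaded, retry later"
        try:
            return 200, {}, do_work(request)
        finally:
            _slots.release()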


You can't exactly change existing widespread practice so that clients are ready for that kind of handling.


I think Amazon uses an internal platform called Dynamo as a KV store; it’s different from DynamoDB. So I’m thinking the outage could be either a DNS routing issue or some kind of node deployment problem.

Both of which seem to pop up in post-mortems for these widespread outages.


They said the root cause was DNS for DynamoDB. Inside AWS, relying on DynamoDB is highly encouraged, so it’s not surprising that a failure there would cascade broadly. The fact that EC2 instance launching is affected is surprising. Loops in the service dependency graph are known to be a bad idea.


It's not a direct dependency. Route 53 is humming along... DynamoDB decided to edit its DNS records that are propagated by Route 53... they were bogus, but Route 53 happily propagated the toxic change to the rest of the universe.

DynamoDB is not going to set up its own DNS service or its own Route 53.

Maybe DynamoDB should have had tooling that tested DNS edits before sending them to Route 53, or Route 53 should have tooling to validate changes before accepting them. I'm sure smart people at AWS are yelling at each other about it right now.
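
Speculating about what such tooling could look like: even a trivial pre-flight check on the generated record set would catch the empty-or-garbage class of edits before they ever reach the DNS control plane. A sketch (the function name and thresholds are made up):

    import ipaddress

    def validate_record_set(old_ips, new_ips, max_shrink=0.5):
        # Refuse obviously bogus edits: an empty record set, records that are not
        # valid addresses, or a change that drops most of the endpoints at once.
        if not new_ips:
            raise ValueError("refusing to publish an empty record set")
        for ip in new_ips:
            ipaddress.ip_address(ip)  # raises on garbage
        if old_ips and len(new_ips) < len(old_ips) * max_shrink:
            raise ValueError("record set shrank by more than half, needs manual review")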


Dynamo is, AFAIK, not used by core AWS services.


I find it very interesting that this is the same issue that took down GCP recently.


It's a bit funny that they say "most service operations are succeeding normally now" when, in fact, you cannot yet launch or terminate EC2 instances, which is basically the defining feature of the cloud...


In that region only; other regions are able to launch EC2 instances and ECS/EKS without a problem.


Is that material to a conversation about service uptime of existing resources, though? Are there customers out there that are churning through the full lifecycle of ephemeral EC2 instances as part of their day-to-day?


Any company of non-trivial scale will surely launch EC2 nodes during the day.

One of the main points of cloud computing is scaling up and down frequently.


We spend ~$20,000 per month on AWS for the product I work on. On an average day we do not launch an EC2 instance, and we do not do any dynamic scaling. However, there are many scenarios (especially during outages and such) in which it would be critical for us to be able to launch a new instance (and/or stop/start an existing one).


I understand scaling. I’m saying there is a difference in severity of several orders of magnitude between “the computers are down” and “we can’t add additional computers”.


The max-unused percentage feature is well worth it to 80/20 the prune process and only prune the data that is easiest to prune away (i.e. not try to squeeze small amounts of garbage out of big packs, but focus on packs that contain lots of garbage).

In general, there's an unavoidable trade-off between creating many small packs (harder on metadata throughout the system, both inside restic and on the backing store, but more efficient to prune) and creating big packs, which are easier on the metadata but can incur a big repack cost.

I guess a bit more intelligent repacking could avoid some of that cost by packing stuff together that might be more likely to get pruned together.
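
For reference, that's the --max-unused option on prune; if I remember the syntax right, something like

    restic prune --max-unused 10%

tells restic to tolerate up to 10% unused (garbage) data in the repository rather than repacking everything.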


Isn't he dismissing culture with his rant? Culture as the currently common way to do and describe things?

He seems to imply that there could be an easier, more straightforward way to describe things in some more common language, yet he doesn't give any evidence that the current ways are overly complex.

Of course, there is broken or outdated software, and some things were crap from the start. Of course, there are always concrete things to improve, but you won't get anywhere by dismissing all of it and starting anew.

Understanding the current state as part of our culture and our humanity, and improving gradually on it, has guided me well in the past.


That's implication, not equation.


Actually, the fact that the malware contained a root exploit is something that can be fixed sooner or later.

The next insight will be that even if the sandbox had worked, this type of attack would still be possible: it exploits the user's trust in the brand of a well-known app to turn the permissions granted to it to malicious ends. There's no easy way to avoid that up front automatically.


The downside is that a bunch of processes started from one TTY doesn't get as much CPU as before. It basically shifts the scheduling granularity a level higher, from processes to (interactive) sessions. Because that's the real question: at which level do we want fair scheduling? For a desktop user, processes have little meaning. Sessions, instead, are much more useful because they correspond better to the user's different tasks, across which they expect CPU power to be distributed fairly.
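
If this is the per-TTY group scheduling patch that later landed as autogroup, it can be toggled at runtime on kernels that ship it (assuming such a kernel):

    echo 1 > /proc/sys/kernel/sched_autogroup_enabled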


What's the reasoning for reproducing the complete text here? Even granting possibly valid reasons in this particular case, there should be high barriers to just copying someone else's text, especially given its personal nature.


Often, manual allocation is even slower.

http://lambda-the-ultimate.org/node/2552


> With only three times as much memory, it runs on average 17% slower than explicit memory management. However, with only twice as much memory, garbage collection degrades performance by nearly 70%. When physical memory is scarce, paging causes garbage collection to run an order of magnitude slower than explicit memory management.

I don't see how this article is helping your point; that's a pretty massive hit. Even worse, this is a GC-style program modified to use explicit malloc/free. Programming with manual memory management encourages a very different style of programming, where you try to use malloc/free as little as possible and try to keep memory for various things in contiguous chunks.

EDIT:

This comment is interesting: http://lambda-the-ultimate.org/node/2552#comment-38915


They may have had Appel's "Garbage Collection Can Be Faster Than Stack Allocation" (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.39.8...) in mind.

The technique described in that comment (forking processes with finite lifetimes, allocating but not freeing, then terminating and letting the OS clean up all at once) is also what Erlang does, and it seems to work pretty well in practice. Each process has its own arena for allocation purposes.


I think it's the discussion that we're meant to read, not just the OP. Most of the comments are pointing out questionable assumptions in the study (e.g. even though they admit that actually changing the program to use manual memory management would be nearly intractable, they assume that essentially running an AOT garbage collector is equivalent).


No no, he's an expert programmer that can do repetitive tasks better than a computer.


An interesting side note here is that the upcoming Unity will be based on compiz instead of relying on mutter. Canonical apparently hired the lead developer of compiz:

http://smspillaz.wordpress.com/2010/10/25/a-bright-new-futur...

