
One main problem we observed was that big parts of their IAM / auth setup were overloaded or down, which led to all kinds of cascading problems. It sounds as if Dynamo was reported to be a root cause, so is IAM dependent on Dynamo internally?

Of course, such a large control-plane system has all kinds of complex dependency chains. Auth/IAM seems like such a potential (global) SPOF that you'd like to reduce its dependencies to an absolute minimum. On the other hand, it's also the place that needs really good scalability, consistency, etc., so you'd probably like to use the battle-tested DB infrastructure you already have in place. Does that mean you end up with a complex cyclic dependency that needs complex bootstrapping when it goes down? Or how is that handled?


There was a very large outage back in ~2017 that was caused by DynamoDB going down. Because EC2 stored its list of servers in DynamoDB, EC2 went down too. Because DynamoDB ran its compute on EC2, it was suddenly no longer able to spin up new instances to recover.

It took several days to manually spin up DynamoDB/EC2 instances so that both services could recover slowly together. Since then, there has been a big push to remove dependencies between the “tier one” systems (S3, DynamoDB, EC2, etc.) so that one system couldn’t bring down another. Of course, it’s never foolproof.


I don't remember an event like that, but I'm rather certain the scenario you described couldn't have happened in 2017.

The very large 2017 AWS outage originated in S3. Maybe you're thinking of a different event?

https://share.google/HBaV4ZMpxPEpnDvU9


Sorry, the 2015 one. I misremembered the year.

https://aws.amazon.com/message/5467D2/

I imagine this was impossible in 2017 because of actions taken after the 2015 incident.


Definitely impossible in 2015.

If you're talking about this part:

> Initially, we were unable to add capacity to the metadata service because it was under such high load, preventing us from successfully making the requisite administrative requests.

It isn't about spinning up EC2 instances or provisioning hardware. It is about logically adding the capacity to the system. The metadata service is a storage service, so adding capacity necessitates data movement. There are a lot of things that need to happen to add capacity while maintaining data correctness and availability (mind you, at this point it was still trying to fulfill all requests).


I’m referring to the impact on other services.


When I worked at AWS several years ago, IAM was not dependent on Dynamo. It might have changed, but I highly doubt this. Maybe some kind of network issue with high-traffic services?

> Auth/IAM seems like such a potentially (global) SPOF that you'd like to reduce dependencies to an absolute minimum.

IAM is replicated, so each region has its own read-only IAM cache. AWS SigV4 is also designed to be regionalized; if you ever wondered why the signature key derivation has so many steps, that's exactly why ( https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_s... ).
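
For anyone curious, the derivation chain described in those docs looks roughly like this (a minimal Python sketch; the helper names and the example values in the comments are mine):

    import hashlib, hmac

    def _hmac_sha256(key, msg):
        return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

    def derive_signing_key(secret_key, date_stamp, region, service):
        # Each step narrows the key's scope: date -> region -> service -> "aws4_request".
        # A derived signing key is therefore only valid for that one day, region and
        # service, which is what keeps signing and verification regionalized.
        k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)  # e.g. "20251020"
        k_region = _hmac_sha256(k_date, region)                                   # e.g. "us-east-1"
        k_service = _hmac_sha256(k_region, service)                               # e.g. "dynamodb"
        return _hmac_sha256(k_service, "aws4_request")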


Many AWS customers have bad retry policies that will overload other systems as part of their retries. DynamoDB being down will cause them to overload IAM.


Which is interesting, because per their health dashboard:

> We recommend customers continue to retry any failed requests.


They should continue to retry but with exponential backoff and jitter. Not in a busy loop!
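
Something along the lines of this sketch, i.e. capped exponential backoff with full jitter (all constants here are made up):

    import random, time

    def call_with_retries(do_request, max_attempts=8, base=0.2, cap=20.0):
        # Sleep a random amount between 0 and min(cap, base * 2**attempt), so that
        # retries from many clients spread out instead of arriving in synchronized
        # waves that re-overload the recovering service.
        for attempt in range(max_attempts):
            try:
                return do_request()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))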


If the reliability of your system depends upon the competence of your customers then it isn't very reliable.


Have you ever built a service designed to operate at planetary scale? One that's built of hundreds or thousands of smaller service components?

There's no such thing as infinite scalability. Even the most elastic services are not infinitely elastic. When resources are short, you either have to rely on your customers to retry nicely, or you have to shed load during overload scenarios to protect goodput (which will deny service to some). For a high demand service, overload is most likely during the first few hours after recovery.

See e.g., https://d1.awsstatic.com/builderslibrary/pdfs/Resilience-les...


Probably a stupid question (I am not a network/infra engineer): can you not simply rate limit requests (by IP or some other method)?

Yes, your customers may well implement stupidly aggressive retries, but that shouldn't break your stuff; they should just start getting 429s?


Load shedding effectively does that. 503 is the correct error code here to indicate temporary failure; 429 means you've exhausted a quota.
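
A minimal sketch of what that looks like server-side (the in-flight limit and the response shape are illustrative): cap the number of requests in flight and fail fast with a 503 once the cap is hit, instead of letting a queue build up.

    import threading

    MAX_IN_FLIGHT = 200                      # illustrative limit
    _slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

    def handle(request, do_work):
        # Shed load instead of queueing: if no slot is free, answer immediately
        # with 503 so well-behaved clients back off, protecting goodput for the
        # requests we do accept.
        if not _slots.acquire(blocking=False):
            return 503, {"Retry-After": "1"}, b"overloaded, retry later"
        try:
            return 200, {}, do_work(request)
        finally:
            _slots.release()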


You can't exactly change existing widespread practice so that clients are ready for that kind of handling.


I think Amazon uses an internal platform called Dynamo as a KV store; it’s different from DynamoDB. So I’m thinking the outage could be either a DNS routing issue or some kind of node deployment problem.

Both of which seem to pop up in post-mortems for these widespread outages.


They said the root cause was DNS for DynamoDB. Inside AWS, relying on DynamoDB is highly encouraged, so it’s not surprising that a failure there would cascade broadly. The fact that EC2 instance launching is affected is surprising. Loops in the service dependency graph are known to be a bad idea.


It's not a direct dependency. Route 53 is humming along... DynamoDB decided to edit its DNS records that are propagated by Route 53... they were bogus, but Route 53 happily propagated the toxic change to the rest of the universe.

DynamoDB is not going to set up its own DNS service or its own Route 53.

Maybe DynamoDB should have had tooling that tested DNS edits before sending them to Route 53, or Route 53 should have tooling to validate changes before accepting them. I'm sure smart people at AWS are yelling at each other about it right now.
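
Speculating about what such tooling could look like: even a trivial pre-flight check on the generated record set would catch the empty-or-garbage class of edits before they ever reach the DNS control plane. A sketch (the function name and thresholds are made up):

    import ipaddress

    def validate_record_set(old_ips, new_ips, max_shrink=0.5):
        # Refuse obviously bogus edits: an empty record set, records that are not
        # valid addresses, or a change that drops most of the endpoints at once.
        if not new_ips:
            raise ValueError("refusing to publish an empty record set")
        for ip in new_ips:
            ipaddress.ip_address(ip)  # raises on garbage
        if old_ips and len(new_ips) < len(old_ips) * max_shrink:
            raise ValueError("record set shrank by more than half, needs manual review")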


Dynamo is, AFAIK, not used by core AWS services.


I find it very interesting that this is the same issue that took down GCP recently.


It's a bit funny that they say "most service operations are succeeding normally now" when, in fact, you cannot yet launch or terminate EC2 instances, which is basically the defining feature of the cloud...


In that region only; other regions are able to launch EC2 instances and ECS/EKS without a problem.


Is that material to a conversation about service uptime of existing resources, though? Are there customers out there that are churning through the full lifecycle of ephemeral EC2 instances as part of their day-to-day?


Any company of non-trivial scale will surely launch EC2 nodes during the day.

One of the main points of cloud computing is scaling up and down frequently.


We spend ~$20,000 per month on AWS for the product I work on. On an average day we do not launch an EC2 instance, and we do not do any dynamic scaling. However, there are many scenarios (especially during outages and such) in which it would be critical for us to be able to launch a new instance (and/or stop/start an existing one).


I understand scaling. I’m saying there is a difference in severity of several orders of magnitude between “the computers are down” and “we can’t add additional computers”.


The max-unused percentage feature is well worth it to 80/20 the prune process and only prune the data that is easiest to prune away (i.e. not try to squeeze small amounts of garbage out of big packs, but focus on packs that contain lots of garbage).

In general, there's an unavoidable trade-off between creating many small packs (harder on metadata throughout the system, both inside restic and on the backing store, but more efficient to prune) and creating big packs, which are easier on the metadata but can incur a big repack cost.

I guess a bit more intelligent repacking could avoid some of that cost by packing stuff together that might be more likely to get pruned together.
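
For reference, that's the --max-unused option on prune; if I remember the syntax right, something like

    restic prune --max-unused 10%

tells restic to tolerate up to 10% unused (garbage) data in the repository rather than repacking everything.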


Isn't he dismissing culture with his rant? Culture as the currently common way to do and describe things?

He seems to imply that there could be an easier, more straightforward way to describe things in some more common language, yet he doesn't give any evidence that the current ways are overly complex.

Of course, there is broken or outdated software, and some things were crap from the start. Of course, there are always concrete things to improve, but you won't get anywhere by dismissing all of it and starting anew.

Understanding the current state as part of our culture and our humanity, and improving gradually on it, has guided me well in the past.


That's implication, not equation.


Actually, the fact that the malware contained a root exploit is something that can be fixed sooner or later.

The next insight will be that even if the sandbox had worked, this type of attack would still be possible: it exploits the user's trust in the brand of a well-known app to turn the permissions granted to it to malicious ends. There's no easy way to avoid that up front automatically.


The downside is that a bunch of processes started from one TTY doesn't get as much CPU as before. It basically shifts the scheduling granularity a level higher, from processes to (interactive) sessions. Because that's the real question: at which level do we want fair scheduling? For a desktop user, processes have little meaning. Sessions, instead, are much more useful because they correspond better to the user's different tasks, across which they expect CPU power to be distributed fairly.
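
If this is the per-TTY group scheduling patch that later landed as autogroup, it can be toggled at runtime on kernels that ship it (assuming such a kernel):

    echo 1 > /proc/sys/kernel/sched_autogroup_enabled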


What's the reasoning for reproducing the complete text here? Even granting possibly valid reasons in this particular case, there should be high barriers to just copying someone else's text, especially given its personal nature.


Often, manual allocation is even slower.

http://lambda-the-ultimate.org/node/2552


> With only three times as much memory, it runs on average 17% slower than explicit memory management. However, with only twice as much memory, garbage collection degrades performance by nearly 70%. When physical memory is scarce, paging causes garbage collection to run an order of magnitude slower than explicit memory management.

I don't see how this article is helping your point; that's a pretty massive hit. Even worse, this is a GC-style program modified to use explicit malloc/free. Programming with manual memory management encourages a very different style of programming, where you try to use malloc/free as little as possible and try to keep memory for various things in contiguous chunks.

EDIT:

This comment is interesting: http://lambda-the-ultimate.org/node/2552#comment-38915


They may have had Appel's "Garbage Collection Can Be Faster Than Stack Allocation" (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.39.8...) in mind.

The technique described in that comment (forking processes with finite lifetimes, allocating but not freeing, then terminating and letting the OS clean up all at once) is also what Erlang does, and it seems to work pretty well in practice. Each process has its own arena for allocation purposes.


I think it's the discussion that we're meant to read, not just the OP. Most of the comments are pointing out questionable assumptions in the study (e.g. even though they admit that actually changing the program to use manual memory management would be nearly intractable, they assume that essentially running an AOT garbage collector is equivalent).


No no, he's an expert programmer that can do repetitive tasks better than a computer.


An interesting side note here is that the upcoming Unity will be based on compiz instead of relying on mutter. Canonical apparently hired the lead developer of compiz:

http://smspillaz.wordpress.com/2010/10/25/a-bright-new-futur...

