One main problem that we observed was that big parts of their IAM / auth setup were overloaded or down, which led to all kinds of cascading problems. It sounds as if Dynamo was reported to be a root cause, so is IAM dependent on Dynamo internally?
Of course, such a large control plane system has all kinds of complex dependency chains. Auth/IAM seems like such a potential (global) SPOF that you'd want to reduce its dependencies to an absolute minimum. On the other hand, it's also the place that needs really good scalability, consistency, etc., so you'd probably like to use the battle-proven DB infrastructure you already have in place. Does that mean you end up with a complex cyclic dependency that needs complex bootstrapping when it goes down? Or how is that handled?
There was a very large outage back in ~2017 that was caused by DynamoDB going down. Because EC2 stored its list of servers in DynamoDB, EC2 went down too. Because DynamoDB ran its compute on EC2, it was suddenly no longer able to spin up new instances to recover.
It took several days to manually spin up DynamoDB/EC2 instances so that both services could recover slowly together. Since then, there was a big push to remove dependencies between the “tier one” systems (S3, DynamoDB, EC2, etc.) so that one system couldn’t bring down another one. Of course, it’s never foolproof.
> Initially, we were unable to add capacity to the metadata service because it was under such high load, preventing us from successfully making the requisite administrative requests.
It isn't about spinning up EC2 instances or provisioning hardware; it's about logically adding capacity to the system. The metadata service is a storage service, so adding capacity necessitates data movement. There are a lot of things that need to happen to add capacity while maintaining data correctness and availability (mind you, at this point it was still trying to fulfill all requests).
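Just to illustrate what "adding capacity necessitates data movement" means, here's a toy sketch assuming simple hash partitioning (not how DynamoDB's metadata service actually works internally):

    # Toy model, not DynamoDB internals: with naive hash-mod-N placement,
    # growing a storage fleet from 10 to 11 nodes changes the owner of most
    # keys, and every ownership change means data has to move.
    import hashlib

    def owner(key: str, num_nodes: int) -> int:
        digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
        return digest % num_nodes

    keys = [f"item-{i}" for i in range(10_000)]
    moved = sum(owner(k, 10) != owner(k, 11) for k in keys)
    print(f"{moved / len(keys):.0%} of keys change owner when adding one node")
    # Consistent hashing shrinks that to roughly 1/(N+1) of the keys, but some
    # movement is unavoidable, and it has to happen while the service keeps
    # serving reads and writes correctly.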
When I worked at AWS several years ago, IAM was not dependent on Dynamo. It might have changed, but I highly doubt this. Maybe some kind of network issue with high-traffic services?
> Auth/IAM seems like such a potentially (global) SPOF that you'd like to reduce dependencies to an absolute minimum.
IAM is replicated, so each region has its own read-only IAM cache. AWS SigV4 is also designed to be regionalized; if you ever wondered why the signature key derivation has so many steps, that's exactly why ( https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_s... ).
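For anyone wondering what those "many steps" look like, the signing key is derived by chaining HMACs over date, region, and service. A sketch following the linked docs (not production code):

    # SigV4 signing-key derivation as described in the AWS docs: the key is
    # bound to a date, a region, and a service, which is what lets
    # verification stay regional instead of calling home to a global endpoint.
    import hashlib
    import hmac

    def hmac_sha256(key: bytes, msg: str) -> bytes:
        return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

    def derive_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
        k_date = hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date)  # e.g. "20251020"
        k_region = hmac_sha256(k_date, region)                             # e.g. "us-east-1"
        k_service = hmac_sha256(k_region, service)                         # e.g. "dynamodb"
        return hmac_sha256(k_service, "aws4_request")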
Many AWS customers have bad retry policies that will overload other systems as part of their retries. DynamoDB being down will cause them to overload IAM.
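For what it's worth, the well-behaved version isn't complicated: capped exponential backoff with jitter and a bounded number of attempts. A minimal sketch (names made up for illustration):

    # Hypothetical client-side sketch: capped exponential backoff with full
    # jitter and a small attempt budget, so callers don't hammer an already
    # overloaded dependency in lockstep.
    import random
    import time

    def call_with_retries(call, max_attempts=4, base=0.2, cap=10.0):
        for attempt in range(max_attempts):
            try:
                return call()
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # give up instead of retrying forever
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))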
Have you ever built a service designed to operate at planetary scale? One that's built of hundreds or thousands of smaller service components?
There's no such thing as infinite scalability. Even the most elastic services are not infinitely elastic. When resources are short, you either have to rely on your customers to retry nicely, or you have to shed load during overload scenarios to protect goodput (which will deny service to some). For a high demand service, overload is most likely during the first few hours after recovery.
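A toy sketch of what "shed load to protect goodput" can mean in practice: reject early once in-flight work exceeds what the service can actually finish, so the requests you do accept complete instead of everything timing out. Purely illustrative, not any particular AWS system:

    # Toy admission control: once too much work is in flight, fail fast with
    # a cheap rejection rather than queueing requests that would only time out.
    import threading

    class LoadShedder:
        def __init__(self, max_in_flight: int):
            self._slots = threading.BoundedSemaphore(max_in_flight)

        def handle(self, request, process):
            if not self._slots.acquire(blocking=False):
                return "503 shed"           # denied quickly: protects goodput
            try:
                return process(request)     # only admitted work consumes resources
            finally:
                self._slots.release()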
I think Amazon uses an internal platform called Dynamo as a KV store; it's different from DynamoDB. So I'm thinking the outage could be either a DNS routing issue or some kind of node deployment problem.
Both of which seem to crop up in post-mortems for these widespread outages.
They said the root cause was DNS for DynamoDB. Inside AWS, relying on DynamoDB is highly encouraged, so it's not surprising that a failure there would cascade broadly. The fact that EC2 instance launching is affected is surprising. Loops in the service dependency graph are known to be a bad idea.
It's not a direct dependency.
Route 53 is humming along... DynamoDB decided to edit its DNS records that are propagated by Route 53... they were bogus, but Route 53 happily propagated the toxic change to the rest of the universe.
DynamoDB is not going to set up its own DNS service or its own Route 53.
Maybe DynamoDB should have had tooling that tested DNS edits before sending them to Route 53, or Route 53 should have tooling to validate changes before accepting them. I'm sure smart people at AWS are yelling at each other about it right now.
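No idea what AWS's internal tooling looks like, but the kind of pre-apply sanity check people are imagining could be as simple as refusing a change set that empties out a critical name. Entirely hypothetical sketch:

    # Entirely hypothetical pre-apply check, not anything AWS actually runs:
    # refuse a DNS change set that would leave a critical name with no
    # address records at all.
    def validate_change(current: dict, proposed: dict, critical_names: set) -> None:
        for name in critical_names:
            if current.get(name) and not proposed.get(name):
                raise ValueError(f"change removes all records for {name}; refusing to apply")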
It's a bit funny that they say "most service operations are succeeding normally now" when, in fact, you cannot yet launch or terminate EC2 instances, which is basically the defining feature of the cloud...
Is that material to a conversation about service uptime of existing resources, though? Are there customers out there that are churning through the full lifecycle of ephemeral EC2 instances as part of their day-to-day?
We spend ~$20,000 per month on AWS for the product I work on. On an average day we do not launch a single EC2 instance, and we don't do any dynamic scaling. However, there are many scenarios (especially during outages and such) where it would be critical for us to be able to launch a new instance (and/or stop/start an existing instance).
I understand scaling. I’m saying there is a difference in severity of several orders of magnitude between “the computers are down” and “we can’t add additional computers”.
The max-unused percentage feature is well worth it to 80/20 the prune process and only prune the data which is easiest to prune away (i.e. not try to squeeze small amounts of garbage out of big packs, but focus on packs which contain lots of garbage).
In general, there's an unavoidable trade-off between creating many small packs (harder on metadata throughout the system, both inside restic and on the backing store, but more efficient to prune) and creating big packs, which are easier on the metadata but can incur a big repack cost.
I guess a bit more intelligent repacking could avoid some of that cost by packing stuff together that might be more likely to get pruned together.
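Roughly, the greedy idea behind the max-unused setting (a toy illustration, not restic's actual code): repack the garbage-heaviest packs first and stop as soon as the repo-wide unused space is under the limit.

    # Toy illustration of the 80/20 idea, not restic's implementation: repack
    # packs with the highest garbage fraction first, stop once the repo-wide
    # unused fraction drops below the configured limit.
    def pick_packs_to_repack(packs, max_unused_fraction):
        # packs: list of (pack_id, total_bytes, unused_bytes)
        total = sum(size for _, size, _ in packs)
        unused = sum(garbage for _, _, garbage in packs)
        to_repack = []
        for pack_id, size, garbage in sorted(packs, key=lambda p: p[2] / p[1], reverse=True):
            if unused <= max_unused_fraction * total:
                break                 # stopping early is exactly the 80/20 win
            to_repack.append(pack_id)
            unused -= garbage         # repacking drops this pack's garbage...
            total -= garbage          # ...and the repo shrinks by the same amount
        return to_repack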
Isn't he dismissing culture with his rant? Culture as the currently common way to do and describe things?
He seems to imply that there could be an easier, more straightforward way to describe things in some more common language, while not giving any evidence that the current ways are overly complex.
Of course, there is broken or outdated software, and some things were crap from the start. Of course, there are always concrete things to improve but you won't get anywhere by dismissing all of it and starting anew.
For me, understanding the current state as part of our culture and our humanity, and improving gradually on it, has guided me well in the past.
Actually, the fact that the malware contained a root exploit is something that can be fixed sooner or later.
The next insight will be that even if the sandbox had worked, this type of attack would still be possible: it abuses the user's trust in the brand of a well-known app to use the permissions granted to it for malicious intent. There's no easy way to avoid that up front automatically.
The downside is that a bunch of processes started from one TTY don't get as much CPU as before.
It basically shifts the scheduling granularity a level higher, from processes to (interactive) sessions. Because that's what the question is: at which level do we want fair scheduling? For a desktop user, processes have little meaning. Sessions, instead, are much more useful because they correspond better to the user's different tasks, across which they expect CPU power to be distributed in a fair way.
What's the reasoning for your reproduction of the complete text here?
Whatever valid reasons there may be in this particular case, there should be high barriers to just copying someone else's text, especially here, given its personal nature.
> With only three times as much memory, it runs on average 17% slower than explicit memory management. However, with only twice as much memory, garbage collection degrades performance by nearly 70%. When physical memory is scarce, paging causes garbage collection to run an order of magnitude slower than explicit memory management.
I don't see how this article is helping your point; that's a pretty massive hit. Even worse, this is a GC-style program modified to use explicit malloc/free. Programming with manual memory management encourages a very different style of programming, where you try to use malloc/free as little as possible and you try to keep memory for various things in contiguous chunks.
The technique described in that comment (forking processes with finite lifetimes, allocating but not freeing, and letting the OS clean everything up at once when they terminate) is also what Erlang does, and it seems to work pretty well in practice. Each process has its own arena, for allocation purposes.
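Not Erlang, obviously, but the same "process as arena" pattern is easy to sketch with short-lived worker processes in e.g. Python: allocate freely, never free explicitly, and let process exit hand everything back to the OS in one go.

    # Sketch of the finite-lifetime-process pattern: the worker allocates
    # whatever it needs, never cleans up, and the OS reclaims the whole
    # address space when the process exits.
    from multiprocessing import Process

    def handle_batch(batch):
        scratch = [item * 2 for item in batch]   # allocate freely, no frees
        print(sum(scratch))

    if __name__ == "__main__":
        worker = Process(target=handle_batch, args=(list(range(1_000_000)),))
        worker.start()
        worker.join()   # process exit == everything freed at once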
I think it's the discussion that we're meant to read, not just the OP. Most of the comments are pointing out questionable assumptions in the study (e.g. even though they admit that actually changing the program to use manual memory management would be nearly intractable, they assume that essentially running an AOT garbage collector is equivalent).
An interesting side note here is that the coming Unity will be based upon compiz, instead of relying on mutter. Canonical apparently hired the lead developer of compiz: