> What does it mean for a system to be distributed? There are two aspects:
> 1. They run on multiple servers. The number of servers in a cluster can vary from as few as three servers to a few thousand servers.
> 2. They manage data. So these are inherently 'stateful' systems.
It's a pity that they don't get to the crux of distributed systems, because it's been very well defined and described for ~40 years now. Instead they describe the key characteristics of a distributed system in a very hand-wavy manner.
The two fundamental ways in which distributed computing differs from single-server/machine computing are:
1. No shared memory.
2. No shared clock.
Almost every problem faced in distributed systems can be traced back to one of these two aspects.
Because there's no shared memory it's impossible for any one server to know the global state. And so you need consensus algorithms.
And due to the lack of a shared clock it's impossible to order events. To overcome this, a logical clock has to be overlaid in software on top of the distributed system, as in the sketch below.
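A minimal sketch of such an overlay, a Lamport clock (the class is my own illustration, not something from the article): tick before local events and sends, and on receipt fast-forward past the sender's timestamp. This yields a consistent happened-before ordering without any shared physical clock.

    import java.util.concurrent.atomic.AtomicLong;

    // Minimal Lamport logical clock: a counter that ticks on local events
    // and fast-forwards past timestamps seen on incoming messages.
    final class LamportClock {
        private final AtomicLong time = new AtomicLong();

        // Call before a local event or send; the result is the timestamp
        // to stamp on the outgoing message.
        long tick() {
            return time.incrementAndGet();
        }

        // Call on message receipt with the sender's timestamp.
        long onReceive(long senderTime) {
            return time.updateAndGet(local -> Math.max(local, senderTime) + 1);
        }
    }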
Added to this are the failure modes that are peculiar to distributed systems, be it transient/permanent link failures or transient/permanent server failures.
This[1] is a decent description of what I've just described here.
I also recommend reading up on some key impossibility results in distributed systems, the most famous being the impossibility of achieving common knowledge.
I'm surprised that someone as reputable as Thoughtworks doesn't describe the topic in more precise terms.

[1] https://www.geeksforgeeks.org/limitation-of-distributed-syst...
I'd rather say "no reliable message delivery". The only difference between completely reliable messaging and shared memory is performance.
> Because there's no shared memory it's impossible for any one server to know the global state.
Even _with_ shared memory it's impossible to _know_ the global state. Just after you've loaded some data from memory, it can be immediately changed by another thread.
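A small Java illustration of that point (a hypothetical counter, purely to make it concrete): even an atomic read is stale the instant it returns, which is exactly why compare-and-set retry loops exist.

    import java.util.concurrent.atomic.AtomicInteger;

    class StaleRead {
        static final AtomicInteger counter = new AtomicInteger();

        // Even with shared memory, the value you loaded may be changed by
        // another thread before you act on it; the CAS loop exists precisely
        // to detect that the "known" state has already moved on.
        static void increment() {
            int seen;
            do {
                seen = counter.get();                         // snapshot of "global" state
            } while (!counter.compareAndSet(seen, seen + 1)); // fails if it changed meanwhile
        }
    }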
I think the distinction he is trying to raise is that messages can be lost in distributed systems. Building distributed shared memory is possible but expensive (readers must write, writers must broadcast). That is why he is raising that distinction and I think it is a good one to raise.
It is not. "Reliable" means message delivery is guaranteed, i.e., the message _will_ eventually reach its intended destination.
> "Shared memory" implies data coherency.
It does not, re. CPUs with weak memory models. Yes, you can insert barriers, but what happens underneath is that HW exchanges a bunch of messages (reliably!) between nodes to bring them to a consistent view of the global state.
"No reliable message delivery" does not help define distributed systems, because that reliability can be broken down into more fundamental elements that are required to guarantee it.
Shared memory works well for me because in my mind shared memory means that concurrent processes and memory have a common clock and are connected to the same power supply. This avoids confusion around how one handles some parts running while other parts are not.
Since we are attempting to define distributed systems, not "concurrent processing", concurrent processing can be used in the definition and memory safety can be assumed. We can say that shared memory represents a known global state between concurrent processes, because that problem is solved by the common clock and the tools that concurrent processes use to synchronize.
> in my mind shared memory means concurrent processes and memory have a common clock
You're aware that modern CPUs clock their cores independently? (E.g., AVX512 throttling.) Hence, no common clock.
> and are connected to the same power supply
That's irrelevant. Unused cores get powered down all the time. On x86, the OS can execute the HLT instruction when there's nothing to do, and the CPU gets woken up by an interrupt.
> This avoids confusion around how one handles some parts running while other parts are not.
What is a "part"? The system's scheduler can suspend a thread for an indefinite amount of time.
> memory safety can be assumed
What does that even mean?
> We can say that shared memory represents a known global state [...]
Except that it's false; every single point can be countered with a single modern multicore CPU.
On an abstract level, distributed computation is a computation where it's possible that a message does not reach its intended destination within a fixed time bound, for any reason.
If you can guarantee message delivery within an arbitrary, but fixed, time bound, you can simulate shared memory, as the sketch below illustrates. Yes, performance will suck if the time bound is large, but you have an equivalent system. That's how HW works underneath anyway.
Tip: reading Lamport's papers is illuminating. You can also read about cache coherence protocols (e.g., MESI or MOESI) and you'll see that the illusion of shared memory is fundamentally implemented by message passing on HW level. The illusion is pretty much solid precisely because HW has time bounds on message delivery.
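A toy Java illustration of simulating shared memory with message passing (all names mine; in-process queues stand in for the reliable network): the only real mechanism is messages to a single owner thread, yet callers see an ordinary read/write register.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.SynchronousQueue;

    // A "shared register" built purely from message passing: one owner thread
    // holds the only copy of the value; readers and writers send it requests
    // over a reliable in-process channel and block on the reply.
    final class MessageRegister {
        private interface Msg {}
        private record Read(BlockingQueue<Integer> reply) implements Msg {}
        private record Write(int value) implements Msg {}

        private final BlockingQueue<Msg> inbox = new ArrayBlockingQueue<>(64);

        MessageRegister() {
            Thread owner = new Thread(() -> {
                int value = 0;                // the only copy of the "memory"
                try {
                    while (true) {
                        Msg m = inbox.take(); // reliable, in-order delivery
                        if (m instanceof Write w) value = w.value();
                        else if (m instanceof Read r) r.reply().put(value);
                    }
                } catch (InterruptedException ignored) { }
            });
            owner.setDaemon(true);
            owner.start();
        }

        void write(int v) throws InterruptedException { inbox.put(new Write(v)); }

        int read() throws InterruptedException {
            BlockingQueue<Integer> reply = new SynchronousQueue<>();
            inbox.put(new Read(reply));
            return reply.take();              // blocks until the owner answers
        }
    }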
Your second point is moot. Even in a multithreaded single-machine program you can load state and have it changed by another thread. That's bad design and not a distributed system characteristic.
Yes, and from this we can conclude that single-core processors are distributed systems too! It's counter-intuitive from an architectural (i.e. teleological) perspective, but it makes perfect sense when you approach it from a suitable level of analysis (intentionality).
Agreed. Giving this summary and then going into the details would benefit readers more. AWS's Builders' Library, while containing excellent content, also gives an overview of distributed systems that sort of points this out.
Maybe it's because Fowler's target readers are developers of enterprise software who are not familiar with distributed systems at all, and Fowler's background is not in distributed systems either. Therefore, he chose to use colloquial terms.
Isn't it rather "no synchronized data access"? Remote memory isn't a problem if you can read it in a synchronized fashion (taking locks and so on).
And actually "no synchronized information retrieval" is the default even on multithreaded, shared memory systems, which is why they're a lot like distributed systems. You can use mutexes and other synchronization primitives though, to solve some of the problems that just aren't solvable on a computer network, due to much higher latency of synchronization.
You can devise all sorts of distributed system architectures. You could, for example, have a synchronous system composed of nodes organized in a ring.
There is no "one definition" of what a distributed system is. You have to define that. There are some common distributed system architectures that perhaps most of us are familiar with: the asynchronous networked system, e.g. no shared memory with point-to-point communication. There are other dichotomies, though I'm not an expert in the field and am unable to succinctly define them.
As you add more "stuff" into your distributed system (people talking about adding a memcached or whatever in other comments), you've introduced a completely different system. Maybe some sort of hybrid. And if you're interested, you can formally reason about its behavior.
Regardless, you have to define what you're talking about.
It's an interesting question to ask: what is the most fundamental component of a distributed system? Could it be multiple processing nodes?
I'm not sure that is entirely accurate. If you have a memcache cluster and all data is stored in there, you have shared memory. Albeit slow and subject to atomicity problems, it's still shared state.
It's also a bad idea to rely on that to run your app, so there is that. But it's possible to have shared memory if you're willing to accept the hit to reliability.
Remote memory does not count as shared because of the atomicity problems. The same reason _local_ memory doesn't count as shared the minute you spin up two writer threads with access to the same mutable memory space. (And why Rust is popular for distributed problems that share the same clock)
If you replaced Memcache with a single Redis instance where all operations were managed by Lua scripts (i.e., you introduced atomicity to your operations), you wouldn't have a distributed system, just one with slow, sometimes rather faulty memory.
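A sketch of that setup, assuming the Jedis client and hypothetical key/value names: a Lua script sent via EVAL executes atomically on a single Redis instance, so a check-and-set like this cannot interleave with other commands.

    import java.util.List;
    import redis.clients.jedis.Jedis;

    public class AtomicCas {
        // Set the key to ARGV[2] only if it currently holds ARGV[1];
        // Redis runs the whole script as one atomic operation.
        static final String CAS_LUA =
            "if redis.call('GET', KEYS[1]) == ARGV[1] then " +
            "  redis.call('SET', KEYS[1], ARGV[2]); return 1 " +
            "else return 0 end";

        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                Object swapped = jedis.eval(CAS_LUA,
                        List.of("config:leader"),      // hypothetical key
                        List.of("node-a", "node-b"));  // expected and new values
                System.out.println(swapped);           // 1 if the swap happened, 0 otherwise
            }
        }
    }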
>It's a pity that they don't get to the crux of distributed systems, because it's been very well defined and described for ~40 years now.
Really?
I'm incidentally in the midst of a lit-review on the subject and it seems quite apparent that no standard definitions have emerged.
>The two fundamental ways in which distributed computing differs from single-server/machine computing are:
>1. No shared memory. 2. No shared clock.
The typical multiprocessor is, in fact, a distributed system under the hood. Most of the time the programmer is unaware of this thanks to cache coherence algorithms, which in turn benefit from a reliable communication layer between individual cores.
And yet, we can still observe consistency failures when we operate the chip outside of the parameters for which it guarantees a single-system image (namely: when using something like OS threads).
I think the problem is that we're using the wrong kind of definition. Your definitions -- and indeed most definitions encountered in the literature, with some exceptions -- appeal to design. They are teleological definitions, and as such they can't define "distributed" in the sense of "distributed programming" or "distributed computation". A more useful kind of definition is intentional [0]. It is constructed at a higher level of analysis that assumes the design serves the purpose of representing the world, among others. Thus, you get a definition like this:
Distributed computing is a computational paradigm in which local action taken by processes on the basis of locally-available information has the potential to alter some global state.
Returning to the initial multiprocessor example, the more useful question is often not whether computation is distributed, but when it makes sense to regard it as such. There are three typical cases in which an engineer is engaged in the practice of distributed computing:
1. He is designing or developing a distributed computing system.
2. The system is operating outside of specified parameters, such that design invariants no longer hold.
3. The system is malfunctioning, which is to say it violates its specification despite operating within specified parameters.
The second case is the most relevant to our prototypical multiprocessor. The use of OS threads, for example, can be understood as operating outside of the range of parameters for which the SSI can fulfill its guarantees. It is important to note that the system can still be made to function correctly (contrary to case #3), provided the programmer shoulders the burden of distributed control.
It's definitely possible -- and I would argue, correct -- to reframe "no shared memory" and "no shared clock" in intentional terms, but as we've seen with the multiprocessor example, those two conditions alone do not define "distributed system" in general; they are not fundamental properties. I will however grant that they are the most common manifestations of distribution in practice.
To summarize: the literature has not -- to my knowledge -- arrived at a good definition ~40 years ago. If I've missed something, please point it out, though. I'd hate to publish something incorrect. :)
I think you're both trying to define a general category of things which aren't all the same. The term is just intended to point out that it's a system whose computing is spread out, typically with the intention of building a single computing system that is more powerful than its individual components. It just happens that there's a lot of commonalities between the different "spread out systems", but there's not necessarily some grand unified definition for all of them.
The only thing the average person needs to understand is that they're complex and error-prone, and their cost was more justifiable when we really needed to tie hardware together to compute faster. I do think the article should review the popular distributed system patterns of yesteryear; there was a lot of really useful tech that none of the young whippersnappers today have even heard of.
>It just happens that there's a lot of commonalities between the different "spread out systems", but there's not necessarily some grand unified definition for all of them.
I disagree. There is a unifying definition, but it requires moving away from design/architecture words like "spread out" and towards subject-object/representation terms like "local visibility".
>The only thing the average person needs to understand is they're complex and error-prone [...]
Agreed, but the topic was standard (implied: universally correct) definitions.
How does "ephemeral computing" fit into your notion of distributed systems? Perhaps this is a concern not shared by all distributed systems, but it's a practical reality that we in the cloud space have to deal with pervasively and it drives profound architectural differences.
Martin Fowler is in the certification for box checkers business.
99% of the people that read them work at places where they must "move to X" to justify some department. They will likely implement a simulacrum of X (usually by importing some Java library someone wrote as homework), adding all the pitfalls and future problems of X with zero of the benefits of X.
Haven’t really read Fowler’s stuff, but I have read Martin Kleppmann’s Designing Data-Intensive Applications and that was helpful. Haven’t seen it mentioned here (though I haven’t looked thoroughly through the comments). Just thought I’d mention it here.
Agreed, Fowler, Martin, etc. are often criticized not because of their work but because of their audience, or more specifically their paying customers. Makes little sense to me, I got a lot out of their writing, especially in the early days of OO.
Fowler and Beck in particular have been massively useful to me recently. Refactoring and TDD by example are wonderful books, and completely changed my approach to software.
I also love Feathers and Working Effectively with Legacy Code, but that might be more of a niche taste ;)
Working Effectively is a bit niche, but unless a team is willing to rewrite from scratch or stumbled on a perfect design, refactoring and dealing with untested code will hit everyone eventually. Hell, test suites may themselves not be sufficiently preserved or remain fully functional after enough time has passed. One of my best time investments was reading that book.
It's been a long time, so I had to ponder this question for a bit.
In context, I was working on precisely the kinds of systems that Feathers is talking about when I read the book. His definition of "legacy" revolves largely around a lack of testing which makes changing the system very difficult (you can't be sure what you've done hasn't broken something else, slows you down to make a proper analysis, or you move fast and almost certainly break things). We had inherited a 500k SLOC embedded system for maintenance, with changes needed, but the previous developers had stripped out all tests when they shipped it. Literally, I found the directories for tests and the test framework and every file was removed. So even if the system wasn't legacy before, it was by the time it got to us.
While I learned a lot, the primary benefit was improving my ability to communicate ideas I already had formed or solidified those ideas in my mind. I had struggled for a long time (I've talked about this in other posts) with managers not seeing the value of spending time building out automated test suites or enforcing good architecture/design, and there's only so much one person can do. Having a more articulate argument was very helpful.
Things I believed before and were reinforced by the book: automate testing as much as possible, unit testing is generally a positive and enabled by good modular design, modular design is critical to keeping a system sustainable over the medium and long term.
I'm looking through the table of contents to make sure what I'm talking about next came from this book and not other readings.
Particularly useful: his discussion of the "seam" model, which is, in other terms, related to finding module boundaries and dividing the system cleanly into those modules when it wasn't already. If a system wasn't designed with modularity in mind, or the clean initial lines weren't maintained, then these seams can be hidden away or blurred. He discusses, early in the book, how to find them, how to refactor to separate these modules, and how to develop tests for each module. Looking at the TOC, it looks like chapters 9 and 10 were probably two of the more useful ones for me ("I can't get this class into a test harness" and "I can't get this method into a test harness").
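For flavor, a minimal Java sketch of one seam variety Feathers covers, subclass-and-override (class names are my own invention): the hard-to-test call is isolated behind a protected method that a test subclass overrides to cut the wire.

    // Production class: the protected method is the seam.
    class InvoiceSender {
        void send(String invoiceId) {
            deliver("invoice " + invoiceId);   // behavior under test
        }
        protected void deliver(String body) {
            // imagine real SMTP / network code here
        }
    }

    // Test double: overrides the seam to record instead of sending.
    class TestableInvoiceSender extends InvoiceSender {
        final java.util.List<String> delivered = new java.util.ArrayList<>();
        @Override protected void deliver(String body) {
            delivered.add(body);
        }
    }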
Also, if you're working on a clean slate system this book is still helpful as it can help you keep the pitfalls in mind so you avoid creating another legacy app. A lot of what it discusses is still present in other good books (particularly those with an emphasis on refactoring, like, well, Refactoring and OO/modular design), but in the context of trying to change/improve a system that doesn't have a good (present) architecture or tests to aid you.
One caution: the print copies have gotten worse over the years. Get a digital copy or an old print copy. My office had a printing with about 20 pages repeated on top of 20 missing pages (for example, page 140->141 is actually 140->121 or whatever the numbers were), and when I bought my own, much more recent, hard copy (which I think I left with a coworker at that office when I moved, because it's not here at home) it had other, similar, issues. I think I had a whole chapter missing, with two half chapters repeated in its place. O'Reilly has it in their digital library, which I have had access to through most employers, so that's where I actually ended up reading most of it.
I'm honestly not sure why not having RabbitMQ/Erlang or X tech listed as an example of Y is such a big deal - especially when the other examples are fine examples.
The target audience seems to be enterprise developers, running JVM applications and needing an introduction to distributed computing. Very unlikely these sorts will jump on the BEAM-wagon.
But completely agree that it should be mentioned, in my view Erlang is probably the most elegant approach to the problem.
Oh, see, I saw that book after watching dozens of dot-com companies re-implementing common patterns (poorly). Basically under-engineering heck. So, once we had the book, we could shift the conversation and use common terminology.
The discipline is, of course, knowing which spells to cast and when. And it's super useful to see the scroll of common spells.
A problem in a lot of cases is that for some people, the discipline is about casting spells as opposed to solving problems. It doesn't take a lot of these people to damage projects, but they exist independent of pattern books.
If I was going to complain about engineering hell, it'd be about how shallow we investigate these areas. We stop at the immediate technical events, or inappropriate technology choices, instead of getting to the motivations behind it. This gets into a qualitative area that HN readers tend to want to avoid, but these flaws are with the humans and their circumstances.
I'd argue that the over-engineering hell preceded this book, with every tiny J2EE project using a full 3-tier stack, including (but not limited to) stateful EJBs, session EJBs, countless JSPs, DAO layers, and god only knows what else.
It was this book that actually revealed, in patterns, the bigger picture, where it was all going wrong, and the alternatives. This book and Spring v1.0 go along together in my mind.
BTW (and OT), another underrated Fowler book in a similar vein is "Analysis Patterns", another 10k-ft view of common business problems.
I'm a little torn on this one. On the one hand I see it, because it feels like patterns books made the limitations of C++/Java/C# seem respectable, even desirable. Norvig observed this with his talk on patterns in dynamic languages (p. 10 of https://norvig.com/design-patterns/ summarizes this).
On the other hand, I have found his work to be useful in selling approaches to others. There were times where I did a design, then browsed the patterns books to find names for what I was doing to "make it respectable."
I've done the opposite, disguising something that is a named pattern as anything but a factory to prevent structure-avoidant zealots from knee-jerk reacting to the word in a PR.
A usable IoC container can actually be implemented in just 4-5 classes. Yes, it won't be as powerful as Guice, but for some (or a lot of?) applications it's probably enough.
You don't have to write interfaces to have dependency injection. You can and should declare all your dependencies as constructor parameters. If you create them inside a class, you'll permanently lock the two together, making it a testability nightmare. A minimal sketch of both ideas follows below.
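A minimal sketch of such a container (class names mine, not from any framework), assuming public constructors: bind interfaces to implementations, then resolve by recursively instantiating constructor parameters via reflection.

    import java.lang.reflect.Constructor;
    import java.util.HashMap;
    import java.util.Map;

    // Roughly the "4-5 classes" of the parent comment, collapsed into one for brevity.
    final class TinyContainer {
        private final Map<Class<?>, Class<?>> bindings = new HashMap<>();

        <T> void bind(Class<T> iface, Class<? extends T> impl) {
            bindings.put(iface, impl);
        }

        @SuppressWarnings("unchecked")
        <T> T resolve(Class<T> type) throws ReflectiveOperationException {
            Class<?> impl = bindings.getOrDefault(type, type); // concrete types resolve as themselves
            Constructor<?> ctor = impl.getDeclaredConstructors()[0];
            Class<?>[] params = ctor.getParameterTypes();
            Object[] args = new Object[params.length];
            for (int i = 0; i < args.length; i++) {
                args[i] = resolve(params[i]); // recurse into constructor dependencies
            }
            return (T) ctor.newInstance(args);
        }
    }

Usage would be along the lines of `container.bind(Repository.class, JdbcRepository.class)` followed by `container.resolve(Service.class)` (hypothetical names); every dependency stays visible in the constructor, which is exactly what makes the classes testable without the container.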
I think this is a good one. Most recent Java/C# systems would have a full IoC container, but have no dynamically selected components (which is how frameworks like Dagger -https://dagger.dev/ - can exist). A lot of runtime reflection/calculation gets done for something that can be known at compile time.
I was exactly the same in my earlier dev days! I'd learn about a pattern, think it was the greatest thing since sliced bread, and see uses for it everywhere... where it really wasn't suitable at all. I was like a builder with only a hammer - everything looked like a nail!
I ended up making a lot of things a lot more complex than they needed to be, and/or a lot less performant than they could be.
At some point, lots of people seemed to come to the same realisation - worshipping design patterns is a bad idea. Around this time I first heard about "cargo culting".
These days my motto is "no dogma" (well, except about not being dogmatic ;), and I think it serves me well.
I think I know what you're saying. I remember that time, around the birth of Java. But I also remember that we were trying to solve real problems that aren't such big problems anymore: modularization, dependency management, large codebases with crude editing tools, SW delivery and distribution in the age of diskettes!
It turns out that developing large systems in static languages with heavily constrained libraries is difficult. Rapid zero-cost distribution and decentralized storage AND GUIs have really changed the field. Does anyone even call themselves a SW Engineer anymore?
I'll plant a midpost flag and say, while it shouldn't be revered as a bible of eternal truths, it did document and progress the discussion on many things.
I think Fowler does a good job of identifying and classifying things, but that hasn't necessarily made IT / Enterprise all that simpler. What has made "progress" in IT has fundamentally been programmers treating more and more things like code and automating the hell out of everything.