Your example is a conceptually simple filter on a single list of items. But once the chain grows too long, the conditions become too complex, and there are too many lists/variables involved, it becomes impossible to understand everything at once.
In a procedural loop, you can assign an intermediate result to a variable. By giving it a name, you can forget the processing you have done so far and focus on the next steps.
You don't ever need to "understand everything at once". You can read each stanza linearly. The for loop style is the approach where everything often needs to be understood all at once since the logic is interspersed throughout the entire body.
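To make the "interspersed logic" point concrete, here is a rough Python sketch of the books example written as a for loop. The `Book` class, its field names, and the sample data are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Book:  # hypothetical record type, not from the thread
    title: str
    author: str
    page_count: int


# Made-up sample data.
books = [
    Book("A", "Ann", 1200),
    Book("B", "Bob", 300),
    Book("C", "Ann", 1500),
    Book("D", "Cy", 1001),
]

seen = set()
distinct_authors = []
for book in books:
    # The filter, the projection to author, and the de-duplication
    # all live in the same loop body and share `seen` as state.
    if book.page_count > 1000:
        if book.author not in seen:
            seen.add(book.author)
            distinct_authors.append(book.author)
```

Here the three concerns cannot be read one after another: to know what ends up in `distinct_authors` you have to follow both conditions and the `seen` bookkeeping together.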
This. I teach this with Pandas (and Polars) all the time. You don't really care about the intermediate values. You build up the chain operation by operation (validating that it works). At the end you have a recipe for processing the data.
Most professional Pandas users realize that working with chains makes their lives much easier.
By the way, debugging chains isn't hard. I have a chapter in my book that shows you how to do it.
In the example above, you first have a list of books. Then you filter it down to books with >1000 pages. Then you map it to authors of books with >1000 pages. Then you collapse it to distinct authors of books with >1000 pages. Every step in the chain adds further complexity to the description of the things you have, until it exceeds the capacity of your working memory. Then you can no longer reason about it.
The standard approach to complexity like that is to invent useful concepts and give them descriptive names. Then you can reason about the concepts themselves, without having to consider the steps you used to reach them.
Folks who are familiar with chaining don't think about it in the way that you've presented. If you're familiar, it's more like:
Filter to the books with >1000 pages.
Then their authors.
Finally, deduplicate those authors.
If you're familiar, you don't mentally represent each link in the chain as the totality of everything that came before it _plus_ whatever operation you're doing now. You consider each link in the chain in isolation, as its inputs are the prior link and its outputs will be used in the next link. Giving a name to each one of those links in the chain isn't always necessary, and depending on how trivial the operations are, can really hurt readability.
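For what it's worth, the three-step reading carries over to Python too. Python lists have no `.filter`/`.map` methods, so the closest idiom is a comprehension; this is only a sketch, and `Book` plus the sample data are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Book:  # hypothetical record type
    title: str
    author: str
    page_count: int


# Made-up sample data.
books = [
    Book("A", "Ann", 1200),
    Book("B", "Bob", 300),
    Book("C", "Ann", 1500),
    Book("D", "Cy", 1001),
]

distinct_authors = list(dict.fromkeys(  # finally, drop duplicate authors (first-seen order)
    book.author                         # then their authors
    for book in books
    if book.page_count > 1000           # filter to the books with >1000 pages
))
```

Each commented line can be read on its own, and no intermediate result needed a name.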
The problem with that is that it's all implicit. If the steps are sufficiently complex and if you don't already know what the code is doing, you don't always have a clear mental image of what the intermediate state after each step is supposed to represent. And with a chained syntax like that, you don't have an option to give the intermediate state an explicit name. A name that could help the reader understand what is going on.
You don't have to give a name to every intermediate state, just like you don't have to comment every single line of code. But sometimes the names and comments do improve readability.
The problem is the "you" in question is not always able to. When "you" write code it makes sense, and so you don't need to assign many names. The you in six months will want more names, and in six years that will be different again (how many depends: if this code is changed often, then you know it much better than if it has been stable). The worst case will be after you "get hit by a bus" and the "you" in question is some poor person who has never seen this code before.
Unlike the procedural approach, every step in a functional chain is wholly isolated and independent from the others. It is strictly easier to split this style of code up into two halves and name them than it is to disentangle procedural equivalents.
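As a small sketch of what "split it into two halves and name them" can look like in Python: each half becomes a function that only sees its input. The function names, the `Book` class, and the data are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Book:  # hypothetical record type
    title: str
    author: str
    page_count: int


def long_books(books, min_pages=1000):
    """First half of the pipeline: keep only books above the page threshold."""
    return [book for book in books if book.page_count > min_pages]


def distinct_authors(books):
    """Second half: authors of the given books, duplicates dropped, first-seen order."""
    return list(dict.fromkeys(book.author for book in books))


# Made-up sample data.
books = [
    Book("A", "Ann", 1200),
    Book("B", "Bob", 300),
    Book("C", "Ann", 1500),
    Book("D", "Cy", 1001),
]

authors_of_long_books = distinct_authors(long_books(books))
```

Because neither half touches shared state, the cut point could just as well sit somewhere else without changing the result.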
I have quite literally zero times in my ~25 year career had to deal with some sort of completely inscrutable chain of functional calls on iterators. Zero. I am entirely convinced that the people arguing against this style have never actually worked in a project where people used this style. It's okay! The first time I saw these things I, too, was terribly confused and skeptical.
I will admit to not having written any significant functional code. However, the poster children for functional programming always seem to be small programs (xmonad is the largest one I can think of, and its procedural counterparts are not that big either; of course, there is a lot of code out there that nobody can talk about). Thus I have to conclude that the question of how that style scales to really large programs remains open.
That said, you didn't address my comment at all. It might be easier, but that doesn't mean it is easy to figure out what that long chain is really doing. In my experience, the algorithm names all too often don't tell you what you are really trying to accomplish.
Ahh, it gets really interesting when you read code that does have named variables… and they’re misleading.
A strength of functional idioms is that they expose the structure of the code in a way that a name - even a well chosen name - can only hope to achieve. Often, succinctly and comprehensively. At that point you stop caring so much about variable names. They’re still there but you need them less.
You have literally just described the set of objects asked for: the unique authors of the books with more than 1,000 pages. I don't understand how you expect to get any simpler than that. The functional style isn't even requiring you to describe how to accomplish it, it almost verbatim simply describes the answer you're trying to get.
If your entire objection is that you might want intermediate-named variables… you can just do that?
var longBooks = books.filter(book => book.pageCount > 1000)
var authors = longBooks.map(book => book.author)
var distinctAuthors = authors.distinct()
For short chains (95%+ of cases), this is far more mental overhead. For the remaining cases, you can just name the parts? I'm just completely failing to see your problem here.
The problem is that it's easy to overdo it. When you are writing the code, you already know what it's supposed to do, and adding a few more things to the chain is convenient and attractive. But when you are reading unfamiliar code, you often wish that the author had been more explicit. Not just about what the code is actually doing, but about what it's trying to do and what the key waypoints are to get there.
With procedural code, it's widely accepted that you should not do too many things in a single statement. But in functional code, the entire chain is a single statement. There are no natural breakpoints where the reader could expect to find justifications for the code.
> But in functional code, the entire chain is a single statement. There are no natural breakpoints where the reader could expect to find justifications for the code.
How are we deciding what's "functional code", here? Because functional languages also provide means like `let` and `where` bindings to break up statements. The example might in pseudo-Haskell be broken up like
IMO the code here is also simple enough that I don't see it needing much in the way of comments, but it is also possible and common to intersperse comments in the dot style, e.g.
distinctAuthors = books // TODO: Where does this collection come from anyway?
// books are officially considered long if they're over 1000 pages, c.f. the Council of Chalcedon (451)
.filter(book => book.pageCount > 1000)
// All books have exactly one author for some reason. Why? Shouldn't this be a flatmap or something?
.map(book => book.author)
// We obviously actually want a set[author] here, rather than a pruned list[author],
// but in this imaginary DinkyLang we'd have to implement that as map[author, null]
// and that's just too annoying to deal with
.distinct()
If you're used to it, then it doesn't read like a single statement, even though technically it is. You put each call of the chain on its own line and it feels like reading the lines of regular imperative code. Except better because I can be sure that each line strictly only uses the result of the previous line, not two or three lines before so the logic flows nicely linearly.
Welcome to all features of every programming language?
Sacrificing readability, optimization, and simplicity for the 95% case because some unprincipled developers might overdo it in the 5% case (when the cost of fixing it is trivially just inserting variable assignments) is… not a good trade-off.
5% is common enough that you'll encounter it almost every time you read code. And fixing it is not easy, because you first need to understand the code before you can add useful variable names.
Besides, programming language evolution is mostly driven by the fact that everyone is lazy and unprincipled at least occasionally. If you need to be disciplined to avoid footguns, you'll trigger them sooner or later.
The cost of this "footgun" is basically zero. Every step in a functional pipeline is isolated and wholly independent. If you want to split such a pipeline in two, doing so is trivial.
Your example shows that it is possible to give NAMES to the intermediate results of a long chain.
Giving names to things makes it easier to understand the intention of the programmer.
And that also allows you to create a TREE of dataflow-code not just a CHAIN. For instance 'longBooks' could be used as the starting point of multiple different chains.
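A quick Python sketch of that tree shape, where one named intermediate feeds two independent branches. `Book`, the field names, and the numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Book:  # hypothetical record type
    title: str
    author: str
    page_count: int


# Made-up sample data.
books = [
    Book("A", "Ann", 1200),
    Book("B", "Bob", 300),
    Book("C", "Ann", 1500),
    Book("D", "Cy", 1001),
]

# Shared, named starting point.
long_books = [book for book in books if book.page_count > 1000]

# Branch 1: who wrote the long books?
distinct_authors = sorted({book.author for book in long_books})

# Branch 2: how many pages do the long books add up to?
total_pages = sum(book.page_count for book in long_books)
```

Note that `long_books` is materialized as a list rather than a generator, since a generator could only be consumed by one of the two branches.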
It gets complicated at some point but I think other approaches result in code that is even harder to understand.
How so? The states of the intermediate steps are logically and easily exposed in a debugger. You can also easily set conditional breakpoints relative to the intermediate states.
I know that intermediate states are generally easier to comprehend, because I never have to explain them in code reviews. To avoid having to explain chains to others, I end up having to add descriptive comments to the intermediate steps, far exceeding the number of characters the descriptive intermediate variables would take. That's why I avoid them, or break them up: time spent in code reviews has proven to me that people have trouble with chains.
Build up and debug the chain as you work in an environment like Jupyter. No need to create variables. Just run the code and verify that the current step works. Then, proceed to the next. Then, put the chain in a function. If you want to be nice, put a .loc as the first step to explicitly list all of the input columns. Drop another .loc as the last step to validate the output columns. (This also serves as a test and documentation to future you about what needs to come in and out.) Create a simple unit test with a sample of the data if you desire.
I've found that the constraint of thinking in chains forces me to think of the recipe that I need for my data. Of course, not everything can be done in a stepwise manner (.pipe helps with that), but often, this constraint forces you to think about what you are doing.
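Roughly, such a chain might look like the sketch below. The column names, the `keep_longer_than` helper, and the sample frame are assumptions invented for the example; only `.loc`, `.pipe`, `.drop_duplicates`, and `.reset_index` are pandas API.

```python
import pandas as pd


def keep_longer_than(df: pd.DataFrame, min_pages: int) -> pd.DataFrame:
    """Hypothetical custom step, plugged into the chain via .pipe."""
    return df[df["page_count"] > min_pages]


def long_book_authors(books: pd.DataFrame) -> pd.DataFrame:
    return (
        books
        # First .loc: list the input columns explicitly.
        # A missing column raises KeyError right here.
        .loc[:, ["title", "author", "page_count"]]
        .pipe(keep_longer_than, min_pages=1000)
        .drop_duplicates(subset="author")
        # Last .loc: document and validate the output columns.
        .loc[:, ["author"]]
        .reset_index(drop=True)
    )


# Made-up sample data, including a column the recipe ignores.
books = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "author": ["Ann", "Bob", "Ann", "Cy"],
        "page_count": [1200, 300, 1500, 1001],
        "isbn": ["1", "2", "3", "4"],
    }
)

result = long_book_authors(books)
```

While building it up in a notebook, you would run the expression after adding each line and look at the frame before moving on.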
Every good Pandas user I know uses it this way. I've taught hundreds more. Generally, it feels weird at first (kind of like whitespace in Python), but after a day, you get used to it.
Most languages don't expose the internals of map to set a breakpoint, so you're left with individual entities. But yes, there are tricks you can use to make it work, although most require more complex conditional/sequential breakpoints. In your method breakpoint example, you would need to set a chained breakpoint, as in "don't break until this other breakpoint above the chain has been hit", otherwise the breakpoint in the method won't be "spatially" relevant to the code you're debugging.
Each predicate is a separate scope. How is the complexity additive? If you really have to you can simply be just as specific in your predicate naming as you would in a for loop.
That's only true for casual reviewing and writing.
When you're actually analyzing a bug, or need to add a new feature to the code... Then you'll have to keep the whole thing in your mind. No way around it.
It gets extra annoying when people have complex maps, reduces, flat maps all chained one after the next, and each step moved into a named function.
Have fun constantly jumping around trying to rationalize why something happens with such code...
It looks good on first glance, but it inevitably becomes a dumpster fire as soon as you need to actually interact with the code.
In a practical example you'd create a named intermediate type which becomes a new base for reasoning. Once you convinced yourself that the first part of the chain responsible for creating that type (or a collection of it) is correct, you can forget it and free up working memory to move on to the next part. The pure nature of the steps also makes them trivially testable as you can just call them individually with easy to construct values.
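A minimal Python sketch of that idea, assuming raw rows arrive as dicts. The `LongBook` type, the function names, and the keys `"title"`, `"author"`, `"pages"` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LongBook:
    """Named intermediate type: a book already known to exceed the page threshold."""
    title: str
    author: str


def to_long_books(rows, min_pages=1000):
    """First part of the chain: raw rows -> LongBook values."""
    return [
        LongBook(row["title"], row["author"])
        for row in rows
        if row["pages"] > min_pages
    ]


def distinct_authors(long_books):
    """Second part: reasons only about LongBook, never about raw rows."""
    return list(dict.fromkeys(book.author for book in long_books))


# Each pure step can be checked on its own with hand-built values.
assert to_long_books(
    [
        {"title": "A", "author": "Ann", "pages": 1200},
        {"title": "B", "author": "Bob", "pages": 10},
    ]
) == [LongBook("A", "Ann")]
assert distinct_authors(
    [LongBook("A", "Ann"), LongBook("C", "Ann"), LongBook("D", "Cy")]
) == ["Ann", "Cy"]
```

Once `to_long_books` is trusted, the second step can be read and tested without thinking about page counts at all.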