> 2017’s Attention is All You Need was groundbreaking and paved the way for ChatGPT et al. Since then ML researchers have been trying to come up with new architectures, and companies have thrown gazillions of dollars at smart people to play around and see if they can make a better kind of model. However, these more sophisticated architectures don’t seem to perform as well as Throwing More Parameters At The Problem. Perhaps this is a variant of the Bitter Lesson.
This is not true and unfortunately this significantly reduced the credibility of this article for me. Raw parameter counts stopped increasing almost 5 years ago, and modern models rely on sophisticated architectures like mixture-of-experts, multi-head latent attention, hybrid Mamba/Gated linear attention layers, sparse attention for long context lengths, etc. Training is also vastly more sophisticated.
The Bitter Lesson is misunderstood. It doesn't say "algorithms are pointless, just throw more compute at the problem", it says that general algorithms that scale with more compute are better than algorithms that try to directly encode human understanding. It says nothing about spending time optimising algorithms to scale better for the same compute, and attention algorithms and LLMs in general have significantly advanced beyond "moar parameters" since the time of Attention is All You Need/GPT2/GPT3.
Literally the paragraph right before the one you quote is this:
> I am generally outside the ML field, but I do talk with people in the field. One of the things they tell me is that we don’t really know why transformer models have been so successful, or how to make them better. This is my summary of discussions-over-drinks; take it with many grains of salt. I am certain that People in The Comments will drop a gazillion papers to tell you why this is wrong.
As I understand it, this article is basically a conglomeration of several articles the author has attempted over the past decade or so, considering the impacts of AI on society. In their own words:
> Some of these ideas felt prescient in the 2010s and are now obvious. Others may be more novel, or not yet widely-heard. Some predictions will pan out, but others are wild speculation. I hope that regardless of your background or feelings on the current generation of ML systems, you find something interesting to think about.
As for the "Bitter Lesson" part, they pretty much directly said that it wasn't the Bitter Lesson exactly, saying it might be a variant of it. Honestly, it felt more like a way of throwing in a reference to something that also might provoke thought, which was done throughout the piece (which again, is the entire point).
It's totally valid to say "this article didn't provoke much thought for me". I'm a bit confused at why you think a lack of specific domain knowledge in a domain that they literally state they are not an expert in would be disqualifying for that purpose though.
The title of the article is “The Future of Everything is Lies, I Guess” and the first part is literally complaining about LLMs being bullshit machines, while the author proceeds to tell confabulations (or lies) of his own. Is there not a bit of irony in that?
If you’re a non-expert in a field, I don’t think it’s a good sign if you’re writing a 10 part article about that field’s impact on society and getting basic facts wrong. How can I trust that the conclusions will be any more credible?
> The title of the article is “The Future of Everything is Lies, I Guess” and the first part is literally complaining about LLMs being bullshit machines, while the author proceeds to tell confabulations (or lies) of his own. Is there not a bit of irony in that?
Maybe some, but not that much given the disclaimers I cited above. There's value in a qualitative confidence level for a statement, and I'd argue that this is something that LLMs do not seem to produce in practice without someone explicitly asking for it. The human author's ability to anticipate potential mistakes in their logic and communicate those ahead of time is not equivalent to the type of fabrications that LLMs routinely make.
> If you’re a non-expert in a field, I don’t think it’s a good sign if you’re writing a 10 part article about that field’s impact on society and getting basic facts wrong. How can I trust that the conclusions will be any more credible?
I don't know why an expert in LLM implementation would be inherently more qualified to analyze the second-order effects of their product than anyone else. There's precedent for people who are "too close" to something having biases that make them less effective at recognizing how tools will get used by non-experts, and society as a whole is largely composed of people who are not experts in LLM implementations. If you want to understand the net effect of everyone having access to LLMs, having an understanding of people is probably more important than knowing exactly what an LLM does under the hood.
Might the conclusions be correct even if some of the facts are not? Even a stopped clock is right twice a day. And, "approximately correct" is still sometimes valuable.
The most obvious reason is that transformers accept a sequence as an input and produce a sequence as an output. The vast majority of pre-transformer architectures only accepted a fixed input and output size. Before 2016 I was somewhat interested in ML, but my curiosity vanished because of the fixed input and output size limitations.
RNNs including LSTMs at the time were pretty bad and difficult to train due to vanishing and exploding gradients at long sequence lengths and sequential training along the sequence length. Meanwhile transformers can be parallelized along the sequence length.
Then there are theoretical limitations. Transformers re-read the entire sequence for every output. This leads to quadratic attention. There are plenty of papers that tell you why it is impossible to replicate the properties of quadratic attention with linear attention.
The reason is blatantly obvious. If you want linear attention to have the same capability, you need to re-read the entire input sequence after every output. If you do this at the token level, then you have basically implemented quadratic attention.
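A toy numpy sketch (illustrative shapes only, not any production implementation) makes the quadratic cost visible: the score matrix every output depends on is L x L, so doubling the sequence length quadruples it.

```python
import numpy as np

def attention_scores(x, d=16, seed=0):
    """Scaled dot-product attention scores for a length-L sequence.
    The score matrix is L x L: every token attends to every token."""
    rng = np.random.default_rng(seed)
    Wq = rng.standard_normal((x.shape[1], d))
    Wk = rng.standard_normal((x.shape[1], d))
    q, k = x @ Wq, x @ Wk
    return q @ k.T / np.sqrt(d)  # shape (L, L)

short = attention_scores(np.ones((64, 32)))
longer = attention_scores(np.ones((128, 32)))
print(short.shape, longer.shape)  # (64, 64) (128, 128): 2x length -> 4x entries
```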
Transformers aren't a mystery success, they are using computational brute force, which is hard to beat with other architectures. If you go with a more efficient architecture, you are by definition giving up some non-zero capabilities. Nobody really cares about getting slightly worse results from a much more efficient architecture. In the current ML space, it's SOTA (state of the art) or go home.
Transformers do have a fixed input/output size though - that's what a context window is. It's just that, via scaling and algorithmic improvements, the length of usable context windows has increased to the point that they're much less of a bottleneck.
I think your points around parallelisation and the flexibility of quadratic attention are spot-on though.
Transformers have a fixed input size (padding the unneeded context window with null tokens). Whether you put in a sequence of things or just random tokens is irrelevant; to the network it is just "one input".
They also have a fixed output of one probability distribution for the next one token.
Running it in a loop does not mean it can work with sequences; by that definition, literally anything else can too.
Sorry but that's false, you are confusing transformers as an architecture, and auto-regressive generation, and padding during training.
Standard transformers take in an arbitrary input size and run blocks (self and possibly cross attention, positional encoding, MLPs) that don't care about its length.
> They also have a fixed output of one probability distribution for the next one token.
No, in most implementations, they output a probability distribution for every token in the input. If you input 512 tokens, you get 512 probability distributions. You can input however many tokens you want - 1, 2048, one million, it's the same thing (although since standard self-attention scales quadratically you'll eventually run out of memory). Modern relative embeddings like RoPE can support infinite length although the quality will degrade if you extrapolate too far beyond what the model saw during training.
For typical auto-regressive generation, they are trained with causal masking/teacher forcing, which makes it calculate the probability for the next token. During inference, you throw away all but the last probability distribution and use that to sample the next token, and then repeat. You also do this with an RNN. An autoregressive CNN (e.g. WaveNet) would be closer to what you described in that it has a fixed window looking backwards.
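A toy greedy-decoding loop makes the "throw away all but the last distribution" step concrete. `toy_model` here is a made-up stand-in for a causal transformer (it just returns random distributions), not a real model:

```python
import numpy as np

def toy_model(tokens, vocab=10):
    """Stand-in for a causal transformer: returns one probability
    distribution per input position (shape: len(tokens) x vocab)."""
    rng = np.random.default_rng(sum(tokens))
    logits = rng.standard_normal((len(tokens), vocab))
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def generate(prompt, steps):
    tokens = list(prompt)
    for _ in range(steps):
        dists = toy_model(tokens)           # one distribution per position...
        next_tok = int(dists[-1].argmax())  # ...but only the last one is used
        tokens.append(next_tok)
    return tokens

out = generate([1, 2, 3], steps=4)
print(len(out))  # 7: three prompt tokens plus four generated
```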
But a transformer doesn't have to be used for auto-regressive generation, you can use it for diffusion, as a classifier model, for embedding text. It doesn't even see a sequence as spatially organised - unlike a CNN or an RNN it doesn't have architectural intrinsic biases about the position of elements, which is why it needs positional embeddings. This lets you have 2D, 3D, 4D, or disordered elements in a sequence. You can even have non-regularly sampled sequences. (Again this is for a classic transformer without sliding window attention or any other special modifications).
> (padding the unneeded context window with null tokens).
To have efficient training, you pad all samples in a batch to have the same length (and maybe make it a power of two). But you are working with a single sequence, the length is arbitrary up to hardware limitations, and no padding is needed.
The network has a fixed number of input neurons. You have to put something in all of them.
If you enter "hello", the network might get " hello", but all of its inputs need some value. It doesn't (and can't) process tokens one at a time.
"No, in most implementations, they output a probability distribution for every token in the input."
A probability distribution obviously contains a probability for every possible next token. But the whole probability distribution (which adds up to one) only predicts the next ONE token. It predicts what is the probability of that one token being A, or B, or C, etc, giving a probability for each possible token. It's still predicting only one token.
In anything but the last column, the numbers are junk. You can treat them as probability distributions all you want, but the system is only trained to get the outputs of the last column "correct".
Not to be rude, but you're arguing with a machine learning engineer about the basics of neural network architectures :P
> The network has a fixed number of input neurons. You have to put something in all of them.
The way transformers work is that they apply the same "input neurons" to each individual token! It's not:
Token 1 -> Neuron 1
Token 2 -> Neuron 2
Token 3 -> Neuron 3...
With excess neurons not being used, it's
Token 1 -> Vector of dimensions N -> ALL neurons
Token 2 -> Vector of dimensions N -> ALL neurons
Token 3 -> Vector of dimensions N -> ALL neurons
...
Grossly oversimplified, in a typical transformer layer, you have 3 distinct such "networks" of neurons. You apply each of them to every token, giving you, for each token, a "query", a "key", and a "value". You take the dot product of the query and key, apply softmax, then multiply the result with the values, giving you the vector to input to the next layer.
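That description can be sketched in a few lines of numpy (a deliberately simplified single-head layer with invented dimensions; no masking, multi-head splitting, or MLP). The point is that the same three weight matrices are applied to every token, so the layer accepts any sequence length:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # model dimension (invented for illustration)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_layer(x):
    """x: (L, d) sequence of token vectors, for any L."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv        # same weights for every token
    scores = softmax(q @ k.T / np.sqrt(d))  # (L, L) attention weights
    return scores @ v                       # (L, d): one output per token

print(attention_layer(rng.standard_normal((3, d))).shape)    # (3, 8)
print(attention_layer(rng.standard_normal((100, d))).shape)  # (100, 8)
```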
> A probability distribution obviously contains a probability for every possible next token. But the whole probability distribution (which adds up to one) only predicts the next ONE token. It predicts what is the probability of that one token being A, or B, or C, etc, giving a probability for each possible token. It's still predicting only one token.
> In anything but the last column, the numbers are junk. You can treat them as probability distributions all you want, but the system is only trained to get the outputs of the last column "correct".
Not quite: the reason transformers train fast is that you can train on all columns at once.
For tokens 1, 2, 3, 4, ... you get predictions for tokens 2, 3, 4, 5... Typical autoregressive transformer training uses a causal mask, so that token 1 doesn't see token 2, enabling you to train on all the predictions at once.
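A minimal sketch of the mask and the shifted targets (shapes invented for illustration):

```python
import numpy as np

L = 5
# Causal mask: position i may attend to positions <= i only.
mask = np.tril(np.ones((L, L), dtype=bool))
print(mask.astype(int))

# Training pairs: inputs are tokens[:-1], targets are tokens[1:],
# so positions holding tokens 1,2,3,4 predict tokens 2,3,4,5
# in a single forward pass.
tokens = np.array([1, 2, 3, 4, 5])
inputs, targets = tokens[:-1], tokens[1:]
print(inputs, targets)  # [1 2 3 4] [2 3 4 5]
```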
>Raw parameter counts stopped increasing almost 5 years ago, and modern models rely on sophisticated architectures like mixture-of-experts, multi-head latent attention, hybrid Mamba/Gated linear attention layers, sparse attention for long context lengths, etc.
Agree, I recently updated our office's little AI server to use Qwen 3.5 instead of Qwen 3 and the capability has considerably increased, even though the new model has fewer parameters (32b => 27b)
Yesterday I spent some time investigating it:
- Gated DeltaNet (invented in 2024 I think) in Qwen3.5 saves memory for the KV cache so we can afford larger quants
- larger quants => more accurate
- I updated the inference engine to have TurboQuant's KV rotations (2026) => 8-bit KV cache is more accurate
Before, Qwen3 on this humble infra could not properly function in OpenCode at all (wrong tool calls, generally dumb, small context); now Qwen 3.5 can solve 90% of the problems I throw at it.
All that thanks to algorithmic/architectural innovations while actually decreasing the parameter count.
I agree the original poster exaggerated it. But generally models indeed have stopped growing at around 1-1.5 trillion parameters, at least for the last couple of years.
>Even now, I don't know if parameter count stopped mattering or just matters less
Models in the 20b-100b range are already very capable when it comes to basic knowledge, reasoning etc. Improving the architecture and having better training recipes helped decrease the required parameter count considerably (currently 8b models can easily beat the 175b-strong GPT3 from 3 years ago in many domains). What increasing the parameter count currently gives you is better memorization, i.e. better world knowledge without having to consult external knowledge bases, say, using RAG. For example, Qwen3.5 can one-shot compilable code, reason etc. but can't remember the exact API calls for many libraries, while Sonnet 4.6 can. I think what we need is to split models into 2 parts: a "reasoner" and a "knowledge base". I think a reasoner could be pretty static with infrequent updates, and it's the knowledge base part which needs continuous updates (and trillions of parameters). Maybe we could have a system where a reasoner could choose different knowledge bases on demand.
5 years ago was the beginning of 2021, just under a year after GPT3 was released (which was not good at doing anything useful). And that model was 175B params.
GPT4 has been widely rumored to have 1.8 trillion params, which is 10x more, and was released 2 years after this "5 years ago" date that you are using here.
So, to quote yourself here, "This is not true and unfortunately this significantly reduced the credibility of this article for me" /s/article/comment
In late 2021, GLaM had 1.2T parameters. It's difficult to find much use of it in the wild and while the benchmarks it uses are rather outdated, it has a HellaSwag score of 76.6% and WinoGrande of 73.5%. GPT3 had 64.3% and 70.2%.
Meanwhile, Gemma 2 9B, a model from July 2024 with 133x fewer parameters than GLaM, scores 82% and 80.6%. Hellaswag and WinoGrande aren't used in modern benchmarks, probably because they're too easy and largely memorised at this point.
And GPT-4 had 1.8T parameters sure, but it's noticeably worse than any modern model a fraction of the size, and the original incarnation was ridiculously expensive per token. And in any case, its number of parameters was only possible due to its use of mixture-of-experts, which I would definitely classify as a sophisticated architecture as opposed to just throwing more parameters at a vanilla transformer. Even in 2021 GLaM was a MoE because the limits of scaling dense transformers had already been hit.
MoE has made it vastly easier to increase total parameters (and recent open models are really quite large) but it's also hard to compare a MoE with an earlier dense model.
Yeah I also came here to be one of those People In The Comments the author refers to.
Transformers are not magical. They are just a huge improvement over other architectures at the time such as LSTMs and RNNs and even CNNs. They allowed us to throw more and more compute at the problem of next token prediction. And we’ve been riding that horse ever since.
Another big advancement that deserves mentioning is “reasoning” models that have the opportunity to spit out thinking tokens before giving a final answer.
None of this is to say transformers are the most principled approach. But they work.
Transformers' greatest improvement over RNN/LSTM was to enable better parallelization of large-scale training. This is what enabled language models to become "large". But when controlling for overall size, more RNN/LSTM-like approaches seem to be more efficient, as seen e.g. in state space models. The transformer architecture does add some notable capabilities in accounting for long-range dependencies and "needle in a haystack" scenarios, but these are not a silver bullet; they matter in very specific circumstances.
With modern training techniques, RNNs (not just linear SSMs, potentially even vanilla LSTMs) can scale just as well as transformers or even better when it comes to enormous context lengths. Dot-product attention has better performance in a number of domains however (especially for exact retrieval) so the best architectures are likely to remain hybrid for now.
>With modern training techniques, RNNs (not just linear SSMs, potentially even vanilla LSTMs) can scale just as well as transformers or even better when it comes to enormous context lengths.
That's not true. Modern training techniques aren't enough. Vanilla RNNs with modern training techniques still scale poorly. You have to make some pretty big architectural divergences (throwing away recurrency during training) to get an RNN to scale well. None of the big labs seem to be bothered with hybrid approaches.
> That's not true. Modern training techniques aren't enough. Vanilla RNNs with modern training techniques still scale poorly. You have to make some pretty big architectural divergences (throwing away recurrency during training) to get an RNN to scale well.
SSMs move the non-linearity outside of the recurrence which enables parallelisation during training. It is trivial to do this architectural change with an LSTM (see the xLSTM paper). Linear RNNs are still RNNs.
But you can still keep the non-linearity by training with parallel Newton methods, which work on vanilla LSTMs and scale to billions of parameters.
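To make the linear-recurrence point concrete: once the non-linearity is outside the loop, h_t = a_t * h_{t-1} + b_t has a closed form that cumulative products and sums (and hence parallel scans) can compute. A toy scalar sketch, not any particular SSM:

```python
import numpy as np

def sequential(a, b):
    """Reference: the step-by-step recurrence h_t = a_t * h_{t-1} + b_t."""
    h, out = 0.0, []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return np.array(out)

def parallel_form(a, b):
    """Closed form h_t = P_t * sum_{s<=t} b_s / P_s with P_t = a_1*...*a_t,
    expressible via cumprod/cumsum, i.e. parallelisable as a scan.
    Put a tanh inside the loop and this factorisation no longer exists."""
    P = np.cumprod(a)
    return P * np.cumsum(b / P)

a = np.array([0.5, 0.9, 0.8, 0.7])
b = np.array([1.0, 2.0, 0.5, 1.5])
print(np.allclose(sequential(a, b), parallel_form(a, b)))  # True
```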
> None of the big labs seem to be bothered with hybrid approaches.
Does Alibaba not count? Qwen3.5 models are the top performers in terms of small models as far as my tests and online benchmarks go.
>SSMs move the non-linearity outside of the recurrence which enables parallelisation during training. It is trivial to do this architectural change with an LSTM (see the xLSTM paper). Linear RNNs are still RNNs.
Removing the non-linearity from the recurrence path is exactly what constitutes a "pretty big architectural divergence." A linear RNN is an RNN in a structural sense, certainly, but functionally it strips out the non-linear state transitions that made traditional LSTMs so expressive, entirely to enable associative scans. The inductive bias is fundamentally altered. Calling that simply "modern training techniques" is disingenuous at best.
>But you can still keep the non-linearity by training with parallel Newton methods, which work on vanilla LSTMs and scale to billions of parameters.
That does not scale anywhere near as well as Transformers in compute spend. It's paper/research novelty. Nobody will be doing this for production.
>Does Alibaba not count? Qwen3.5 models are the top performers in terms of small models as far as my tests and online benchmarks go.
I guess there's some misunderstanding here because Qwen is 100% a transformer, not a hybrid RNN/LSTM whatever.
> That does not scale anywhere near as well as Transformers in compute spend. It's paper/research novelty. Nobody will be doing this for production.
What exactly makes you so confident?
The world is not just labs that can afford billion dollar datacentres and selling access to SOTA LLMs at $30/Mtokens. Transformers are highly unsuitable for many applications for a variety of reasons and non-linear RNNs trained via parallel methods are an extremely attractive value proposition and will likely feature in production in the next products I work on.
> I guess there's some misunderstanding here because Qwen is 100% a transformer, not a hybrid RNN/LSTM whatever.
See the Qwen3.5 Huggingface description: https://huggingface.co/Qwen/Qwen3.5-27B
> Efficient Hybrid Architecture: Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead.
The dotcom bubble burst and 26 years later we’re all hopelessly addicted to the internet and the top companies on the stock market are almost all what would have been called “dotcoms” then.
The railroad bubble burst in 1846 not because trains were a dead end - passenger numbers would increase more than 10x in the UK in the following 50 years.
It’s absolutely not winner take all. LLMs have become a commodity and the cost of switching models is essentially nil.
Even if ChatGPT has brand recognition amongst lay people, your grandparents aren’t the ones shelling out $200/mo for a Claude code subscription and paying for extra Opus tokens on top of that. Anthropic’s revenue is now neck and neck with OpenAI’s, but if tomorrow they increased the price of Opus by 5x without increasing its capabilities, many would switch to Gemini, GPT 5.4, Cursor, or any cheap Chinese model. In fact I know many engineers that have multiple subscriptions active and switch when they hit the rate limits of one, precisely because the tools are so interchangeable.
At some point it could even become cheaper to just buy 8x H100s and host Qwen/Deepseek/Kimi/etc yourself if you’re one of those companies paying $3k/mo per engineer in tokens.
I have non-tech friends telling me about preferring other models like gemini, this feels like the early days of search engines when people were willing to switch to find better results.
Yep, I have non-tech friends and even younger-generation students talking about how Claude is better at certain tasks or types of homework problems lol.
If it's used as a tool not just search, then people will definitely talk about the other stuff. Students who rely on free tiers will also definitely just have everything bookmarked.
> What if your inquiry needs a combination of multiple sources to make sense? There is no 1:1 matching of information, never.
I don't see the problem if you give the LLM the ability to generate multiple search queries at once. Even simple vector search can give you multiple results at once.
> "How many cars from 1980 to 1985 and 1990 to 1997 had between 100 and 180PS without Diesel in the color blue that were approved for USA and Germany from Mercedes but only the E unit?"
I'm a human and I have a hard time parsing that query. Are you asking only for Mercedes E-Class? The number of cars, as in how many were sold?
I agree with you that simple vector search + context stuffing is dead as a method, but I think it's ridiculous to reserve the term "RAG" for just the earliest most basic implementation. The definition of Retrieval Augmented Generation is any method that tries to give the LLM relevant data dynamically as opposed to relying purely on it memorising training data, or giving it everything it could possibly need and relying on long context windows.
The RAG system you mentioned is just RAG done badly, but doing it properly doesn't require a fundamentally different technique.
> it's ridiculous to reserve the term "RAG" for just the earliest most basic implementation
Whether we like it or not, dumb semantic search became the colloquial definition of RAG.
And when you hear someone saying "we use RAG here" 95% of the time this is exactly what they mean.
When you inject the user's name into the system prompt, technically you're doing RAG - but nobody thinks about it that way. I think it's one of those cases where the colloquial definition is actually more useful than the formal one.
> doing it properly doesn't require a fundamentally different technique
Then what do you call RAG done well? You need a term for it.
> And when you hear someone saying "we use RAG here" 95% of the time this is exactly what they mean.
That's just Sturgeon's law in action: 95% of implementations of anything are crap. Back in the 90s, you might have heard "we use OOP here" and come to a similar conclusion, but that doesn't mean you need to invent a new word for doing OOP properly.
> But agentic RAG is fundamentally different.
From an implementation POV, absolutely not.
I've personally gradually converted a dumb semantic search to a more fully featured agentic RAG in small steps like these:
- Have a separate LLM call write the query instead of just using the user's message.
- Make the RAG search a synthetic injected tool call, instead of appending it to the system prompt.
- Improve the search endpoint by using an LLM to pre-process the data into structured chunks with hierarchical categories, tags, and possible search queries, embedding the search queries separately from the desired information (versus originally just having a raw blob).
- Have the LLM be able to search both with a semantic sentence, and a list of tags.
- Have the LLM view and navigate the hierarchy in a tree-like manner.
- Make the original LLM able to call the search on its own instead of being automatically injected using a separate query rewriting call, letting it search in multiple rounds and refine its own queries.
When did the system go from RAG to "not RAG"? Because fundamentally, all you need to do to make an agentic RAG is to have the LLM be able to write/rewrite its own search queries (possibly in multiple passes) as opposed to just passing the user's messages(s) directly.
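To illustrate how small the step is in code, here is a hypothetical sketch (every name invented, with stub model/search functions standing in for real ones) of the loop that turns one-shot retrieval into agentic retrieval:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Reply:
    text: str
    query: Optional[str] = None  # non-None -> the model asked for a search

def agentic_rag(llm, search, user_msg, max_rounds=5):
    """The model writes (and rewrites) its own queries in a loop,
    instead of one search on the raw user message."""
    messages = [("user", user_msg)]
    for _ in range(max_rounds):
        reply = llm(messages)          # model may request a search...
        if reply.query is None:
            return reply.text          # ...or choose to answer now
        messages.append(("tool", search(reply.query)))  # model-authored query
    return llm(messages + [("user", "answer now")]).text

# Stub model: searches once, then answers (hard-coded here for brevity).
def stub_llm(messages):
    if not any(role == "tool" for role, _ in messages):
        return Reply(text="", query="capital of France")
    return Reply(text="Paris")

def stub_search(query):
    return {"capital of France": "Paris is the capital of France."}.get(query, "")

print(agentic_rag(stub_llm, stub_search, "Where is the Louvre?"))  # Paris
```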
I like the audacity of the parent poster who equates 95% of the implementations he has seen with 95% of all there is, when it could easily have been 0.01% of all there is. The world is much bigger than we think :)
>all you need to do to make an agentic RAG is to have the LLM be able to write/rewrite its own search queries (possibly in multiple passes)
I think this is a huge oversimplification, the term "search query" is doing a lot of heavy lifting here.
When Claude Code calls something like
find . -type d -maxdepth 3 -not -path '*/node_modules/*'
to understand the project hierarchy before doing any of the grep calls, I don't think it's fair to call it just a "search query"; it's more like an "analyze query". Just because text goes in and out in both cases doesn't mean that it's all the same.
When you give the agent the ability to query the nature of the data (e.g. hierarchy), and not just data itself, it means that you need to design your product around it. Agentic RAG has entirely different implementation, product implications, cost, latency, and primarily, outcomes. I don't think it's useful to pretend that it's just a different flavor of the same thing, simply because at the end of the day it's just some text flying over the network.
Some previous techniques for RAG, like directly using a user message’s embedding to do a vector search and stuffing the results in the prompt, are probably obsolete. Newer models work much better if you use tool calls and let them write their own search queries (on an internal database, and perhaps with multiple rounds), and some people consider that “agentic AI” as opposed to RAG. It’s still augmenting generation with retrieved information, just in a more sophisticated way.
Mamba doesn't assume auto-regressive decoding, and you can absolutely use it for diffusion, or pretty much any other common objective. Same with a conventional transformer. For a discrete diffusion language model, the output head is essentially the same as an autoregressive one. But yes, the training/objective/inference setup is different.
You don't even need to go into the pipeline details. The 9800X3D has 8x more L2 cache, 6x more L3 cache, 2x the memory bandwidth than the now 8 years old i9 9900K. 3D V-cache is pretty cool.