Transformers Are Bayesian Networks

warypet · 2026-03-25T02:48:00 1774406880

I found this earlier today when looking for research and ended up reporting it for citing fake sources. Please correct me if I'm wrong, but I couldn't find "[9] Jongsuk Jung, Jaekyeom Kim, and Hyunwoo J. Choi. Rethinking attention as belief propagation. In International Conference on Machine Learning (ICML), 2022." anywhere else on the internet

kurthr · 2026-03-25T03:19:50 1774408790

Yep, nothing by even a subset of those authors. Closest paper from that Conference:

Rethinking Attention-Model Explainability through Faithfulness Violation Test Yibing Liu, Haoliang Li, Yangyang Guo, Chenqi Kong, Jing Li, Shiqi Wang

https://proceedings.mlr.press/v162/liu22i.html

https://icml.cc/virtual/2022/spotlight/18082

measurablefunc · 2026-03-25T03:08:03 1774408083

It's "vibe" research. Most of it is basically pure nonsense.

kleiba · 2026-03-25T03:22:07 1774408927

Care to elaborate?

refulgentis · 2026-03-25T03:38:01 1774409881

The headline theorem, "every sigmoid transformer is a Bayesian network," is proved by `rfl` [1]. For non-Lean people: `rfl` means "both sides are the same expression." He defines a transformer forward pass, then defines a BP forward pass with the same operations, wraps the weights in a struct called `implicitGraph`, and Lean confirms they match. They match because he wrote them to match.

The repo with a real transformer model (transformer-bp-lean) has 22 axioms and 7 theorems. In Lean, an axiom is something you state without proving. The system takes your word for it. Here the axioms aren't background math, they're the paper's claims:

- "The FFN computes the Bayesian update" [2]. Axiom.

- "Attention routes neighbors correctly" [3]. Axiom.

- "BP converges" [4]. Axiom, with a comment saying it's "not provable in general."

- The no-hallucination corollary [5]. Axiom.

The paper says "formally verified against standard mathematical axioms" about all of these. They are not verified. They are assumed.

The README suggests running `grep -r "sorry"` and finding nothing as proof the code is complete. In Lean, `sorry` means "I haven't proved this" and throws a compiler warning. `axiom` also means "I haven't proved this" but doesn't warn. So the grep returns clean while 22 claims sit unproved. Meanwhile the godel repo has 4 actual sorries [6] anyway, including "logit and sigmoid are inverses," which the paper treats as proven. That same fact appears as an axiom in the other repo [7]. Same hole, two repos, two different ways to leave it open.

Final count across all five repos: 65 axioms, 5 sorries, 149 theorems.

Claude (credited on page 1) turned it into "Not an approximation of it. Not an analogy to it. The computation is belief propagation." Building to a 2-variable toy experiment on 5K parameters presented as the fulfillment of Leibniz's 350-year-old dream. Ending signed by "Calculemus."

[1] https://github.com/gregorycoppola/sigmoid-transformer-lean/b...

[2] https://github.com/gregorycoppola/transformer-bp-lean/blob/7...

[3] https://github.com/gregorycoppola/transformer-bp-lean/blob/7...

[4] https://github.com/gregorycoppola/transformer-bp-lean/blob/7...

[5] https://github.com/gregorycoppola/transformer-bp-lean/blob/7...

[6] https://github.com/gregorycoppola/godel/blob/bc1d138/Godel/O...

[7] https://github.com/gregorycoppola/sigmoid-transformer-lean/b...

kleiba · 2026-03-25T20:35:49 1774470949

Thanks for writing such an elaborate reply! I wish I was familiar with Lean, so I could follow. But if you're right, it would put the claims of the paper in a totally different light.

Perhaps others with knowledge in Lean could also chime in?

refulgentis · 2026-03-26T19:30:24 1774553424

Doubtful:

- articles two days old

- I got links right to the code

- its clearly a waste of time if you know Lean, I went way above and beyond already

Maybe if you were able to show "no actually > 0 of this is well founded", someone might be tempted. But you'd need someone who showed up days later, knows enough Lean to validate for you, yet, not enough Lean to know it's a joke just from looking at the links.

You're welcome! Don't mean to be mean (pun intended), hope you don't read it that way. Just, figured it'd give you some food for thought re: exactly how much work you can expect from other people, and that you might need to set more constraints on an "idk, can someone else tell me more?" reaction than "one person said something, but someone else said they're wrong, so score is 1 to 1"

kleiba · 2026-03-27T03:16:26 1774581386

Thanks again - this time I have to admit I really don't get what you're trying to say?!

refulgentis · 2026-03-27T04:55:32 1774587332

Sorry, I was unclear!

You said you wished someone with Lean knowledge could check my work. I'm saying: you can check it yourself, right now, without knowing Lean.

Click any of links [2] through [5]. You'll see the word `axiom` in the code. In Lean, `axiom` means "assume this is true without proof." That's it. That's the whole critique. The paper says "formally verified," but the key claims — FFN computes Bayesian update, attention routes correctly, BP converges, no hallucination — are all just assumed.

You don't need to take my word for it, and you don't need a Lean expert to break a tie. The evidence is right there in the links. It'd be like a paper claiming "we formally proved this bridge is safe" and the proof says "Axiom: this bridge is safe." You don't need a civil engineer to tell you that's not a proof.

jack_pp · 2026-03-25T03:31:11 1774409471

I suspect it means it's LLM generated without it being checked

handedness · 2026-03-25T19:02:35 1774465355

The Coefficient of Sketch here feels pretty high: https://xcancel.com/gregcoppola5d

getnormality · 2026-03-25T02:44:47 1774406687

> Transformers are the dominant architecture in AI, yet why they work remains poorly understood. This paper offers a precise answer: a transformer is a Bayesian network.

Why would being a Bayesian network explain why transformers work? Bayesian networks existed long before transformers and never achieved their performance.

Mithriil · 2026-03-25T03:09:47 1774408187

Bayesian network is a really general concept. It applies to all multidimensional probability distribution. It's a graph that encodes independence between variables. Ish.

I have not taken the time to review the paper, but if the claim stands, it means we might have another tool to our toolbox to better understand transformers.

malcolmgreaves · 2026-03-21T11:36:55 1774093015

> Hallucination is not a bug that scaling can fix. It is the structural consequence of operating without concepts.

NNs are as close to continuous as we can get with discrete computing. They’re flexible and adaptable and can contain many “concepts.” But their chief strength is also their chief weakness: these “concepts” are implicit. I wonder if we can get a hybrid architecture that has the flexibility of NNs while retaining discrete concepts like a knowledge base does.

AdieuToLogic · 2026-03-25T03:29:06 1774409346

> NNs are as close to continuous as we can get with discrete computing.

This is incorrect. For example, fuzzy logic[0] can model analog ("continuous") truth beyond discrete digital representations, such as 1/0, true/false, etc.

0 - https://en.wikipedia.org/wiki/Fuzzy_logic

measurablefunc · 2026-03-25T02:25:53 1774405553

There is nothing continuous on the computer, it's all bit strings & boolean arithmetic. The semantics imposed on the bit strings does not exist anywhere in the arithmetic operations, i.e. there is no arithmetic operation corresponding to something as simple as the color red.

kelseyfrog · 2026-03-25T02:29:08 1774405748

It sounds like you're saying that if a computer had infinite precision then hallucinations would not occur?

measurablefunc · 2026-03-25T02:34:54 1774406094

The way neural networks work is that the base neural network is embedded in a sampling loop, i.e. a query is fed into the network & the driver samples output tokens to append to the query so that it can be re-fed back into the network (q → nn → [a, b, c, ...] → q + sample([a, b, c, ...])). There is no way to avoid hallucinations b/c hallucinations are how the entire network works at the implementation level. The precision makes no difference b/c the arithmetic operations are semantically void & only become meaningful after they are interpreted by someone who knows to associated 1 /w red, 2 w/ blue, 3 w/ clouds, & so on & so forth. The mapping between the numbers & concepts does not exist in the arithmetic.

kelseyfrog · 2026-03-25T02:42:54 1774406574

Oh, I thought that the embedding space of the residual stream was precisely that.

measurablefunc · 2026-03-25T02:47:13 1774406833

The arithmetic is meaningless, it doesn't matter what you call it b/c on the computer it's all bit strings & boolean arithmetic. You can call some sequence of operations residual & others embeddings but that is all imposed top-down. There is nothing in the arithmetic that indicates it is somehow special & corresponds to embeddings or residuals.

kelseyfrog · 2026-03-25T02:52:03 1774407123

Ah ok, so if we had such a mapping then models wouldn't hallucinate?

measurablefunc · 2026-03-25T02:58:34 1774407514

Maybe it's better if you define the terms b/c what I mean by hallucination is that the arithmetic operations + sampling mean that it's all hallucinations. The output is a trajectory of a probabilistic computation over some set of symbols (0s & 1s). Those symbols are meaningless, the only reason they have meaning is b/c everyone has agreed that the number 97 is the ascii code for "a" & every conformant text processor w/ a conformant video adapter will convert 97 (0b1100001) into the display pattern for the letter "a".

kelseyfrog · 2026-03-25T03:16:37 1774408597

So kind of like if you flip a coin, the sampling means the heads or tails you get isn't real?

measurablefunc · 2026-03-25T17:14:10 1774458850

It's when you define heads or tails however you want & then tell me you have objective semantics for each side of the coin when all you've really done is established a convention about which side is which. The coin is real, what you call each side is a convention & what semantics you attach to a sequence of flips is also a convention that has nothing to do with the reality of the coin.

kelseyfrog · 2026-03-25T20:55:26 1774472126

I'm struggling to differentiate that from how we use coinflips normally. We can pretty easily create arbitrary mappings and then sample from the binomial in a way that has meaning far beyond just heads or tails. Maybe I'm not quite understanding.

measurablefunc · 2026-03-25T21:27:34 1774474054

Which part are you confused about? Symbols are meaningless until someone imposes semantics on them. There is nothing meaningful about arithmetic in a neural network other than whatever conventions are imposed on the binary sequences, same way 97 has no meaning other than the conventional agreement that it is the ascii code point for "a".

kelseyfrog · 2026-03-25T23:02:43 1774479763

I guess I don't get the main idea. Chemical reactions in our brains are semantically void and yet we're able to use it as substrate for thinking.

measurablefunc · 2026-03-25T23:36:22 1774481782

This has nothing to do with chemical reactions. The discussion was about symbols and arithmetic. But in any event, this discussion has run its course so good luck: https://chatgpt.com/s/t_69c473e1f71c8191a4ed1e3e2dbdef83

kelseyfrog · 2026-03-26T00:43:50 1774485830

Yes! Excellent example of an ungrounded response, a hallucination.

measurablefunc · 2026-03-26T00:55:09 1774486509

Also a demonstration of your rhetoric.

naasking · 2026-03-25T04:20:33 1774412433

> The semantics imposed on the bit strings does not exist anywhere in the arithmetic operations,

Correct, the semantics is actually in the network of relations between the nodes. That has been one of the major lessons of LLMs, and it validates the systems response to Searle's Chinese Room.

westurner · 2026-03-20T21:42:33 1774042953

https://news.ycombinator.com/item?id=45256179 :

> Which statistical models disclaim that their output is insignificant if used with non-independent features? Naieve Bayes [...]

Ironic then, because if transformers are Bayesian networks then we're using Bayesian networks for non-independent features.

From "Quantum Bayes' rule and Petz transpose map from the minimum change principle" (2025) https://news.ycombinator.com/item?id=45074143 :

> Petz recovery map: https://en.wikipedia.org/wiki/Petz_recovery_map :

> In quantum information theory, a mix of quantum mechanics and information theory, the Petz recovery map can be thought of as a quantum analog of Bayes' theorem

But there aren't yet enough qubits for quantum LLMs: https://news.ycombinator.com/item?id=47203219#47250262

"Transformer is a holographic associative memory" (2025) https://news.ycombinator.com/item?id=43028710#43029899

wklm · 2026-03-25T08:44:10 1774428250

I like their definition of hallucination