"We have demonstrated that it is impossible to describe all aspects of physical reality using a computational theory of quantum gravity," says Dr. Faizal. "Therefore, no physically complete and consistent theory of everything can be derived from computation alone. Rather, it requires a non-algorithmic understanding, which is more fundamental than the computational laws of quantum gravity and therefore more fundamental than spacetime itself."
Seems like quantum gravity theory might be missing something, no?
It's such a silly idea that whatever is simulating us would be in any way similar to, or care about, what's possible in our universe. It's like a Game of Life glider thinking it can't be simulated because someone would have to know what's beyond the neighbouring cells, and that's impossible! But the host universe just keeps chugging along, unimpressed by our proofs.
>If you're saying that people always try to game the system, whatever it is, then I agree however.
This isn't even true either. In the past there was a huge emphasis on, and effort made toward, character: going out of your way to do the right thing, being helpful, and NOT getting special treatment but choosing the difficult path.
Now everything is the opposite: it's about getting as much special treatment as possible and shirking as much responsibility as possible. And this isn't just individuals; it runs throughout the corporate and political system as well.
I'm aware. See, for instance, VC Arielle Zuckerberg's comment that when deciding which founders to fund she looks for "a little of the rizz and a little of the tis," with "rizz" referring to charisma and "tis" to autism.
One could argue that mythologizing a particular characteristic is itself a form of stigma.
"Which it be, ye scallywag, but nay—mark me well—’tis no tavern’s tall tale. Nay, ’tis truth carved in bone and blood, a story black as the Devil’s beard, yet truthful enough to make Death himself chuckle in his coffin."
(╯︵╰,)
/ \
₿
There has been plenty of evidence over the years. I don't have my bibliography handy right now, but you can find it by searching for sparse training or lottery ticket hypothesis papers.
The intuition is that ANNs make better predictions on high-dimensional data, that sparse-weight training can learn the sparsity pattern as you train the weights, that the effective part of dense models is actually sparse (cf. pruning/sparsification research), and that dense models grow too much in compute complexity to further increase model dimension sizes.
If you can share that bibliography I'd love to read it. I have the same intuition, and a few papers seem to support it, but more papers, and more explicit ones, would be much better.
What do you mean by "work better" here? If it's better accuracy, then no, they are not better at the same weight dimensions.
The big thing is that sparse models allow you to train models with significantly larger dimensionality, blowing up the dimensions by several orders of magnitude. That more dimensions lead to better results does not seem to be under a lot of contention; the open questions are more about quantifying that. It's simply not shown experimentally because the hardware is not there to train it.
> The big thing is that sparse models allow you to train models with significantly larger dimensionality, blowing up the dimensions by several orders of magnitude.
Do you have any evidence to support this statement? Or are you imagining some not yet invented algorithms running on some not yet invented hardware?
Sparse matrices can increase in dimension while keeping the same number of non-zeroes; that part is self-evident. Sparse-weight models can be trained: you are probably already aware of RigL and SRigL, and there is other related work on unstructured and structured sparse training. You could argue that those adapt their algorithms to be executable on GPUs and that none of them train at 100x or 1000x dimensions. Yes, that is the part that requires access to sparse compute hardware acceleration, which exists as prototypes [1] or is extremely expensive (Cerebras).
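For what it's worth, here's a minimal numpy/scipy sketch (mine, purely illustrative) of the first point: the dimension can grow by orders of magnitude while the number of non-zeroes, and hence the SpMV work, stays fixed.

```python
# Illustrative only: fixed non-zero budget, growing dimension.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
nnz = 100_000  # fixed budget of nonzero weights

for d in (10_000, 100_000, 1_000_000):
    rows = rng.integers(0, d, size=nnz)
    cols = rng.integers(0, d, size=nnz)
    vals = rng.standard_normal(nnz).astype(np.float32)
    W = sp.coo_matrix((vals, (rows, cols)), shape=(d, d)).tocsr()
    x = rng.standard_normal(d).astype(np.float32)
    y = W @ x  # SpMV work scales with nnz, not with d*d
    print(f"d={d:>9,}  stored nonzeros={W.nnz:,}  dense entries={d*d:,}")
```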
Unstructured sparsity cannot be implemented in hardware efficiently if you still want to do matrix multiplication. If you don’t want to do matrix multiplication you first need to come up with new algorithms, tested in software. This reminds me of what Numenta tried to do with their SDRs - note they didn’t quite succeed.
> Unstructured sparsity cannot be implemented in hardware efficiently if you still want to do matrix multiplication.
Hard disagree. It certainly is an order of magnitude harder to design hardware for sparse x sparse matrix multiplication, yes; it requires a paradigm shift to do sparse compute efficiently, but there are hardware architectures, both in research and commercially available, that do it efficiently. The same kind of architecture is needed to scale op-graph compute. You see solutions at the smaller scale in FPGAs and reconfigurable/dataflow accelerators, and at larger scale in Intel's PIUMA and Cerebras. I've been involved in co-design work of GraphBLAS on the software side and one of the aforementioned hardware platforms: the main issue with developing SpMSpM hardware lies more with the necessary capital and engineering investments being prioritized toward current frontier AI model accelerators, not with a lack of proven results.
All of the best open source LLMs right now use mixture-of-experts, which is a form of sparsity. They only use a small fraction of their parameters to process any given token.
Mixture of experts sparsity is very different from weight sparsity. In a mixture of experts, all weights are nonzero, but only a small fraction get used on each input. On the other hand, weight sparsity means only very few weights are nonzero, but every weight is used on every input. Of course, the two techniques can also be combined.
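To make the distinction concrete, here's a tiny numpy sketch (mine, not from any of the papers discussed): the first layer is weight-sparse, the second is a toy top-1 MoE.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 8, 4
x = rng.standard_normal(d)

# Weight sparsity: ~90% of W's entries are exactly zero, but every
# remaining nonzero weight participates in every forward pass.
W = rng.standard_normal((d, d)) * (rng.random((d, d)) < 0.1)
y_weight_sparse = W @ x

# MoE sparsity: every expert is dense (all weights nonzero), but only the
# routed expert is used for this particular input token.
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router = rng.standard_normal((n_experts, d))
chosen = int(np.argmax(router @ x))   # top-1 routing
y_moe = experts[chosen] @ x           # the other experts do no work here
```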
Correct. I was more focused on giving an example of sparsity being useful in general, because the comment I was replying to didn't specifically mention which kind of sparsity.
For weight sparsity, I know the BitNet 1.58 paper has some claims of improved performance by restricting weights to be either -1, 0, or 1, eliminating the need for multiplying by the weights, and allowing the weights with a value of 0 to be ignored entirely.
Another kind of sparsity, while we're on the topic, is activation sparsity. I think there was an Nvidia paper that used a modified ReLU activation function to set more of the model's activations to 0.
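Roughly, in code (my own sketch; not how either paper actually implements it), the two ideas look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
x = rng.standard_normal(d)

# Ternary (BitNet-style) weights in {-1, 0, +1}: the matvec needs only
# additions/subtractions, and zero weights can be skipped entirely.
W = rng.choice([-1, 0, 1], size=(d, d), p=[0.25, 0.5, 0.25])
y = W @ x  # conceptually: sum(x[j] where W[i,j]=+1) - sum(x[j] where W[i,j]=-1)

# Activation sparsity via a thresholded ReLU: small activations are
# forced to exactly zero so downstream layers can skip them.
def thresholded_relu(a, tau=0.5):
    return np.where(a > tau, a, 0.0)

h = thresholded_relu(y)
print(f"zero weights: {np.mean(W == 0):.0%}, zero activations: {np.mean(h == 0):.0%}")
```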
“Useful” does not mean “better”. It just means “we could not do dense”. All modern state of the art models use dense layers (both weight and inputs). Quantization is also used to make models smaller and faster, but never better in terms of quality.
Based on all examples I’ve seen so far in this thread it’s clear there’s no evidence that sparse models actually work better than dense models.
Yes, mixture of experts is basically structured activation sparsity. You could imagine concatenating the expert matrices into a huge block matrix and multiplying by an input vector where only the coefficients corresponding to activated experts are nonzero.
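A small numpy sketch of that block-matrix view (my own construction, assuming top-1 routing and identically shaped experts):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

# Concatenate the experts into one big block matrix of shape (d, n_experts*d).
W_big = np.concatenate(experts, axis=1)

x = rng.standard_normal(d)
chosen = 2  # pretend the router picked expert 2 for this token

# Block-sparse input vector: only the chosen expert's block is nonzero.
x_big = np.zeros(n_experts * d)
x_big[chosen * d:(chosen + 1) * d] = x

# The huge "dense" product equals just running the chosen expert.
assert np.allclose(W_big @ x_big, experts[chosen] @ x)
```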
From that perspective, it's disappointing that the paper only enforces modest amounts of activation sparsity, since holding the maximum number of nonzero coefficients constant while growing the number of dimensions seems like a plausible avenue to increase representational capacity without correspondingly higher computation cost.
EDIT: don't have time to write it up, but here's gemini 3 with a short explanation:
To simulate the brain's efficiency using Transformer-like architectures, we would need to fundamentally alter three layers of the stack: the *mathematical representation* (moving to high dimensions), the *computational model* (moving to sparsity), and the *physical hardware* (moving to neuromorphic chips).
Here is how we could simulate a "Brain-Like Transformer" by combining High-Dimensional Computing (HDC) with Spiking Neural Networks (SNNs).
### 1. The Representation: Hyperdimensional Computing (HDC)
Current Transformers use "dense" embeddings—e.g., a vector of 4,096 floating-point numbers (like `[0.1, -0.5, 0.03, ...]`). Every number matters.
To mimic the brain, we would switch to *Hyperdimensional Vectors* (e.g., 10,000+ dimensions), but make them *binary and sparse*.
* **Holographic Representation:** In HDC, concepts (like "cat") are stored as massive randomized vectors of 1s and 0s. Information is distributed "holographically" across the entire vector. You can cut the vector in half, and it still retains the information (just noisier), similar to how brain lesions don't always destroy specific memories.
* **Math without Multiplication:** In this high-dimensional binary space, you don't need expensive floating-point matrix multiplication. You can use simple bitwise operations (a minimal sketch follows this list):
* **Binding (Association):** XOR operations (`A ⊕ B`).
* **Bundling (Superposition):** Majority rule (voting).
* **Permutation:** Bit shifting.
* **Simulation Benefit:** This allows a Transformer to manipulate massive "context windows" using extremely cheap binary logic gates instead of energy-hungry floating-point multipliers.
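A minimal sketch of those three operations on 10,000-bit binary hypervectors (illustrative only; real HDC/VSA systems vary in the details):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # hypervector dimensionality

def random_hv():
    return rng.integers(0, 2, size=D, dtype=np.uint8)

def bind(a, b):           # binding / association
    return a ^ b          # bitwise XOR

def bundle(*vs):          # bundling / superposition
    return (np.sum(vs, axis=0) * 2 > len(vs)).astype(np.uint8)  # majority vote

def permute(a, k=1):      # permutation, e.g. for sequence positions
    return np.roll(a, k)  # cyclic bit shift

def similarity(a, b):     # 1.0 = identical, ~0.5 = unrelated
    return float(np.mean(a == b))

animal, cat, dog = random_hv(), random_hv(), random_hv()
memory = bundle(bind(animal, cat), bind(animal, dog), permute(random_hv()))

# Unbinding with XOR recovers a noisy-but-recognizable "cat" from the bundle:
noisy_cat = bind(memory, animal)
print(similarity(noisy_cat, cat), similarity(noisy_cat, random_hv()))  # ~0.75 vs ~0.5
```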
### 2. The Architecture: "Spiking" Attention Mechanisms
Standard Attention is $O(N^2)$ because it forces every token to query every other token. A "Spiking Transformer" simulates the brain's "event-driven" nature.
* **Dynamic Sparsity:** Instead of a dense matrix multiplication, neurons would only "fire" (send a signal) if their activation crosses a threshold. If a token's relevance score is low, it sends *zero* spikes. The hardware performs *no* work for that connection.
* **The "Winner-Take-All" Circuit:** In the brain, inhibitory neurons suppress weak signals so only the strongest "win." A simulated Sparse Transformer would replace the Softmax function (which technically keeps all values non-zero) with a **k-Winner-Take-All** function (see the sketch after this list).
* *Result:* The attention matrix becomes 99% empty (sparse). The system only processes the top 1% of relevant connections, similar to how you ignore the feeling of your socks until you think about them.
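Here's a rough single-head numpy sketch of that idea (mine; the exact formulation in spiking-transformer papers differs): each query keeps only its top-k scores, and everything else is zeroed before normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 8, 16, 2   # sequence length, head dim, "winners" per query

Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

scores = Q @ K.T / np.sqrt(d)             # (N, N) relevance scores

# k-Winner-Take-All: keep the top-k scores in each row, drop the rest.
kth_largest = np.sort(scores, axis=1)[:, [-k]]
sparse_scores = np.where(scores >= kth_largest, scores, -np.inf)

# Normalize only over the winners; every other weight is exactly zero.
weights = np.exp(sparse_scores - sparse_scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

out = weights @ V
print("nonzero attention weights per query:", (weights > 0).sum(axis=1))
```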
### 3. The Hardware: Neuromorphic Substrate
Even if you write sparse code, a standard GPU (NVIDIA H100) is bad at running it. GPUs like dense, predictable blocks of numbers. To simulate the brain efficiently, we need *Neuromorphic Hardware* (like Intel Loihi or IBM NorthPole).
* **Address Event Representation (AER):** Instead of a "clock" ticking every nanosecond forcing all neurons to update, the hardware is asynchronous. It sits idle (consuming nanowatts) until a "spike" packet arrives at a specific address (a toy sketch follows this list).
* **Processing-in-Memory (PIM):** To handle the high dimensionality (e.g., 100,000-dimensional vectors), the hardware moves the logic gates *inside* the RAM arrays. This eliminates the energy cost of moving those massive vectors back and forth.
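As a toy illustration of the event-driven idea (my own sketch, nothing like real Loihi/NorthPole firmware): work happens only when a spike event arrives at an address, not on every clock tick.

```python
from collections import defaultdict

# Sparse event stream: (timestamp, source neuron address)
events = [(0.1, 3), (0.4, 7), (0.9, 3), (1.3, 12)]

# Static fan-out: which target addresses each spiking neuron drives, with weights.
fanout = {3: [(5, 0.8), (9, -0.2)], 7: [(5, 0.5)], 12: [(1, 1.0)]}

potential = defaultdict(float)  # membrane potentials, updated lazily
THRESHOLD = 1.0

for t, src in events:           # between events: no work, (ideally) no energy
    for dst, w in fanout.get(src, []):
        potential[dst] += w
        if potential[dst] > THRESHOLD:
            print(f"t={t}: neuron {dst} fires")
            potential[dst] = 0.0
```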
### Summary: The Hypothetical "Spiking HD-Transformer"
I’m not sure why you’re talking about efficiency when the question is “do sparse models work better than dense models?” The answer is no, they don’t.
Even the old LTH paper you cited trains a dense model and then tries to prune it without too much quality loss. Pruning is a well-known method to compress models - to make them smaller and faster, not better.
Before we had proper GPUs everyone said the same thing about Neural Networks.
Current model architectures are optimized to get the most out of GPUs, which is why we have transformers dominating as they're mostly large dense matrix multiplies.
There's plenty of work showing transformers improve with inner dimension size, but it's not feasible to scale them up further because it blows up parameter and activation sizes (including KV caches), so people turn to low-rank ("sparse") decompositions like MLA.
The lottery ticket hypothesis shows that most of the weights in current models are redundant and that we could get away with much smaller sparse models, but currently there's no advantage to doing so because on GPUs you still end up doing dense multiplies.
Yes, we know that large dense layers work better than small dense layers (up to a point). We also know how to train large dense models and then prune them. But we don’t know how to train large sparse models to be better than large dense models. If someone figures it out then we can talk about building hardware for it.
It isn't directly what you are asking for, but there is a similar relationship at work with respect to L_1 versus L_2 regularization. The number of samples required to train a model is O(log(d)) for L_1 and O(d) for L_2, where d is the dimensionality [1]. This relates to the standard random-matrix results about how you can approximate high-dimensional vectors in a log(d)-dimensional space with (probably) small error.
At a very handwaving level, it seems reasonable that moving from L_1 to L_0 would have a similar relationship in learning complexity, but I don't think that has ever been addressed formally.
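Not a proof of anything, but a quick empirical toy (mine, not from [1]) showing the flavor of the gap: with a k-sparse ground truth and far fewer samples than dimensions, L_1 recovers the signal much better than L_2.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
d, k, n = 1000, 10, 200          # dimension, true sparsity, samples (n << d)

w_true = np.zeros(d)
w_true[rng.choice(d, k, replace=False)] = rng.standard_normal(k)

X = rng.standard_normal((n, d))
y = X @ w_true + 0.01 * rng.standard_normal(n)

for name, model in [("L1 (lasso)", Lasso(alpha=0.01, max_iter=10_000)),
                    ("L2 (ridge)", Ridge(alpha=1.0))]:
    w_hat = model.fit(X, y).coef_
    err = np.linalg.norm(w_hat - w_true) / np.linalg.norm(w_true)
    print(f"{name}: relative error {err:.2f}")
```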
If you consider it in absolute values, it makes sense. Bezos could give me a billion dollars, which would match my wealth with Pichai's, and he'd still have 199 billion dollars.
Yes, if you have a billion dollars then in terms of wealth Pichai is closer to you than to Bezos. But if you’re a typical HN reader (level 4 or 5), the difference between you and Pichai is pretty much infinite, while Pichai and Bezos are almost the same (relative to you): both are ultra rich.
If Google is not willing to scale it up, then why would anyone else?