- Training on a mixture of fill-in-the-gaps (a few missing words) and denoising (every word slightly corrupted) produces better LLMs than either one alone.
- This advantage (that of using both objectives at once) can be gained with just a little extra training on a model previously trained with only one of them.
This results in 2-4% improvements on most tasks, with a couple of really big improvements (+20%) and one quite surprising one (+60%) on a few BigBench tasks. The large percentage improvements on the BigBench tasks seem to have more to do with the low initial performance than with the new performance being outstanding; the 60% was from 7.6% correct to 12.5% correct.
That seems like an absurd metric for judging improvement on a test. It's wrong at both extremes: going from 1% to 5% is not really a 400% improvement, and going from 99% to 100% is a drastic improvement despite registering as only ~1% by this metric.
Relative error reduction might be a better way to talk about this.
Going from 1% to 5% would be a relative error reduction of ~4%, while going from 99% to 100% would be a 100% reduction in the error.
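That metric is a one-liner to compute; here is a tiny sketch (the function name is mine, just for illustration):

```python
def relative_error_reduction(old_acc, new_acc):
    """Fraction of the remaining error that the improvement eliminates."""
    old_err = 1.0 - old_acc
    new_err = 1.0 - new_acc
    return (old_err - new_err) / old_err

# The two extremes from the comment above:
print(relative_error_reduction(0.01, 0.05))   # ≈ 0.04  (barely dents the error)
print(relative_error_reduction(0.99, 1.00))   # = 1.0   (all remaining error gone)
# The BigBench case discussed earlier, 7.6% -> 12.5% correct:
print(relative_error_reduction(0.076, 0.125)) # ≈ 0.053
```

By this measure the headline "+60%" BigBench result eliminates only about 5% of the remaining error, which matches the point that the raw percentage framing flatters low baselines.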
Original Model Input: "A haiku about a cat baking a cake on a lake."
U-PalM output: "A cat is baking a cake on a lake. The cake is a lie."
Aside from the comedic aspects of this exchange, there's an interesting issue here of getting AI models to distinguish between reality and memes.
Is anybody training a model just on memes? Should that be fed into models which seek to provide outputs for real world factual replies? Are we really this close to creating GLaDOS? So many questions...
Since GLaDOS is a fictional character, I'm not sure we can reasonably argue about whether it is sentient or not. I mean, canonically it either is or isn't; it would just be author fiat.
By "transcending scaling laws" and "improve the scaling properties," do they just mean higher-quality output compared to using the same (or smaller) model size with previous methods?
Yeah this is just a super embellished title. Nothing is being transcended. We’re just optimizing models as per usual in the years after a new architecture is launched.
This sounds very interesting but I lack the technical depth in language models to understand it. In particular I can't parse the following excerpt:
> The key idea is to continue training a state-of-the-art large language model (e.g., PaLM) on a few more steps with UL2's mixture-of-denoiser objective. We show that, with almost negligible extra computational costs and no new sources of data, we are able to substantially improve the scaling properties of large language models on downstream metrics. In this paper, we continue training PaLM with UL2R, introducing a new set of models at 8B, 62B, and 540B scale which we call U-PaLM.
Things I don't understand:
* PaLM (and its advantages/disadvantages relative to other LMs)
* What a "mixture-of-denoiser objective" is
* How "the scaling properties" are measured
I'd be interested in a more accessible summary of how this works, if HN has any references.
> PaLM (and its advantages/disadvantages relative to other LMs)
It's a very similar architecture to other LLMs like GPT-3. I'm not sure, but it perhaps makes minor modifications such as the activation function employed; other than that it's the same. The relevant factors that vary between these models are the training dataset and the parameter count.
> What a "mixture-of-denoiser objective" is
GPT-3 and PaLM are both decoder-only models. That is, they work by trying to predict the next token. BERT, on the other hand, uses an encoder before predicting a token, and this encoder is bidirectional (in the sense that it can see all tokens in the text, not only those that come before the one you are trying to predict). It has been observed that bidirectional models tend to be more sample efficient: being able to see everything while predicting some part of the text boosts model performance during training. I think UL2 is a way to incorporate this insight into decoder-only models. Bidirectionality can be achieved by masking part of the corpus and then asking the model to regenerate it (you can see examples on page 13 of the linked paper).
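To make the masking idea concrete, here is a toy version of T5-style span corruption, one of the denoisers mixed into the UL2 objective (this is my own illustration, not the paper's actual code; the sentinel tokens and helper name are made up):

```python
import random

# Placeholder sentinel tokens, in the style of T5 span corruption.
SENTINELS = ["<X>", "<Y>", "<Z>"]

def span_corrupt(tokens, span_len=2, n_spans=2, seed=0):
    """Mask n_spans contiguous spans with sentinel tokens.
    Input:  the text with each span replaced by a sentinel.
    Target: each sentinel followed by the tokens it hides."""
    rng = random.Random(seed)
    tokens = list(tokens)
    # Resample span start positions until the spans don't overlap.
    while True:
        starts = sorted(rng.sample(range(len(tokens) - span_len + 1), n_spans))
        if all(b - a >= span_len for a, b in zip(starts, starts[1:])):
            break
    inputs, targets, prev = [], [], 0
    for sentinel, s in zip(SENTINELS, starts):
        inputs += tokens[prev:s] + [sentinel]
        targets += [sentinel] + tokens[s:s + span_len]
        prev = s + span_len
    inputs += tokens[prev:]
    return inputs, targets

toks = "a cat is baking a cake on a lake".split()
inp, tgt = span_corrupt(toks)
print("input: ", " ".join(inp))
print("target:", " ".join(tgt))
```

The model sees the whole corrupted input at once when reconstructing each span, which is where the bidirectional context comes from; ordinary next-token prediction only ever conditions on the left.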
> How "the scaling properties" are measured
In this case, they measured it by training three PaLM models (one with 8B parameters, another with 62B, and the last with 540B). They observe a positive trend between parameter count and accuracy on multiple downstream tasks. By training the 62B PaLM with UL2, they observe that it achieves the same accuracy as the 540B model. That's what they mean by improving scaling properties.
> Impressively, at 540B scale, we show an approximately 2x computational savings rate where U-PaLM achieves the same performance as the final PaLM 540B model at around half its computational budget (i.e., saving ∼4.4 million TPUv4 hours).
Are you allowed to call your own work impressive in your abstract? Cool work, but that line is "transcendent."
Anyways, aren't scaling laws more like O(whatever) asymptotics? Like if you reduce your sorting algo from 6.4 n^2 seconds to 3.2 n^2, you don't say you "transcended the scaling laws," even though you sped it up a very significant amount. Am I misunderstanding?
They talk specifically of the scaling laws investigated in papers like OpenAI's "Scaling Laws for Neural Language Models" (2020)[1] and DeepMind's "An empirical analysis of compute-optimal large language model training" (2022)[2]. These showed that, as far as we could see at the time, for any sensibly designed language model, nuances of architecture do not matter nearly as much as the sheer scale of the model, the amount of compute spent training it, and the volume of the training dataset (indeed, they may not matter at all, and attempts to add cleverness via architecture may ultimately handicap the model at production scale). These papers introduced so-called "scaling laws": predictable, robust relationships between scaling parameters and loss (which is meaningfully related to textual coherence and apparent "understanding" in zero-shot tasks). For many, they heralded the age of throwing compute at the problem without academic tinkering, and possibly the final stretch before human-level text processing. See also Rich Sutton's Bitter Lesson [3].
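Concretely, those papers fit power laws of the form L ≈ a·N^(-b) between loss and model size, which are straight lines in log-log space. A minimal sketch of such a fit (the data points below are made up for illustration, not taken from any paper):

```python
import math

# Hypothetical loss vs. parameter-count points roughly following a power law.
params = [1e8, 1e9, 1e10, 1e11]   # model sizes (parameters)
losses = [3.5, 2.9, 2.4, 2.0]     # illustrative validation losses

# A power law L = a * N^(-b) is linear in the logs: log L = log a - b * log N,
# so fit it with ordinary least squares on (log N, log L).
xs = [math.log(n) for n in params]
ys = [math.log(l) for l in losses]
k = len(xs)
mx, my = sum(xs) / k, sum(ys) / k
b = -sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = math.exp(my + b * mx)
print(f"fitted law: L ≈ {a:.2f} * N^(-{b:.3f})")
```

"Breaking" a scaling law in the paper's sense means landing consistently below the curve such a fit predicts, rather than just moving along it by adding parameters or compute.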
In this sense it is remarkable when the curve predicted by "scaling laws" gets broken through, transcended, since it suggests there still exists a better Pareto frontier for text prediction. I see how this can appear normal; but the thing is, scaling laws have been well validated. We are no longer in the early era where one would expect to reap substantial returns from something as simple as changing the training objective.