- Training on a mixture of fill-in-the-gaps (a few missing words) and denoising (every word slightly corrupted) produces better LLMs than either one alone.
- This advantage (that of using both objectives at once) can be gained with just a little extra training on a model previously trained with only one of them.
This results in 2-4% improvements on most tasks, with a couple of really big improvements (+20%) and one quite surprising one (+60%) on a few BigBench tasks. The large percentage improvements on the BigBench tasks seem to have more to do with the low initial performance than with the new performance being outstanding; the 60% was from 7.6% correct to 12.5% correct.
That seems like an absurd metric for judging improvement on a test. It's wrong at both extremes: going from 1% to 5% is not really a 400% improvement, and going from 99% to 100% is a drastic improvement despite registering as only ~1% by this metric.
Relative error reduction might be a better way to talk about this.
Going from 1% to 5% would be a relative error reduction of ~4%, while going from 99% to 100% would be a 100% reduction in the error.
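That metric is a one-liner to compute; here is a tiny sketch (the function name is mine, just for illustration):

```python
def relative_error_reduction(old_acc, new_acc):
    """Fraction of the remaining error that the improvement eliminates."""
    old_err = 1.0 - old_acc
    new_err = 1.0 - new_acc
    return (old_err - new_err) / old_err

# The two extremes from the comment above:
print(relative_error_reduction(0.01, 0.05))   # ≈ 0.04  (barely dents the error)
print(relative_error_reduction(0.99, 1.00))   # = 1.0   (all remaining error gone)
# The BigBench case discussed earlier, 7.6% -> 12.5% correct:
print(relative_error_reduction(0.076, 0.125)) # ≈ 0.053
```

By this measure the headline "+60%" BigBench result eliminates only about 5% of the remaining error, which matches the point that the raw percentage framing flatters low baselines.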
Original Model Input: "A haiku about a cat baking a cake on a lake."
U-PalM output: "A cat is baking a cake on a lake. The cake is a lie."
Aside from the comedic aspects of this exchange, there's an interesting issue here of getting AI models to distinguish between reality and memes.
Is anybody training a model just on memes? Should that be fed into models which seek to provide outputs for real world factual replies? Are we really this close to creating GLaDOS? So many questions...
Since GLaDOS is a fictional character, I'm not sure we can reasonably argue about whether it is sentient or not. I mean, canonically it either is or isn't; it would just be author fiat.
By "transcending scaling laws" and "improve the scaling properties," do they just mean higher-quality output compared to using the same (or smaller) model size with previous methods?
Yeah this is just a super embellished title. Nothing is being transcended. We’re just optimizing models as per usual in the years after a new architecture is launched.
This sounds very interesting but I lack the technical depth in language models to understand it. In particular I can't parse the following excerpt:
> The key idea is to continue training a state-of-the-art large language model (e.g., PaLM) on a few more steps with UL2's mixture-of-denoiser objective. We show that, with almost negligible extra computational costs and no new sources of data, we are able to substantially improve the scaling properties of large language models on downstream metrics. In this paper, we continue training PaLM with UL2R, introducing a new set of models at 8B, 62B, and 540B scale which we call U-PaLM.
Things I don't understand:
* PaLM (and its advantages/disadvantages relative to other LMs)
* What a "mixture-of-denoiser objective" is
* How "the scaling properties" are measured
I'd be interested in a more accessible summary of how this works, if HN has any references.
> PaLM (and its advantages/disadvantages relative to other LMs)
It's a very similar architecture to other LLMs like GPT-3. I'm not sure, but it perhaps makes minor modifications such as the activation function employed; other than that it's the same. The relevant factors that vary between these models are the training dataset and the parameter count.
> What a "mixture-of-denoiser objective" is
GPT-3 and PaLM are both decoder-only models. That is, they work by trying to predict the next token. BERT, on the other hand, uses an encoder before predicting a token, and this encoder is bidirectional (in the sense that it can see all tokens in the text, not only those that come before the one you are trying to predict). It has been observed that bidirectional models tend to be more sample efficient: being able to see everything while predicting some part of the text boosts model performance during training. I think UL2 is a way to incorporate this insight into decoder-only models. Bidirectionality can be achieved by masking part of the corpus and then asking the model to regenerate it (you can see examples on page 13 of the linked paper).
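To make the masking idea concrete, here is a toy version of T5-style span corruption, one of the denoisers mixed into the UL2 objective (this is my own illustration, not the paper's actual code; the sentinel tokens and helper name are made up):

```python
import random

# Placeholder sentinel tokens, in the style of T5 span corruption.
SENTINELS = ["<X>", "<Y>", "<Z>"]

def span_corrupt(tokens, span_len=2, n_spans=2, seed=0):
    """Mask n_spans contiguous spans with sentinel tokens.
    Input:  the text with each span replaced by a sentinel.
    Target: each sentinel followed by the tokens it hides."""
    rng = random.Random(seed)
    tokens = list(tokens)
    # Resample span start positions until the spans don't overlap.
    while True:
        starts = sorted(rng.sample(range(len(tokens) - span_len + 1), n_spans))
        if all(b - a >= span_len for a, b in zip(starts, starts[1:])):
            break
    inputs, targets, prev = [], [], 0
    for sentinel, s in zip(SENTINELS, starts):
        inputs += tokens[prev:s] + [sentinel]
        targets += [sentinel] + tokens[s:s + span_len]
        prev = s + span_len
    inputs += tokens[prev:]
    return inputs, targets

toks = "a cat is baking a cake on a lake".split()
inp, tgt = span_corrupt(toks)
print("input: ", " ".join(inp))
print("target:", " ".join(tgt))
```

The model sees the whole corrupted input at once when reconstructing each span, which is where the bidirectional context comes from; ordinary next-token prediction only ever conditions on the left.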
> How "the scaling properties" are measured
In this case, they measured it by training three PaLM models (one with 8B parameters, another with 62B, and the last with 540B). They observe a positive trend between parameter count and accuracy on multiple downstream tasks. By training the 62B PaLM with UL2, they observe that it achieves the same accuracy as the 540B model. That's what they mean by improving scaling properties.
> Impressively, at 540B scale, we show an approximately 2x computational savings rate where U-PaLM achieves the same performance as the final PaLM 540B model at around half its computational budget (i.e., saving ∼4.4 million TPUv4 hours).
Are you allowed to call your own work impressive in your abstract? Cool work, but that line is "transcendent."
Anyways, aren't scaling laws more like O(whatever) asymptotics? Like if you reduce your sorting algo from 6.4 n^2 seconds to 3.2 n^2, you don't say you "transcended the scaling laws," even though you sped it up a very significant amount. Am I misunderstanding?
They talk specifically of the scaling laws investigated in papers like OpenAI's "Scaling Laws for Neural Language Models" (2020)[1] and DeepMind's "An empirical analysis of compute-optimal large language model training" (2022)[2]. These showed that, as far as we could see at the time, for any sensibly designed language model, nuances of architecture do not matter nearly as much as the sheer scale of the model, the amount of compute spent training it, and the volume of the training dataset (indeed, they may not matter at all, and attempts to add cleverness via architecture may ultimately handicap the model at production scale). These papers introduced so-called "scaling laws": predictable, robust relationships between scaling parameters and loss (which is meaningfully related to textual coherence and apparent "understanding" in zero-shot tasks). For many, they heralded the age of throwing compute at the problem without academic tinkering, and possibly the final stretch before human-level text processing. See also Rich Sutton's Bitter Lesson [3].
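Concretely, those papers fit power laws of the form L ≈ a·N^(-b) between loss and model size, which are straight lines in log-log space. A minimal sketch of such a fit (the data points below are made up for illustration, not taken from any paper):

```python
import math

# Hypothetical loss vs. parameter-count points roughly following a power law.
params = [1e8, 1e9, 1e10, 1e11]   # model sizes (parameters)
losses = [3.5, 2.9, 2.4, 2.0]     # illustrative validation losses

# A power law L = a * N^(-b) is linear in the logs: log L = log a - b * log N,
# so fit it with ordinary least squares on (log N, log L).
xs = [math.log(n) for n in params]
ys = [math.log(l) for l in losses]
k = len(xs)
mx, my = sum(xs) / k, sum(ys) / k
b = -sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = math.exp(my + b * mx)
print(f"fitted law: L ≈ {a:.2f} * N^(-{b:.3f})")
```

"Breaking" a scaling law in the paper's sense means landing consistently below the curve such a fit predicts, rather than just moving along it by adding parameters or compute.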
In this sense it is remarkable when the curve predicted by "scaling laws" gets broken through, transcended, since it suggests there still exists a better Pareto frontier for text prediction. I see how this can appear normal; but the thing is, scaling laws have been well validated. We are no longer in the early era where one would expect to reap substantial returns from something as simple as changing the training objective.