Hacker News

This sounds very interesting but I lack the technical depth in language models to understand it. In particular I can't parse the following excerpt:

> The key idea is to continue training a state-of-the-art large language model (e.g., PaLM) on a few more steps with UL2's mixture-of-denoiser objective. We show that, with almost negligible extra computational costs and no new sources of data, we are able to substantially improve the scaling properties of large language models on downstream metrics. In this paper, we continue training PaLM with UL2R, introducing a new set of models at 8B, 62B, and 540B scale which we call U-PaLM.

Things I don't understand:

* PaLM (and its advantages/disadvantages relative to other LMs)

* What a "mixture-of-denoiser objective" is

* How "the scaling properties" are measured

I'd be interested in a more accessible summary of how this works, if HN has any references.



> PaLM (and its advantages/disadvantages relative to other LMs)

Its architecture is very similar to other LLMs like GPT-3. I am not sure, but it perhaps makes minor modifications such as the activation function employed; other than that it's the same. The relevant factors that vary between these models are the training dataset and the parameter count.
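To make "minor modifications like the activation function" concrete: PaLM replaces the usual GELU feed-forward layer with a SwiGLU gated unit. A minimal numpy sketch (toy dimensions, not the real model sizes):

```python
import numpy as np

def swish(x):
    # swish / SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu(x, W, V):
    # SwiGLU gated feed-forward unit: swish(x @ W) gates (x @ V).
    # PaLM uses this in place of the standard GELU feed-forward layer.
    return swish(x @ W) * (x @ V)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))    # one token's hidden state (toy width)
W = rng.normal(size=(8, 16))   # gate projection
V = rng.normal(size=(8, 16))   # value projection
out = swiglu(x, W, V)
print(out.shape)
```

The gating means two projections instead of one, but in practice the hidden width is adjusted so the parameter count stays comparable.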

> What a "mixture-of-denoiser objective" is

GPT-3 and PaLM are both encoder-only models. That is, they work by trying to predict the next token. BERT, on the other hand, uses an encoder before predicting a token, and this encoder is bidirectional (in the sense that it can see all tokens in the text, not only the ones that come before the token you are trying to predict). It has been noticed that bidirectional models tend to be more sample-efficient: being able to see everything while predicting some part of the text seems to boost model performance during training. I think UL2 is a way to incorporate this insight into decoder-only models. Bidirectionality can be achieved by masking parts of the corpus and then asking the model to regenerate them (you can see examples on page 13 of the linked paper).
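The masking idea is easy to show with a toy span-corruption example (this is my own illustrative sketch, not code from the paper; the sentinel names `<X0>`, `<X1>` are made up). Random spans are replaced by sentinels in the input, and the model's denoising objective is to reconstruct the removed spans — so it sees context on both sides of each gap:

```python
import random

def span_corrupt(tokens, span_len=2, n_spans=2, seed=0):
    """Toy span corruption: replace random non-overlapping spans with
    sentinel tokens. The training target is the concatenation of each
    sentinel followed by the span it hides."""
    rng = random.Random(seed)
    tokens = list(tokens)
    # pick non-overlapping span start positions
    starts = []
    while len(starts) < n_spans:
        s = rng.randrange(0, len(tokens) - span_len + 1)
        if all(abs(s - t) >= span_len for t in starts):
            starts.append(s)
    starts.sort()
    corrupted, targets = [], []
    i = 0
    for k, s in enumerate(starts):
        corrupted.extend(tokens[i:s])
        corrupted.append(f"<X{k}>")          # sentinel marks the gap
        targets.append(f"<X{k}>")
        targets.extend(tokens[s:s + span_len])  # tokens to reconstruct
        i = s + span_len
    corrupted.extend(tokens[i:])
    return corrupted, targets

toks = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = span_corrupt(toks)
print(inp)
print(tgt)
```

UL2's "mixture of denoisers" trains on several variants of this at once (different span lengths and corruption rates, plus plain next-token prediction).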

> How "the scaling properties" are measured

In this case, they measured it by training three PaLM models (one with 8B parameters, another with 62B, and the last with 540B). They observe a positive trend between the number of parameters and accuracy on multiple downstream tasks. By continuing to train the 62B PaLM with UL2R, they observe that it achieves roughly the same accuracy as the 540B model. That's what they mean by improving scaling properties.
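In other words, "scaling properties" just means the curve of downstream accuracy versus model size; an improvement shifts that curve up. A sketch of how you'd quantify such a trend (the accuracy numbers here are invented for illustration, not from the paper):

```python
import numpy as np

# Hypothetical accuracies at the three parameter counts -- made-up
# values purely to illustrate fitting a scaling trend.
params = np.array([8e9, 62e9, 540e9])
acc = np.array([0.40, 0.52, 0.62])  # NOT the paper's numbers

# Fit accuracy as a linear function of log10(parameters): the slope
# summarizes how fast accuracy grows as the model scales up.
slope, intercept = np.polyfit(np.log10(params), acc, 1)
print(f"accuracy gain per 10x parameters: {slope:.3f}")
```

A method that "improves scaling" would let a smaller model land on (or above) the larger model's point on this curve, which is the 62B-matches-540B claim.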


>GPT-3 and PaLM are both encoder-only models

You mean decoder-only


Did you try reading the paper (past the abstract)? It provides the reference to the original PaLM paper and answers the rest of your questions.


No; it wouldn't load when I tried.




