Hacker News

This sounds very interesting but I lack the technical depth in language models to understand it. In particular I can't parse the following excerpt:

> The key idea is to continue training a state-of-the-art large language model (e.g., PaLM) on a few more steps with UL2's mixture-of-denoiser objective. We show that, with almost negligible extra computational costs and no new sources of data, we are able to substantially improve the scaling properties of large language models on downstream metrics. In this paper, we continue training PaLM with UL2R, introducing a new set of models at 8B, 62B, and 540B scale which we call U-PaLM.

Things I don't understand:

* PaLM (and its advantages/disadvantages relative to other LMs)

* What a "mixture-of-denoiser objective" is

* How "the scaling properties" are measured

I'd be interested in a more accessible summary of how this works, if HN has any references.



> PaLM (and its advantages/disadvantages relative to other LMs)

Its architecture is very similar to other LLMs like GPT-3. I am not sure, but it perhaps makes minor modifications such as the activation function employed; other than that it's the same. The relevant factors that vary between these models are the training dataset and the parameter count.
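To make "minor modifications like the activation function" concrete: PaLM replaces the usual GELU feed-forward layer with a SwiGLU gated unit. A minimal numpy sketch (toy dimensions, not the real model sizes):

```python
import numpy as np

def swish(x):
    # swish / SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu(x, W, V):
    # SwiGLU gated feed-forward unit: swish(x @ W) gates (x @ V).
    # PaLM uses this in place of the standard GELU feed-forward layer.
    return swish(x @ W) * (x @ V)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))    # one token's hidden state (toy width)
W = rng.normal(size=(8, 16))   # gate projection
V = rng.normal(size=(8, 16))   # value projection
out = swiglu(x, W, V)
print(out.shape)
```

The gating means two projections instead of one, but in practice the hidden width is adjusted so the parameter count stays comparable.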

> What a "mixture-of-denoiser objective" is

GPT-3 and PaLM are both encoder-only models. That is, they work by trying to predict the next token. BERT, on the other hand, uses an encoder before predicting a token, and this encoder is bidirectional (in the sense that it can see all tokens in the text, not only the ones that come before the token you are trying to predict). It has been noticed that bidirectional models tend to be more sample-efficient: being able to see everything while predicting some part of the text seems to boost model performance during training. I think UL2 is a way to incorporate this insight into decoder-only models. Bidirectionality can be achieved by masking parts of the corpus and then asking the model to regenerate them (you can see examples on page 13 of the linked paper).
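The masking idea is easy to show with a toy span-corruption example (this is my own illustrative sketch, not code from the paper; the sentinel names `<X0>`, `<X1>` are made up). Random spans are replaced by sentinels in the input, and the model's denoising objective is to reconstruct the removed spans — so it sees context on both sides of each gap:

```python
import random

def span_corrupt(tokens, span_len=2, n_spans=2, seed=0):
    """Toy span corruption: replace random non-overlapping spans with
    sentinel tokens. The training target is the concatenation of each
    sentinel followed by the span it hides."""
    rng = random.Random(seed)
    tokens = list(tokens)
    # pick non-overlapping span start positions
    starts = []
    while len(starts) < n_spans:
        s = rng.randrange(0, len(tokens) - span_len + 1)
        if all(abs(s - t) >= span_len for t in starts):
            starts.append(s)
    starts.sort()
    corrupted, targets = [], []
    i = 0
    for k, s in enumerate(starts):
        corrupted.extend(tokens[i:s])
        corrupted.append(f"<X{k}>")          # sentinel marks the gap
        targets.append(f"<X{k}>")
        targets.extend(tokens[s:s + span_len])  # tokens to reconstruct
        i = s + span_len
    corrupted.extend(tokens[i:])
    return corrupted, targets

toks = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = span_corrupt(toks)
print(inp)
print(tgt)
```

UL2's "mixture of denoisers" trains on several variants of this at once (different span lengths and corruption rates, plus plain next-token prediction).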

> How "the scaling properties" are measured

In this case, they measured it by training three PaLM models (one with 8B parameters, another with 62B, and the last with 540B). They observe a positive trend between the number of parameters and accuracy on multiple downstream tasks. By continuing to train the 62B PaLM with UL2R, they observe that it achieves roughly the same accuracy as the 540B model. That's what they mean by improving scaling properties.
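In other words, "scaling properties" just means the curve of downstream accuracy versus model size; an improvement shifts that curve up. A sketch of how you'd quantify such a trend (the accuracy numbers here are invented for illustration, not from the paper):

```python
import numpy as np

# Hypothetical accuracies at the three parameter counts -- made-up
# values purely to illustrate fitting a scaling trend.
params = np.array([8e9, 62e9, 540e9])
acc = np.array([0.40, 0.52, 0.62])  # NOT the paper's numbers

# Fit accuracy as a linear function of log10(parameters): the slope
# summarizes how fast accuracy grows as the model scales up.
slope, intercept = np.polyfit(np.log10(params), acc, 1)
print(f"accuracy gain per 10x parameters: {slope:.3f}")
```

A method that "improves scaling" would let a smaller model land on (or above) the larger model's point on this curve, which is the 62B-matches-540B claim.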


>GPT-3 and PaLM are both encoder-only models

You mean decoder-only


Did you try reading the paper (past the abstract)? It provides the reference to the original PaLM paper and answers the rest of your questions.


No; it wouldn't load when I tried.




