A Systematic Investigation of Commonsense Understanding in Large Language Models

PaulHoule · on Nov 19, 2021

For a long time I've seen a "double standard" in how people thought about the old AI and how they think about the new AI.

People looked at systems like Eliza and it was obvious pretty quickly that they lacked the structure to solve the language understanding problem, and also that they relied on people's meaning-making ability to seem as if they were conversing.

It drives me nuts that people insist that the emperor wears clothes with things like GPT-3. As I see it, GPT-3 is kept under wraps not because it is so powerful as to create ethical problems, but rather to prevent people from seeing how clueless it really is. (Something that can write text that superficially resembles what an expert writes is in no way a model of, or substitute for, the expert.)

It makes me so glad to see somebody looking at this soberly for once.

stellaathena · on Nov 20, 2021

This paper doesn't make a whole lot of sense to me. I'm not familiar with any work that claims meaningful zero-shot performance in models as small as the ones considered in this paper. Quite the opposite: both the GPT-3 and FLAN papers claim that zero-shot behavior doesn't arise until models are much larger than 7B. The T0 paper seems closest to the claims in this paper, but even then they're talking about a multitask-trained 11B parameter model, not a 7B "normal" model.

The paper opens by saying "Large language models (with more than 1 billion parameters) perform well on a range of natural language processing (NLP) tasks in zero- and few-shot settings, without requiring task-specific supervision," and cites Radford et al. (2019), Brown et al. (2020), and Patwary et al. (2021) for this sentence. But the first paper doesn't claim zero-shot performance and the other two sources are about models orders of magnitude larger! I don't see any evidence in this paper to support the idea that there are people going around claiming that 1B+ parameter models have impressive zero-shot performance.

They make a similar error that reinforces this one in their conclusion as well. They say "At first sight, these models show impressive zero-shot performance suggesting that they capture commonsense knowledge," completely ignoring the fact that they never actually showed that. They also never explain how large their "SOTA" model is, which seems quite important. Nor do they ever compare to the models that are actually claiming zero-shot performance. Their model performs similarly to GPT-3 13B on HellaSwag, 1.3B on Winogrande, and 6.7B on PiQA. There are clearly some large confounding factors they aren't controlling for here.

The fact that ML researchers, and especially NLP researchers, use extremely low-quality baselines is not news. People publish papers pointing this out all the time. The same is true of the fact that many NLP datasets are garbage. "PiQA and HellaSwag are bad evaluation metrics" is potentially a worthwhile addition to the literature (I haven't checked if these particular datasets have been critiqued), but it is something personally known to me and something that no NLP researcher should find surprising. If people are surprised by this, I think that those people really need to spend more time reading the literature and evaluating datasets. Your default assumption should be that a benchmark eval is only loosely correlated with what it nominally measures. And none of these things really has any bearing on zero-shot generalization.

If the primary value of this paper is pointing out that the datasets are bad, that can manifest as misleading zero-shot scores, but it also manifests as misleading few-shot scores and misleading fine-tuned scores. I would expect fine-tuned and few-shot versions of Fig 6 to look pretty much the same, but we are not shown such plots.

Why? I can't be sure, but the authors take themselves to be specifically criticizing zero-shot claims, and if those plots look as I expect, they would significantly undermine the authors' claims. And even if they don't, these models are not being evaluated in a regime in which people are actually claiming significant zero-shot performance. The paper's entire framing is predicated on the false claim that people are claiming that 1B+ parameter models are zero-shot learners.

holonomically · on Nov 20, 2021

> The paper opens by saying "Large language models (with more than 1 billion parameters) perform well on a range of natural language processing (NLP) tasks in zero- and few-shot settings, without requiring task-specific supervision," and cites Radford et al. (2019), Brown et al. (2020), and Patwary et al. (2021) for this sentence. But the first paper doesn't claim zero-shot performance and the other two sources are about models orders of magnitude larger! I don't see any evidence in this paper to support the idea that there are people going around claiming that 1B+ parameter models have impressive zero-shot performance.

Maybe not zero-shot performance, but they definitely claim few-shot performance, which is what your quoted sentence is saying:

> Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. [1]

1: Language Models are Few-Shot Learners - https://arxiv.org/abs/2005.14165

stellaathena · on Nov 21, 2021

That paper also claims zero-shot performance. However, the models that the bulk of the claims are made about are 20x the size of the ones considered in this paper. That's completely consistent with what I said.

holonomically · on Nov 21, 2021

I don't see how that makes a difference. Even if the models were 100x the size, the same argument could still be carried through. There is no theoretical reason to believe increasing the number of parameters does anything more than allow encoding more of the training set into the parameters. There seems to be a fundamental confusion about what these language models are actually doing: they're glorified compression algorithms. [1] There is no reason to expect any kind of generalization performance from them on common sense tasks.

1: https://bellard.org/nncp/
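
For anyone unfamiliar with the link between language modeling and compression behind that reference, here is a toy sketch. It is not NNCP's code: the adaptive character-count model, the add-one smoothing, the 256-symbol alphabet, and both function names are assumptions made up for illustration. The point is only that any model which predicts the next symbol with probability p can, paired with an arithmetic coder, encode that symbol in about -log2(p) bits, so better prediction means a smaller file.

```python
import math
from collections import defaultdict


def ideal_code_length_bits(text: str, order: int = 1, alphabet_size: int = 256) -> float:
    """Bits an ideal arithmetic coder would spend on `text` when driven by an
    adaptive character model conditioned on the previous `order` characters.

    Hypothetical toy model (add-one smoothing over an assumed alphabet of
    `alphabet_size` symbols); a neural compressor would use a network here.
    """
    counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
    totals = defaultdict(int)                       # context -> total count
    bits = 0.0
    for i, ch in enumerate(text):
        ctx = text[max(0, i - order):i]
        # Predict BEFORE updating the counts: a decoder only knows the past.
        p = (counts[ctx][ch] + 1) / (totals[ctx] + alphabet_size)
        bits += -math.log2(p)
        counts[ctx][ch] += 1
        totals[ctx] += 1
    return bits


def uniform_code_length_bits(text: str, alphabet_size: int = 256) -> float:
    """Baseline: no model at all, every symbol costs log2(alphabet_size) bits."""
    return len(text) * math.log2(alphabet_size)


if __name__ == "__main__":
    sample = "ab" * 200
    print("uniform bits :", uniform_code_length_bits(sample))
    print("modelled bits:", ideal_code_length_bits(sample))
```

The sketch measures code length only; it does not show whether a model that compresses well also generalizes, which is the question being argued in this thread.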