
The abstract of the paper (https://arxiv.org/abs/1706.03762) explains it well:

"The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely."

They did not invent attention: earlier sequence models had used attention as an auxiliary mechanism layered on top of recurrent encoder-decoders. The Transformer removed everything except the attention, and the models still worked. Really, the title already says it all.
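
For the unfamiliar, the core operation is just scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V. Here's a rough NumPy sketch of self-attention on a toy sequence (the token count, model dimension, and projection matrices are made-up for illustration, not the paper's sizes):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) similarity matrix
        weights = softmax(scores, axis=-1)  # each row is a distribution over tokens
        return weights @ V                  # weighted mix of value vectors

    # Toy self-attention: the same 4-token sequence provides queries, keys, and values.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))             # 4 tokens, model dim 8 (illustrative sizes)
    Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
    out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
    print(out.shape)                        # (4, 8)

The full model stacks many of these (multi-head attention plus feed-forward layers), but there is no recurrence anywhere: every token attends to every other token in a single step.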


