There was a really interesting post a while ago about adjusting the softmax function so attention heads can decline to make a choice (https://www.evanmiller.org/attention-is-off-by-one.html). It seems like that might remove the need for these attention sinks entirely. I keep meaning to go in and actually test this, but boy, time gets away from you...
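For anyone who wants to poke at it, here's a minimal sketch of that "+1 in the denominator" softmax in PyTorch. The function name and the max-shift for numerical stability are my own choices, not from the post; the idea is just that all weights can go to ~0 when every logit is very negative, so a head can effectively abstain:

```python
import torch

def softmax_one(x, dim=-1):
    # "quiet softmax": exp(x_i) / (1 + sum_j exp(x_j))
    # Shift by max(x, 0) for stability; the extra "1" becomes exp(-m)
    # after the shift, so the math is unchanged.
    m = torch.clamp(x.max(dim=dim, keepdim=True).values, min=0)
    e = torch.exp(x - m)
    return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))
```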
Feel free to mess with it. His tweak to softmax was actually already supported by PyTorch before the article was written, just off by default. Maybe it needs to be more widely used, though; after all, good ideas are often independently discovered multiple times. Details are in this tweet: https://twitter.com/SamuelMullr/status/1683582347793530884 or, if you don't like Twitter, the option is add_zero_attn on PyTorch's MultiheadAttention.
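If you'd rather just flip the existing switch, it's roughly this (the dimensions are made up for the example). add_zero_attn appends an all-zero key/value pair to the sequence, so the zero key contributes exp(0) = 1 to the softmax denominator, which is essentially the same off-by-one fix:

```python
import torch
import torch.nn as nn

# The zero key/value gives every head a "null" slot to dump attention on.
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8,
                            add_zero_attn=True, batch_first=True)

x = torch.randn(2, 10, 64)       # (batch, seq, embed)
out, weights = mha(x, x, x)      # weights include one extra key position
```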