
This is going to shred when it reaches industry.

But yeah, it's very math-heavy for a software engineering paper.




You seem to know your stuff.

Will this technique work with existing model weights?


Yes, it's just a way of computing self-attention in a distributed fashion, so the weights themselves don't need to change.
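
A rough sketch of why that would hold, assuming the technique boils down to splitting the keys and values across workers and merging partial softmax results (the thread doesn't spell out the paper's actual scheme, so treat this as an illustration only). The function names and the shard layout are made up for the example; the point is just that the sharded computation gives the same output as ordinary attention on the same inputs, so nothing about the trained weights has to change.

```python
import math

def _scores(q, keys):
    # Scaled dot-product scores of one query against a list of keys.
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]

def attention(q, keys, values):
    """Ordinary softmax attention for a single query vector."""
    scores = _scores(q, keys)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def partial_attention(q, keys, values):
    """What one hypothetical worker computes on its shard:
    (local max score, local sum of exps, unnormalized weighted sum)."""
    scores = _scores(q, keys)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    acc = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return m, sum(weights), acc

def merge(parts):
    """Combine per-shard partials by rescaling each to the global max."""
    m = max(p[0] for p in parts)
    dim = len(parts[0][2])
    total = 0.0
    acc = [0.0] * dim
    for local_max, local_sum, local_acc in parts:
        rescale = math.exp(local_max - m)
        total += local_sum * rescale
        for i in range(dim):
            acc[i] += local_acc[i] * rescale
    return [a / total for a in acc]

def distributed_attention(q, keys, values, num_shards):
    """Split keys/values into contiguous shards (an assumed layout),
    compute partials independently, then merge."""
    size = math.ceil(len(keys) / num_shards)
    parts = [partial_attention(q, keys[i:i + size], values[i:i + size])
             for i in range(0, len(keys), size)]
    return merge(parts)
```

In a real setup each `partial_attention` call would run on a different device and only the small `(max, sum, acc)` triples would be communicated; here everything runs in one process just to show the arithmetic matches.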



