Agreed, compared to other architectures, transformers are actually quite straightforward. The complexity comes more from training them in distributed setups: making the data loading work at scale, getting tensor parallelism right given the model size, and so on. The vanilla architecture is simple, but the practical implementation for large-scale training can get complicated.
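To make the tensor-parallelism point concrete, here's a toy numpy sketch (simulated shards, no real devices) of a column-parallel linear layer, which is roughly the basic trick behind Megatron-style tensor parallelism: each "device" holds a column slice of the weight matrix, computes its partial output independently, and the slices get combined (in practice via an all-gather across GPUs):

```python
import numpy as np

# Toy column-parallel linear layer: Y = X @ W, with W split
# column-wise across "devices". Each shard computes a slice of Y
# on its own; concatenating the slices recovers the full output.
# (A real implementation would all-gather across GPUs instead.)

rng = np.random.default_rng(0)
batch, d_in, d_out, n_devices = 4, 8, 6, 2

X = rng.standard_normal((batch, d_in))
W = rng.standard_normal((d_in, d_out))

# Split the weight matrix into column shards, one per "device".
W_shards = np.split(W, n_devices, axis=1)

# Each device computes its partial output using only its shard.
Y_shards = [X @ W_k for W_k in W_shards]

# "All-gather": concatenate partial outputs along the feature dim.
Y_parallel = np.concatenate(Y_shards, axis=1)

# Matches the single-device result exactly.
assert np.allclose(Y_parallel, X @ W)
print("column-parallel output matches, shape:", Y_parallel.shape)
```

The math is trivial; the hard part in practice is everything around it: overlapping the communication with compute, keeping shards load-balanced, and feeding data fast enough that the GPUs don't sit idle.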