Many of the algorithms built on BLAS are not easily parallelized. For example, QR factorization is an inherently sequential algorithm. Optimizing BLAS performance comes mainly from restructuring the sequential algorithm to operate on larger blocks, so that memory is accessed efficiently. As Jim Demmel is fond of saying, floating point operations are cheap; memory movement is expensive.
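The blocking idea can be sketched with a toy blocked matrix multiply in Python/NumPy. This is an illustration of the technique, not how an actual BLAS is written (real implementations are carefully tuned C or assembly); the function name and block size below are illustrative choices.

```python
import numpy as np

def blocked_matmul(A, B, block=64):
    """Compute A @ B one block at a time.

    Each inner update touches only three small submatrices, so the
    operands can stay resident in cache: the same data is reused
    many times per load, trading cheap flops for less memory traffic.
    (Hypothetical sketch; 'block' would be tuned to the cache size.)
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i in range(0, n, block):
        for j in range(0, m, block):
            for p in range(0, k, block):
                # Accumulate the (i, j) block of C from one block
                # of A and one block of B.
                C[i:i+block, j:j+block] += (
                    A[i:i+block, p:p+block] @ B[p:p+block, j:j+block]
                )
    return C
```

The loop order and block size do not change the arithmetic at all; they only change the order in which memory is touched, which is exactly where the performance lives.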