
Many of the algorithms in BLAS are not easily parallelized. For example, a QR factorization is an inherently sequential algorithm. Most of the performance gain in optimized BLAS comes from restructuring the sequential algorithm into larger blocks that access memory more efficiently. As Jim Demmel is fond of saying, floating point operations are cheap; memory movement is expensive.
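
To make the blocking idea concrete, here's a rough sketch in C (not actual BLAS code, just an illustration) of a naive versus a cache-blocked matrix multiply; the matrix size N and block size BS are made-up values you'd tune so a few blocks fit in cache:

    #include <stddef.h>

    #define N  1024   /* matrix dimension (illustrative)            */
    #define BS 64     /* block size, a guess; tune to fit the cache */

    /* Naive triple loop: walks B column-wise, so for large N nearly
       every inner-loop access misses cache. */
    void matmul_naive(const double *A, const double *B, double *C) {
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++) {
                double sum = 0.0;
                for (size_t k = 0; k < N; k++)
                    sum += A[i*N + k] * B[k*N + j];
                C[i*N + j] = sum;
            }
    }

    /* Same arithmetic, reordered into BS x BS blocks so each block of
       A, B, and C stays resident in cache while it is reused. */
    void matmul_blocked(const double *A, const double *B, double *C) {
        for (size_t i = 0; i < N*N; i++) C[i] = 0.0;
        for (size_t ii = 0; ii < N; ii += BS)
            for (size_t kk = 0; kk < N; kk += BS)
                for (size_t jj = 0; jj < N; jj += BS)
                    for (size_t i = ii; i < ii + BS; i++)
                        for (size_t k = kk; k < kk + BS; k++) {
                            double a = A[i*N + k];
                            for (size_t j = jj; j < jj + BS; j++)
                                C[i*N + j] += a * B[k*N + j];
                        }
    }

Both versions do exactly the same floating point work; only the order changes, so a block of B is reused many times before being evicted. The innermost loop also streams a row of B contiguously, which is where most of the speedup comes from.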


QR decomposition isn't in BLAS; you're probably thinking of LAPACK.



