Thanks for the benchmarks. Do you see some intrinsic reason why a Fortran compiler couldn't do these optimizations? I would think it would have all the information, so it would be the ideal place to do them.
There is no reason it could not. Those optimizations just have to be implemented.
Flang (the one merged into LLVM) uses MLIR, which has all the required code-generation abilities. That just leaves the cost modeling, i.e. deciding which optimizations/transformations to apply.
For BLAS in particular, this paper can give you an idea of some of MLIR's capabilities: https://arxiv.org/pdf/2003.00532.pdf
(But maybe you already know them better than I do.)
LoopVectorization can't do many of these yet, so its performance will fall off a cliff shortly after the largest size on the plots (and at much smaller sizes on CPUs with a smaller L2 cache). In my actual matmul code I had to add packing/tiling on top of what it generated.
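This isn't LoopVectorization's actual implementation, just a minimal pure-Python sketch of what tiling means for matmul: the three loops are blocked so that each `tile × tile` working set of `A`, `B`, and `C` can stay resident in cache while it is reused (the function names and tile size are made up for illustration):

```python
def matmul_naive(A, B):
    # Textbook triple loop: strides over all of B for every row of A.
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += A[i][p] * B[p][j]
            C[i][j] = s
    return C

def matmul_tiled(A, B, tile=4):
    # Blocked version: the outer ii/pp/jj loops walk tile-sized blocks,
    # so each inner kernel reuses a small, cache-resident chunk of data.
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, tile):
        for pp in range(0, k, tile):
            for jj in range(0, m, tile):
                # min(...) handles the edge blocks when the size
                # is not a multiple of the tile (the "clean-up" work).
                for i in range(ii, min(ii + tile, n)):
                    for p in range(pp, min(pp + tile, k)):
                        a = A[i][p]
                        for j in range(jj, min(jj + tile, m)):
                            C[i][j] += a * B[p][j]
    return C
```

A real kernel would additionally pack the blocks into contiguous buffers and vectorize the innermost loop, but the blocking structure above is the part that keeps performance from collapsing once the matrices outgrow L2.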
So the fact that MLIR can already generate that sort of code looks promising. Still, the work of telling it what to do isn't easy.
I'm not involved in any of those projects, so everything I say here is pure speculation. But I imagine pragmas and the like would be important whenever the compiler doesn't know the sizes at compile time.
Otherwise, you probably don't want it to generate massive amounts of code (multiple extra blocking loops, heavy unrolling in a main kernel, and multiple clean-up kernels) for every random loop nest.
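To make the code-size cost concrete: even something as small as a dot product, unrolled by 4, needs a separate clean-up loop for the leftover elements. A compiler that did this to every loop nest, at several unroll factors and block sizes, multiplies that duplication quickly. A toy pure-Python sketch (names and unroll factor are my own):

```python
def dot_unrolled(x, y):
    n = len(x)
    # Main kernel: four independent accumulators, 4 elements per iteration.
    s0 = s1 = s2 = s3 = 0.0
    i = 0
    while i + 4 <= n:
        s0 += x[i] * y[i]
        s1 += x[i + 1] * y[i + 1]
        s2 += x[i + 2] * y[i + 2]
        s3 += x[i + 3] * y[i + 3]
        i += 4
    # Clean-up kernel: the n % 4 trailing elements get a scalar loop.
    s = s0 + s1 + s2 + s3
    while i < n:
        s += x[i] * y[i]
        i += 1
    return s
```

One unrolled loop already doubled the source next to the scalar version; add blocking loops and multiple clean-up variants and the expansion per loop nest becomes hard to justify without size information or a hint from the programmer.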