"For simplicity I’ll try things on my laptop computer with Ext3+dmcrypt and an SSD. This is “read a 128MB file and write it out”, repeated for different block sizes, timing each"
The whole thing is completely invalid for measuring actual I/O hierarchy efficiency because (a) the write sizes are too small, so the data would be sitting in a buffer cache of unknown hotness, (b) dmcrypt introduces a whole layer of indirection and timing variability, and (c) on an SSD, almost anything could be happening with caching and syncs. On top of that: mount options, percentage of disk used, small sample sizes, unknown contention effects, etc.
This is a good example of how to convince yourself of something and yet be less accurate than a divining rod.
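For concreteness, here is roughly what the quoted test amounts to, as a minimal Python sketch. The original script isn't shown, so everything here is an assumption: the function names `copy_timed` and `bench`, the unbuffered file handles, the `fsync` before the clock stops, and the repeated trials. The `fsync` only forces the written data out of the page cache; it does nothing about a warm read cache, dmcrypt overhead, or whatever the SSD does internally, so the numbers are still not a measurement of the I/O hierarchy.

```python
import os
import tempfile
import time


def copy_timed(src, dst, block_size):
    """Copy src to dst in block_size chunks; return elapsed seconds, fsync included."""
    start = time.perf_counter()
    # buffering=0 gives raw file objects, so each read/write maps to one syscall
    # of (at most) block_size bytes instead of going through Python's own buffer.
    with open(src, "rb", buffering=0) as fin, open(dst, "wb", buffering=0) as fout:
        while True:
            chunk = fin.read(block_size)
            if not chunk:
                break
            view = memoryview(chunk)
            while view:  # raw writes may be partial, so loop until the chunk is out
                written = fout.write(view)
                view = view[written:]
        # Assumption: flush to the device before stopping the clock. Without this
        # the "write" mostly measures a memcpy into the page cache.
        os.fsync(fout.fileno())
    return time.perf_counter() - start


def bench(src, block_sizes, trials=5):
    """Return {block_size: [elapsed seconds per trial]} for copying src."""
    results = {}
    # Write the copies next to src so they land on the same filesystem.
    src_dir = os.path.dirname(os.path.abspath(src))
    with tempfile.TemporaryDirectory(dir=src_dir) as tmp:
        dst = os.path.join(tmp, "copy.out")
        for bs in block_sizes:
            results[bs] = [copy_timed(src, dst, bs) for _ in range(trials)]
    return results
```

Calling something like `bench("testfile.bin", [512, 4096, 65536, 1 << 20])` on a 128MB file (the file name is hypothetical) returns the raw per-trial timings rather than a single number per block size, which at least makes the run-to-run spread visible.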