I'm not sure this holds. x64 CPUs do have to decode instructions, which is awkward given the variable-length encoding, but they store the decoded result (micro-ops) in a cache and execute from that cache. So the decoding cost is only paid on a cache miss, and a cache miss on a deeply pipelined CPU is roughly game over for performance anyway.
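A rough software analogue of the "decode once, run from the cache" idea, the same trick many emulators use. This is only a toy model, not how the hardware does it. The two-instruction encoding, the `DecodeCache` class and the `run` helper are all hypothetical. In a tight loop the decoder runs once per instruction address, no matter how many times the loop executes:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decoded:
    op: str
    arg: int


class DecodeCache:
    """Toy model: decode each instruction once, then reuse the decoded form."""

    def __init__(self, program):
        self.program = program          # list of raw encoded instructions (bytes)
        self.cache = {}                 # pc -> Decoded
        self.decodes = 0                # how many times the "expensive" decoder ran

    def _decode(self, pc):
        # Hypothetical encoding: byte 0 is the opcode, remaining bytes are a
        # little-endian signed operand (empty means 0).
        raw = self.program[pc]
        self.decodes += 1
        op = {0x01: "dec", 0x02: "jnz"}[raw[0]]
        return Decoded(op, int.from_bytes(raw[1:], "little", signed=True))

    def fetch(self, pc):
        d = self.cache.get(pc)
        if d is None:                   # miss: pay the decode cost, fill the cache
            d = self._decode(pc)
            self.cache[pc] = d
        return d                        # hit: no decoding at all


def run(program, counter):
    """Execute the program; return (instructions executed, decodes performed)."""
    dc = DecodeCache(program)
    pc, executed = 0, 0
    while pc < len(program):
        ins = dc.fetch(pc)
        executed += 1
        if ins.op == "dec":
            counter -= 1
            pc += 1
        elif ins.op == "jnz":           # jump to ins.arg while counter != 0
            pc = ins.arg if counter != 0 else pc + 1
    return executed, dc.decodes


# A two-instruction loop: "dec counter; jnz 0".
PROGRAM = [bytes([0x01]), bytes([0x02, 0x00])]
print(run(PROGRAM, 1000))
```

With a counter of 1000 this executes 2000 instructions but decodes only 2. The decode cost only matters again when execution reaches an address that isn't in the cache.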