> Process is nearly everything in performance/watt. ARM has consistently beat x8...

dragontamer · on Oct 25, 2021

> optimizations that allow these chips to have surreally large reordering buffer

But only Apple's chips have a large reorder buffer. ARM's Neoverse V1, N1, and N2 don't have anything close to it, and no one else is doing it either.

Apple made a bet and went very wide. I'm not 100% sure whether that bet is worth the tradeoffs, but I'm certain that if other companies thought a larger reorder buffer was useful, they'd have built one.
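To make the reorder-buffer argument a bit more concrete, here is a toy Python sketch, not a model of any real core. It assumes every instruction is independent, every `miss_every`-th instruction is a cache-missing load taking `miss_latency` cycles, and everything else takes 1 cycle. Instructions enter the buffer in order, complete out of order, and retire in order, so the buffer size caps how many misses can be in flight at once. The function name, parameters, and numbers are all hypothetical.

```python
from collections import deque


def simulate(rob_size: int, width: int, n_instr: int,
             miss_every: int, miss_latency: int) -> float:
    """Return instructions per cycle (IPC) for a toy out-of-order core.

    Hypothetical model: all instructions are independent; instruction i is a
    long-latency load when i % miss_every == 0, otherwise it takes 1 cycle.
    Requires rob_size >= 1, width >= 1, n_instr >= 1, miss_every >= 1.
    """
    rob = deque()  # completion cycle of each in-flight instruction, oldest first
    next_instr = 0
    retired = 0
    cycle = 0
    while retired < n_instr:
        cycle += 1
        # Retire in program order, at most `width` per cycle, and only once the
        # oldest instruction has finished (a miss at the head blocks everything).
        n = 0
        while rob and n < width and rob[0] <= cycle:
            rob.popleft()
            retired += 1
            n += 1
        # Dispatch in program order, at most `width` per cycle, while the
        # reorder buffer still has free entries.
        n = 0
        while next_instr < n_instr and n < width and len(rob) < rob_size:
            latency = miss_latency if next_instr % miss_every == 0 else 1
            rob.append(cycle + latency)
            next_instr += 1
            n += 1
    return n_instr / cycle


if __name__ == "__main__":
    for rob_size in (64, 128, 256, 512):
        ipc = simulate(rob_size, width=8, n_instr=100_000,
                       miss_every=50, miss_latency=200)
        print(f"ROB {rob_size:4d}: IPC {ipc:.2f}")
```

In this model a bigger buffer lets later misses start while an earlier one is still outstanding, so IPC rises with `rob_size`. Because it ignores data dependencies, branch mispredictions, and execution-port limits, it overstates what a big buffer buys on real code, which is exactly the kind of tradeoff in question here.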

I'll give Apple credit for deciding that width still had room to grow. But it's a very weird design. Despite all that width, Apple CPUs don't have SMT, so I'd expect a lot of the potential performance to be "wasted" on idle pipelines, and SMT would really help the design.

Like, who makes an 8-wide chip that supports only 1 thread? Apple, but... no one else. IBM's 8-wide decode is on an SMT4 chip (4 threads per core).
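A crude way to see the idle-pipeline argument: the toy Python model below (an illustration under stated assumptions, not how Apple or IBM cores actually schedule) gives each hardware thread some number of ready, independent instructions per cycle and lets the core issue at most `width` of them, shared across threads. The function names, the uniform 0 to 6 ILP distribution, and the rule that unissued instructions are dropped rather than carried over are all hypothetical simplifications.

```python
import random


def utilization(width: int, ready_per_thread: list[list[int]]) -> float:
    """Fraction of issue slots filled across all cycles.

    ready_per_thread[t][c] is how many independent instructions thread t could
    issue in cycle c (hypothetical input). Slots are shared by all threads, and
    leftover ready instructions are simply dropped (a simplification).
    """
    cycles = len(ready_per_thread[0])
    issued = 0
    for c in range(cycles):
        offered = sum(thread[c] for thread in ready_per_thread)
        issued += min(width, offered)
    return issued / (width * cycles)


def random_ilp(cycles: int, max_ready: int, seed: int) -> list[int]:
    """Per-cycle ready counts drawn uniformly from 0..max_ready (an assumption)."""
    rng = random.Random(seed)
    return [rng.randint(0, max_ready) for _ in range(cycles)]


if __name__ == "__main__":
    WIDTH, CYCLES = 8, 100_000
    t0 = random_ilp(CYCLES, max_ready=6, seed=1)
    t1 = random_ilp(CYCLES, max_ready=6, seed=2)
    print(f"1 thread : {utilization(WIDTH, [t0]):.0%} of issue slots used")
    print(f"2 threads: {utilization(WIDTH, [t0, t1]):.0%} of issue slots used")
```

With one thread averaging 3 ready instructions per cycle, an 8-wide core fills under half its slots; a second thread roughly doubles the offered work, which is the intuition behind pairing very wide cores with SMT. Real SMT threads also compete for caches, reorder-buffer entries, and branch-predictor state, which this sketch ignores.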

rbanffy · on Oct 25, 2021

SMT is a good way to extract parallelism when your ISA makes it harder to do so otherwise (with speculative execution and register renaming). ARM, it seems, makes that easier, to the point that I don't think any ARM CPU runs multiple threads per core.

I would expect POWER to be more amenable to it, but x86 borrows heavily from the 8085 ISA and was designed at a time when the best IPC you could hope for was 1.

pthariensflame · on Oct 26, 2021

Minor aside: Arm does, in fact, have a recent CPU family with 2-way SMT: Cortex-A65(AE)/Neoverse E1.