
> Process is nearly everything in performance/watt.

ARM has consistently beaten x86 in performance/watt at larger node sizes since the beginning. The first Archimedes had better floating-point performance without a dedicated FPU than the then market-leading Compaq 386 WITH an 80387 FPU.

A lot of the extra performance of the M1 family has nothing to do with the node; it comes from the fact that the ARM ISA is much more amenable to a lot of optimizations that allow these chips to have a surreally large reordering buffer, which, in turn, keeps more of the execution ports busy at any given time, resulting in a very high IPC. Less silicon used to deal with a complicated ISA also leaves more space for caches, which are easier to manage (remember the more regular instructions), putting less stress on the main memory bus (which is insanely wide here, BTW). On top of that, the M1 family has some instructions that help make JavaScript code faster.
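To make the reordering-buffer point concrete, here is a toy sketch, not a model of the M1 or any real core. It assumes fully independent instructions, in-order retirement, and a made-up stream where every 50th instruction is a 100-cycle load; `make_stream`, `simulate_ipc`, and all the numbers are hypothetical. It only shows that a bigger window lets the core keep dispatching past a slow instruction instead of stalling behind it.

```python
def make_stream(n, miss_every=50, miss_latency=100):
    """Hypothetical workload: 1-cycle ops, with a slow load every `miss_every` ops."""
    return [miss_latency if i % miss_every == 0 else 1 for i in range(n)]


def simulate_ipc(latencies, rob_size, width):
    """Toy core: dispatch and retire up to `width` per cycle, retire in order.

    Assumes every instruction is independent, so the only thing that can
    stall dispatch is a full reorder buffer waiting on its oldest entry.
    """
    n = len(latencies)
    done_at = [0] * n  # cycle at which each dispatched instruction completes
    head = 0           # oldest instruction not yet retired
    tail = 0           # next instruction to dispatch
    cycle = 0
    while head < n:
        # Retire completed instructions from the head, in program order.
        retired = 0
        while head < tail and retired < width and done_at[head] <= cycle:
            head += 1
            retired += 1
        # Dispatch new instructions while the buffer has free entries.
        dispatched = 0
        while tail < n and tail - head < rob_size and dispatched < width:
            done_at[tail] = cycle + latencies[tail]
            tail += 1
            dispatched += 1
        cycle += 1
    return n / cycle


if __name__ == "__main__":
    stream = make_stream(5000)
    for rob in (32, 128, 600):
        print(rob, round(simulate_ipc(stream, rob_size=rob, width=8), 2))
```

In this model the small buffer fills up behind each slow load and the core sits idle until it completes, while the large buffer keeps several loads in flight at once. Real cores also have dependencies, limited ports, and branch mispredictions, all of which this ignores.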

So, assume that Intel and AMD, when they get 5nm designs, will have to use more threads and cores to extract the same level of parallelism that the M1 does with an arm (no pun intended) tied behind its back.



> optimizations that allow these chips to have surreally large reordering buffer

But only Apple's chip has a large reordering buffer. ARM Neoverse V1 / N1 / N2 don't have one, and no one else is doing it.

Apple made a bet and went very wide. I'm not 100% sure if that bet is worth the tradeoffs. I'm certain that if other companies thought that a larger reordering buffer was useful, they'd have done it.

I'll give credit to Apple for deciding that width still had places to grow. But it's a very weird design. Despite all that width, Apple CPUs don't have SMT, so I'd expect that a lot of the performance is "wasted" with idle pipelines, and that SMT would really help out the design.

Like, who makes an 8-wide chip that supports only 1 thread? Apple but... no one else. IBM's 8-wide decode is on an SMT4 chip (4 threads per core).
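The "idle pipelines" worry can be put as back-of-the-envelope arithmetic. This is a rough sketch under a strong assumption: each thread simply offers some number of ready instructions per cycle, and the core issues as many as fit in its width. The per-cycle counts and the `slot_utilization` helper are made up for illustration and are not measurements of any chip.

```python
def slot_utilization(ready_per_cycle, width):
    """Fraction of issue slots filled.

    ready_per_cycle: one list per hardware thread, giving how many
    instructions that thread has ready to issue in each cycle.
    """
    cycles = len(ready_per_cycle[0])
    issued = 0
    for c in range(cycles):
        ready = sum(thread[c] for thread in ready_per_cycle)
        issued += min(ready, width)  # the core cannot issue more than its width
    return issued / (width * cycles)


# Hypothetical ready counts for two threads over six cycles.
thread_a = [3, 1, 4, 0, 2, 5]
thread_b = [2, 4, 0, 3, 1, 2]

if __name__ == "__main__":
    print(slot_utilization([thread_a], width=8))            # one thread alone
    print(slot_utilization([thread_a, thread_b], width=8))  # two threads sharing the core
```

With these invented numbers a second thread fills slots the first one leaves empty. Whether that happens on a real 8-wide core depends on how many slots a single thread already fills, which is exactly what is being argued here.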


SMT is a good way to extract parallelism when your ISA makes it more difficult to do (with speculative execution/register renaming). ARM, it seems, makes it easier, to the point that I don't think any ARM CPU has used multiple threads per core.

I would expect POWER to be more amenable to it, but x86 borrows heavily from the 8085 ISA and was designed at a time when the best IPC you could hope to get was 1.


Minor aside: Arm does, in fact, have a recent CPU family with 2-way SMT: Cortex-A65(AE)/Neoverse E1.



