
400GB/s is very high for CPU bandwidth, but it is less than the 760GB/s of NVidia's RTX 3080. Assuming you care about 32-bit, of course.

I don't expect the M1 Pro to have very good double-precision GPU performance.



It's pretty remarkable that now we're not only comparing Apple's SoC to the best CPUs from dedicated makers, we're comparing it to the best GPUs.

Could you qualify what you mean regarding double precision, though? nvidia consumer GPUs have pretty terrible double precision (usually in the range of 1/64th single precision). And FWIW, the normal cores in the M1 (Max|Pro) have fantastic double precision performance, and comprise the bulk of the SPECfp dominance.
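As a rough illustration of how small that ratio makes the FP64 numbers (a minimal sketch; the FP32 figure is an assumed ballpark for a desktop RTX 3080, not a number from this thread):

    # Back-of-envelope FP64 throughput at the ~1/64 FP64:FP32 ratio typical
    # of recent Nvidia consumer GPUs. The FP32 number is an assumed ballpark.
    fp32_tflops = 30.0          # assumed RTX 3080-class FP32 throughput
    fp64_ratio = 1 / 64         # consumer-silicon FP64:FP32 ratio
    fp64_tflops = fp32_tflops * fp64_ratio
    print(f"FP64: ~{fp64_tflops * 1000:.0f} GFLOPS")  # ~469 GFLOPS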


> It's pretty remarkable that now we're not only comparing Apple's SoC to the best CPUs from dedicated makers, we're comparing it to the best GPUs.

Is it? Apple has 5nm on lockdown right now. Process is nearly everything in performance/watt.

If you want to compare architectures, you compare them on the same process. 5nm vs 5nm is the only fair comparison; a 5nm part is going to be roughly 2x more power efficient than a 7nm part from the process alone.

When every transistor uses 1/2 the power at the same speed, of course you're going to have a performance/watt advantage. That's almost... not a surprise at all. It is this process advantage that Intel wielded for so long over its rivals.

Now that TSMC owns the process advantage, and now that Apple is the only one rich enough to get "first dibs" on the leading node, it's no surprise to me that Apple has the most power-efficient chips. If anything, it shows off how efficient the 7nm designs are that they can compete against a 5nm design.
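A minimal sketch of that argument using the standard CMOS dynamic-power relation P ≈ α·C·V²·f (the capacitance and voltage factors below are illustrative assumptions, not TSMC figures):

    # Dynamic power of switching logic: P = alpha * C * V^2 * f.
    # If a node shrink cuts switched capacitance and supply voltage,
    # per-transistor power at the same clock drops multiplicatively.
    def dynamic_power(alpha, c_farads, v_volts, f_hz):
        return alpha * c_farads * v_volts**2 * f_hz

    old = dynamic_power(0.1, 1.0e-15, 0.90, 3.0e9)  # illustrative "7nm-class" transistor
    new = dynamic_power(0.1, 0.7e-15, 0.75, 3.0e9)  # illustrative "5nm-class" transistor
    print(f"per-transistor power ratio: {new / old:.2f}")  # ~0.49, i.e. roughly half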


> Apple has 5nm on lockdown right now

Qualcomm has loads of 5nm chips. They're pretty solidly beaten by Apple's entrants, but they've been using them for over a year now. Huawei, Marvell, Samsung and others have 5nm products too.

This notion that Apple just bullied everyone out of 5nm is not backed by fact. For that matter, Apple's efficiency holds even at the same node.

There is this weird thing where some demand that we put an asterisk on everything Apple does. I remember the whole "sure it's faster but that's just because of a big cache" (as if that negated the whole faster / more efficient thing, or as if competing makers were somehow forbidden from using larger caches so it was all so unfair). Now it's all waved away as just a node advantage, when any analysis at all reveals that to be nonsensical.


> This notion that Apple just bullied everyone out of 5nm is not backed by fact.

In the context of laptops, it's true. Neither Intel nor AMD has chips being built on TSMC N5 or a comparable process. AMD is on TSMC N7, and Intel is currently on its own 10nm process, moving to "Intel 7" with Alder Lake, which is getting formally introduced in 2 days.


"In the context of laptops, its true"

Intel wasn't in competition for TSMC's processes at all, and AMD was in absolutely no hurry to move to 5nm (especially given that they were targeting cost effectiveness). The fact that Apple readied a 5nm design and decided it was worth it for their customers in no way indicates that they "bullied" their way to the front.

Quite the contrary: for years Intel made their mobile / "low power" parts on some of their older processes. They were low-profit parts, and Intel saved the best process for its high-end Xeons and so on (where the process benefit was entirely spent on speed -- note that there is a lot of BS about the benefit of process nodes, where people claim ridiculous gains when in reality you can have a modest efficiency improvement, or a modest performance improvement, but not both. The biggest real benefit is that you can pack more onto a given silicon area: in Apple's case, loads of cores, a fat GPU, big caches, etc.). If Apple upset their business model, well, tough beans for them.

As an aside, note that the other initial customer of 5nm was HiSilicon (a subsidiary of Huawei) with the Kirin 9000. That's a pretty sad day when AMD and Intel are supposedly sad also-rans to Huawei. Or, more reality based, they simply weren't even in competition for that space, had zero 5nm designs ready, and didn't prioritize the process.


Well... Intel not having 5nm is entirely Intel's fault. They used process to their advantage and, well, when they messed up their process cadence, the advantage evaporated.

AMD could, but they seem to be very happy where they are. They also have to decide on which fronts they want to outcompete Intel and, it seems, process isn't one of them.


> Qualcomm has loads of 5nm chips.

I think we all know that TSMC 5nm is quite a bit better than Samsung 5nm.

Samsung is "budget" 5nm. It ain't as good as the best-of-the-best that Apple is buying here.


> Process is nearly everything in performance/watt.

> TSMC 5nm is quite a bit better than Samsung 5nm.

These two statements conflict.


> These two statements conflict.

TSMC 5nm is not the same process as Samsung 5nm though?

All the processes are the company's secret sauce. They aren't sharing the details. Ultimately, Samsung comes out and says "5nm technology", but that doesn't mean it's necessarily competitive with TSMC 5nm.

Indeed, Intel 10nm is somewhat competitive against TSMC 7nm. The specific "nm" is largely a marketing thing at this point... and Intel is going through a rebranding effort. (Don't get me wrong: Intel is still far behind because it tripped up in 2016. But the Intel 14nm process was the best in the world in its timeframe.)


You can compare transistor count per mm^2 instead of nanometers.


But you compare power-efficiency by how efficient each transistor is.

TSMC N5P is 10% more power efficient and clocks 5% higher than TSMC N5. The same 5nm __BY THE SAME COMPANY__ can change 15.5% in just a year, as manufacturing issues are figured out.

Making every transistor use 10% less power and run 5% faster across the entire chip, while keeping the same size, is a huge bonus that cannot be ignored. I don't know what magic these chip engineers are doing, but they're surely throwing supercomputer time at brute-forcing all sorts of transistor shapes and sizes to find the best density/power/speed tradeoffs per transistor.
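For what it's worth, the 15.5% figure appears to be just the two vendor numbers compounded (a naive combination; real workloads won't split this cleanly):

    # 10% better power efficiency compounded with 5% higher clocks.
    efficiency_gain = 1.10
    clock_gain = 1.05
    combined = efficiency_gain * clock_gain - 1
    print(f"combined generational gain: {combined:.1%}")  # 15.5%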

This is part of the reason why Intel stuck with 14nm for so long. 14nm+++++++ kept increasing clock speeds, yields, and power efficiency (but not density), so it never really was "worth it" for Intel to switch to 10nm (which Intel had some customer silicon taped out on for years, but only at low clock speeds IIRC).

Only recently does Intel seem to have figured out the clock speed issue and begun offering mainstream chips at 10nm.



> Process is nearly everything in performance/watt.

Not really. Apple's A15 and A14 phone chips are on the same process node.

> Apple A15 performance cores are extremely impressive here – usually increases in performance always come with some sort of deficit in efficiency, or at least flat efficiency. Apple here instead has managed to reduce power whilst increasing performance, meaning energy efficiency is improved by 17%

> The efficiency cores of the A15 have also seen massive gains, this time around with Apple mostly investing them back into performance, with the new cores showcasing +23-28% absolute performance improvements

https://www.anandtech.com/print/16983/the-apple-a15-soc-perf...


> Not really. Apple's A15 and A14 phone chips are on the same process node.

Yeah, you're talking about 20% performance changes on the same node.

Meanwhile, advancing from TSMC 7nm to 5nm is something like 45% better density (aka 45% more transistors per mm^2) and 50% to 100% better power efficiency at the same performance levels, closer to the 100% end if you're looking at idle / near-zero-GHz operation. (Pushing to 3GHz shows less of a power difference, but lower idle power does make a sizable contribution in practice.)

-----

Oh right: and TSMC N5P is a 10% power reduction and a 5% speed improvement over TSMC N5 (aka what TSMC figured out in a year). There's the bulk of your 17% difference between the A15 and A14.

Yeah, process matters. A LOT.
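To put the two scales side by side (the inputs are only the ballpark figures quoted in this thread):

    # Same-node refinement (N5 -> N5P) vs. a full node jump (N7 -> N5).
    same_node_gain = 0.155                      # the ~15.5% N5 -> N5P figure above
    node_jump_low, node_jump_high = 0.50, 1.00  # the 50-100% perf/W figure above
    print(f"N5 -> N5P refinement: ~{same_node_gain:.1%}")
    print(f"N7 -> N5 node jump:   ~{node_jump_low:.0%} to ~{node_jump_high:.0%}")
    print(f"refinement as a share of the smaller node-jump estimate: {same_node_gain / node_jump_low:.0%}")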


Are you saying that if another company, say AMD, had access to TSMC's 5nm process, then it would easily achieve comparable performance/watt to what Apple has done with the M1 series?


I'm saying that 15.5 points of the 17% difference from the Apple A14 to the Apple A15 are accounted for by the TSMC N5 to TSMC N5P upgrade (aka 10% fewer watts at 5% higher clock rates).

The bulk of efficiency gains has come from, and for the foreseeable future will come from, the efficiency of the underlying manufacturing process itself.

There's still a difference in efficiency above and beyond the 15.5% from the A14 to the A15. But it's a small fraction of what the __process__ has given.

---------

Traditionally, AMD never was known for very efficient designs. AMD is better known for "libraries" and a more plug-and-play style of chip-making. AMD often can switch to different nodes faster and play around with modular parts (see the Zen "chiplets"). I'd expect AMD to come around with some kind of chiplet strategy (or something along those lines) before I expect them in particular to take the efficiency crown.

NVidia probably would be better at getting high-efficiency designs. They're on a weaker 8nm Samsung process yet still have extremely good power/efficiency curves.

I like AMD's chiplet strategy though, as a business and as a customer. It's a bit of a softer benefit, and AMD clearly has made the "Infinity Fabric" more efficient than anyone expected it could get.


Zen4 will be going head-to-head with Apple A16 on N5P next year and it's pretty doubtful we see Zen4 come out ahead on perf/watt let alone IPC.

It's not all node advantage - Apple designed a much wider core than is feasible with x86. They have an insanely wide reorder buffer and many execution units, and can decode more instructions to keep it all fed. Even with the node shrink allowing you to throw more transistors at it, x86 poses obstacles to using the same approaches as Apple, and they've exhausted most of their own approaches.


> Process is nearly everything in performance/watt.

ARM has consistently beaten x86 in performance/watt at larger node sizes since the beginning. The first Archimedes had better floating-point performance without a dedicated FPU than the then market-leading Compaq 386 WITH an 80387 FPU.

A lot of the extra performance of the M1 family has nothing to do with the node, but with the fact that the ARM ISA is much more amenable to optimizations that allow these chips to have a surreally large reorder buffer, which in turn keeps more of the execution ports busy at any given time, resulting in very high IPC. Less silicon spent dealing with a complicated ISA also leaves more space for caches, which are easier to manage (remember the more regular instructions), putting less stress on the main memory bus (which is insanely wide here, BTW). On top of that, the M1 family has some instructions that help make JavaScript code faster.

So, assume that Intel and AMD, when they get 5nm designs, will have to use more threads and cores to extract the same level of parallelism that the M1 does with an arm (no pun intended) tied behind its back.


> optimizations that allow these chips to have surreally large reordering buffer

But only Apple's chips have a reorder buffer that large. ARM's Neoverse V1 / N1 / N2 don't, and no one else is doing it.

Apple made a bet and went very wide. I'm not 100% sure if that bet is worth the tradeoffs. I'm certain that if other companies thought that a larger reordering buffer was useful, they'd have done it.

I'll give credit to Apple for deciding that width still had room to grow. But it's a very unusual design. Despite all that width, Apple's CPUs don't have SMT, so I'd expect that a lot of the potential performance is "wasted" on idle pipelines, and that SMT would really help the design out.

Like, who makes an 8-wide chip that supports only 1 thread? Apple, but... no one else. IBM's 8-wide decode is on an SMT4 chip (4 threads per core).


SMT is a good way to extract parallelism when your ISA makes it more difficult to do via speculative execution and register renaming. ARM, it seems, makes that easier, to the point that I don't think any ARM CPU has been using multiple threads per core.

I would expect POWER to be more amenable to it, but x86 borrows heavily from the 8085 ISA and was designed at a time when the best IPC you could hope for was 1.


Minor aside: Arm does, in fact, have a recent CPU family with 2-way SMT: Cortex-A65(AE)/Neoverse E1.


> I don't expect the M1 Pro to have very good double-precision GPU-speeds.

Compared to what? There are no laptops quite like these new Apple laptops. Anything with faster graphics also uses LOADS more power and runs WAY hotter.


> Compared to what? There are no laptops quite like these new Apple laptops. Anything with faster graphics also uses LOADS more power and runs WAY hotter.

Using 2x the power for 2x the bandwidth (on top of significantly more compute power) is a good tradeoff, when the NVidia chip is 8nm Samsung vs Apple's 5nm TSMC.

In any case, actual video game performance is much, much worse on the M1 Pro. The benchmarks show that the chip has potential, but games need to come to the platform before Apple can decisively claim a victory.


> the actual video game performance is much much worse on the M1 Pro

Well, no. The emulated x86 gaming performance is.

They didn't test a game with a native version.


> They didn't test a game with a native version.

If the native version doesn't exist then... gamers don't care?

Gotta get those games ported over


> If the native version doesn't exist then... gamers don't care?

I don't think it's a fair assessment of the machine's capabilities. Also, games WILL be ported to the platform, AND if you really need your games running at full speed, you can keep your current computer and postpone the purchase of your Mac until the games you need are available.


No.

Next-generation games will be made on the platform. Current-generation and last-generation games no longer have much support / developers, and no sane company will spend precious developer time porting over a year-old or 5-year-old game to a new platform in the hopes of a slim set of sales. (Except maybe Skyrim. Apparently those ports keep making money)

Your typical game studio doesn't work on Skyrim though. They put in a bunch of developer work into a game, then by the time the game is released, all the developers are on a new project.


Have you seen how terrible the x86 emulated performance is on a Surface Pro X?

https://www.youtube.com/watch?v=OhESSZIXvCA


And that's why gamers are buying the Surface Book instead?

The "gamer" community (or really, community-of-communities) only cares if their particular game runs quickly on a particular platform.

Gamers don't really care about the advanced technology details, beyond the underlying question: which system will run my game faster, with higher-quality images (4K / raytracing / etc.)?


No, that's why having x86 emulation performance be this good is a minor miracle.

Native performance would be expected to be in line with what the benchmarks are showing.

The MacBook Pro with the M1 Max would beat the 100-watt mobile variant of the 3080, especially if you unplug both laptops from the wall, where the 3080 has to throttle down and the MacBook does not.


> No, that's why having x86 emulation performance be this good is a minor miracle.

No gamer is going to pay $3000+ for a laptop with emulation when $2000+ gamer laptops are faster at the task (aka: video games are faster on the $2000 laptop).

------

Look, gamers don't care about all games. They only care about the one or two games that they play. If you want to attract Call of Duty players, you need to port Call of Duty over to the Mac natively, so that the game actually runs faster on the system.

It doesn't need to be an all-or-nothing deal. Emulation is probably good enough for casuals / non-gamers who maybe put in 20 hours or less into any particular game. But anyone putting 100-hours or more into a game will probably want the better experience.


> No gamer is going to pay $3000+ for a laptop with emulation

They pay $3000 for a laptop whose fans hit 55 decibels at load and that has to throttle down to well below the MacBook's performance if you use it like a laptop and go somewhere without a power outlet.

https://www.anandtech.com/show/16928/the-msi-ge76-raider-rev...


The Mac doesn't even do raytracing, does it? So you're already looking at a sizable quality downgrade over AMD, NVidia, PS5, and XBox Series X.

I think the eSports gamers will prefer FPS over graphical fidelity, so maybe that's the target audience for this chip ironically.

But adventure gamers who want to explore raytraced worlds / prettier games will prefer the cards with raytracing, better shadows, etc. etc. (See the Minecraft RTX demo for instance: https://www.youtube.com/watch?v=1bb7wKIHpgY)



Look, my Vega64 raytraces all the time when I hit the "Render" button in Blender.

But video-game raytracing is about hardware-dedicated raytracing units. Software (even GPU-compute rendering) is an order of magnitude slower. It's still useful to implement, but what the PS5 / XBox Series X / AMD / NVidia have implemented are specific raytracing cores (or in AMD's case, raytracing instructions) that traverse a BVH tree and accelerate the raytracing process.
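For context, here's a stripped-down, hypothetical sketch of what that BVH traversal looks like when done in software (the names and structure are illustrative, not any vendor's API); dedicated RT cores implement essentially this box/triangle loop in fixed-function hardware:

    # Minimal BVH traversal: walk a bounding-volume hierarchy and run the
    # exact ray/triangle test only in leaves whose boxes the ray hits.
    from dataclasses import dataclass, field

    @dataclass
    class BVHNode:
        box_min: tuple                                   # AABB corner (x, y, z)
        box_max: tuple
        children: list = field(default_factory=list)     # empty for leaf nodes
        triangles: list = field(default_factory=list)    # populated only in leaves

    def ray_hits_box(origin, inv_dir, box_min, box_max):
        """Slab test: does the ray intersect this axis-aligned box?"""
        t_near, t_far = 0.0, float("inf")
        for o, inv_d, lo, hi in zip(origin, inv_dir, box_min, box_max):
            t0, t1 = (lo - o) * inv_d, (hi - o) * inv_d
            t_near = max(t_near, min(t0, t1))
            t_far = min(t_far, max(t0, t1))
        return t_near <= t_far

    def traverse(root, origin, direction, hit_triangle):
        """Depth-first walk, calling hit_triangle() only for candidate leaves."""
        inv_dir = tuple(1.0 / d if d != 0.0 else 1e30 for d in direction)
        stack = [root]
        while stack:
            node = stack.pop()
            if not ray_hits_box(origin, inv_dir, node.box_min, node.box_max):
                continue
            if node.children:
                stack.extend(node.children)
            else:
                for tri in node.triangles:
                    hit_triangle(tri)  # exact ray/triangle intersection goes here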

"Can do Raytracing" or "Has an API for GPU-software that does raytracing" is just not the same as "we built a raytracing core into this new GPU". I'm sure Apple is working on their raytracing cores but I haven't seen anything yet that suggests that its ready yet.


> the actual video game performance is much much worse on the M1 Pro

This is a workstation. For games one should look for a Playstation ;-)

2x the power also means half the battery life. Remember this is a portable computer that's thin and light beyond what would be reasonable considering its performance. Also, remember the GPU has full 400GB/s access to all of the RAM, which means models of up to 64GB won't need to pass over the PCIe bus.
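A rough back-of-envelope on why that matters (the PCIe number is an assumed PCIe 4.0 x16 ballpark, not a figure from this thread):

    # Copying a large working set to a discrete GPU over PCIe vs. the GPU
    # reading it in place from unified memory at the quoted 400GB/s.
    working_set_gb = 48          # e.g. a dataset that fits in 64GB unified memory
    pcie_gbps = 32               # assumed PCIe 4.0 x16 peak, roughly
    unified_gbps = 400           # quoted unified-memory bandwidth

    print(f"copy over PCIe:  ~{working_set_gb / pcie_gbps:.1f} s per full transfer")
    print(f"read in place:   ~{working_set_gb / unified_gbps:.2f} s per full pass")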


(The GPU probably can't saturate the 400GB/s.)


A 3080 Mobile only has a bandwidth of 448 GB/s. You quoted the desktop number (which also has a different die).


3080 Mobile is a confusing brand name. It also uses GA104 rather than GA102, so it isn't totally equivalent to the desktop 3080.



