The earliest Intel chips (the 8086) had an optional add-on chip (the 8087 coprocessor) to do floating-point math in hardware instead of software. Likewise, a lot of "AI" workloads boil down to matrix multiplication, which the Google TPU and Nvidia's tensor cores implement in hardware instead of software.
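To make that concrete, here's a minimal sketch (in plain Python, purely for illustration) of the operation those accelerators implement in silicon: a matrix multiply is nothing but a pile of multiply-accumulates, and the count grows fast with matrix size.

```python
def matmul(a, b):
    """Naive matrix multiply: one multiply-add per inner-loop step."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    ops = 0
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]  # the multiply-accumulate
                ops += 1
    return c, ops

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c, ops = matmul(a, b)
print(c)    # [[19.0, 22.0], [43.0, 50.0]]
print(ops)  # 8 multiply-adds even for 2x2 -- n*m*k in general
```

A TPU or tensor core does a whole tile of those multiply-adds in a single hardware operation, instead of one at a time in a software loop.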
Exactly, like how a "bitcoin mining" chip implements SHA-256 in hardware.
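For reference, the computation a mining ASIC bakes into silicon is double SHA-256 over a candidate block header, repeated with different nonces until the result falls below a target. A toy sketch (the function names and the tiny difficulty are illustrative, not real mining parameters):

```python
import hashlib

def double_sha256(data: bytes) -> bytes:
    """Bitcoin's proof-of-work hash: SHA-256 applied twice."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def mine(header: bytes, difficulty_bits: int) -> int:
    """Try nonces until the hash has `difficulty_bits` leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        h = double_sha256(header + nonce.to_bytes(8, "little"))
        if int.from_bytes(h, "big") < target:
            return nonce
        nonce += 1

# Tiny difficulty so this finishes instantly; real mining uses far more bits.
nonce = mine(b"example block header", 8)
print(nonce)
```

An ASIC does nothing software can't do; it just performs that inner double-hash loop billions of times per second in dedicated circuits.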
And the memory tradeoffs differ too: CPUs hide latency with layers of caches, while GPUs hide it by keeping huge numbers of threads in flight and leaning on raw memory bandwidth, so the two architectures make different bets on bandwidth versus latency.
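One standard back-of-envelope way to see why those bets matter is "arithmetic intensity": floating-point ops per byte moved from memory. The numbers below are the usual textbook accounting (4-byte floats, counting a multiply and an add as two ops), not measurements.

```python
def intensity(flops: int, bytes_moved: int) -> float:
    """Arithmetic intensity: useful ops per byte of memory traffic."""
    return flops / bytes_moved

n = 1024
# n x n matmul: 2*n^3 ops, moving three n x n float32 matrices.
matmul_ai = intensity(2 * n**3, 3 * n**2 * 4)
# Elementwise add of two n x n matrices: n^2 ops, same three matrices moved.
add_ai = intensity(n**2, 3 * n**2 * 4)

print(round(matmul_ai, 1))  # ~170.7 flops/byte: plenty of work per byte
print(round(add_ai, 3))     # ~0.083 flops/byte: starved for bandwidth
```

High-intensity work like matmul rewards piling on compute units (the GPU/TPU bet); low-intensity work is limited by memory no matter how many cores you add, which is where caches and bandwidth dominate.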
Thank you. Think we've hit the level of my "run away scared at the first sight of machine code" understanding, but I now vaguely understand what's going on.