This isn’t off the shelf, but you can rent enormous FPGAs from Amazon. You can dispatch ~6800 DSP slices at 500 MHz for a hot ~1.7 Tu32op/s. Each FPGA can have four DDR4 controllers for you to saturate with memory ops. You can cram ~800 VexRiscv cores in there (possibly more; I’m extrapolating from less-featured FPGA architectures) for a minimum of 220 GIPS, or about 1/7th of a Threadripper 3990X (Dhrystone IPS).
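A quick back-of-envelope check of those numbers (the 2-slices-per-op figure is my assumption, on the theory that a 32-bit multiply spans two DSP48 slices; the per-core rate just divides the stated totals):

```python
# Sanity-check the FPGA throughput figures above.
dsp_slices = 6800
clock_hz = 500e6
slices_per_u32_op = 2  # assumption: one u32 MAC occupies two DSP48 slices

u32_ops_per_s = dsp_slices / slices_per_u32_op * clock_hz
print(f"{u32_ops_per_s / 1e12:.2f} Tu32op/s")  # → 1.70

# Implied per-core rate for the soft-core estimate.
cores = 800
total_ips = 220e9
per_core_gips = total_ips / cores / 1e9
print(f"{per_core_gips:.3f} GIPS per core")  # → 0.275
```

0.275 GIPS per core is plausible for a VexRiscv at a few hundred MHz, which is why the 220 GIPS floor looks conservative.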
I think those vector functional units more than pay for themselves. If you stripped them all out of that Alder Lake die you linked and pooled the reclaimed area, you would get _one_ extra core’s worth of space. Every recent manycore mesh I know of packs in a vector or tensor unit because compute circuits are actually pretty small for the punch they pack! You vastly reduce dispatch overhead, which is the main source of slowdown once you saturate your cache hierarchy. Distributing memory access in a manycore is frustratingly slow.
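The dispatch-overhead point can be made with a toy cost model (every number here is illustrative, not a measurement of any real core):

```python
# Toy model: an N-lane vector instruction pays its fetch/decode/dispatch
# cost once per N elements instead of once per element.
def total_cycles(n_elements, lanes, dispatch_cost, exec_cost):
    """Cycles to process n_elements when each issued instruction
    covers `lanes` elements (ceil-divide for the tail)."""
    instructions = -(-n_elements // lanes)
    return instructions * (dispatch_cost + exec_cost)

n = 1024
scalar = total_cycles(n, lanes=1, dispatch_cost=3, exec_cost=1)
vector = total_cycles(n, lanes=16, dispatch_cost=3, exec_cost=1)
print(scalar / vector)  # → 16.0
```

The speedup tracks the lane count almost exactly because the fixed per-instruction cost dominates; that is the sense in which the vector unit “pays for itself” in die area.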