
The scale they are quoting, 100,000-chip clusters and 65 exaflops, seems impossible. At 800W per chip, that's 80MW of power! Unless they literally built an entire DC of these things, nobody is training anything on the entire cluster at once. It's probably 10-20 separate datacenters being combined for marketing reasons here.
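A quick back-of-the-envelope in Python; the 800W per chip and the ~1.2 PUE overhead are assumptions for illustration, not figures from the announcement:

    # Back-of-the-envelope cluster power. 800 W per chip and a PUE of ~1.2
    # are assumptions for illustration, not published figures.
    chips = 100_000
    watts_per_chip = 800            # assumed accelerator power draw
    pue = 1.2                       # assumed facility overhead (cooling etc.)

    it_load_mw = chips * watts_per_chip / 1e6
    facility_mw = it_load_mw * pue
    print(f"IT load: {it_load_mw:.0f} MW, facility draw: ~{facility_mw:.0f} MW")
    # IT load: 80 MW, facility draw: ~96 MW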


What makes you think it's 800W per chip?


It's about what I thought the H100 was; it's actually 700W. But even at, say, 400W, that's 40MW of power. I guess some datacenters are built in the 40-100MW range, from some quick googling, but I really doubt they can actually network 100,000 chips together in any sort of performant way; that's supercomputer-level interconnect. I don't think most datacenters support the kind of highly interlinked network fabric this would need either.
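To see how sensitive that total is to the per-chip figure, a tiny Python sketch (all the wattages below are guesses for illustration, not published specs):

    # How the cluster total scales with the assumed per-chip wattage.
    # All wattages are guesses for illustration, not published specs.
    chips = 100_000
    for watts in (200, 400, 700, 800):
        print(f"{watts} W/chip -> {chips * watts / 1e6:.0f} MW")
    # 200 W/chip -> 20 MW
    # 400 W/chip -> 40 MW
    # 700 W/chip -> 70 MW
    # 800 W/chip -> 80 MW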


They have instances with 16 chips, so I presume there are at least 16 chips per server. I'd also expect the power consumption to be more like 100-200W, given they seem more like Google's TPUs than an H100.

For the interconnect, I doubt this is their typical fabric, but it doesn't seem completely unreasonable. Even when not running massive clusters, they'll still need the interconnect to pair up the random collections of machines that people are using.


I don’t know - apparently they are watercooling this gen: https://www.servethehome.com/aws-graviton4-is-an-even-bigger...

You don't typically watercool 200W chips, and you can in theory air cool 8x 800W Nvidia H100s in a single system. These are also 4-5U systems!

16 chips in one node would be ambitious; I would expect the 16-chip offering to really be several closely located nodes in the same rack or nearby.


> 16 chips in one node would be ambitious; I would expect the 16-chip offering to really be several closely located nodes in the same rack or nearby.

I'd expect it to be like Google's TPUs, which have 4 "chips" in a "pod"; attaching 4 of these pods to a single system doesn't seem unreasonable.

Looking at the corresponding CPU and RAM of the available instance types, it looks like they're using 32-core CPUs in dual-socket systems.


Yeah that seems like a likely setup to me!


Per server? Our dual-CPU Intel servers take about 800-900W at full power.


Per chip - they are watercooling these babies: https://www.servethehome.com/aws-graviton4-is-an-even-bigger...

(Just like we had to do at Microsoft for Maia, with the sidekick rack of just cooling: https://www.datacenterfrontier.com/machine-learning/article/...)


Well, think of it this way -- individual 1U servers can easily consume 1000W, or 1kW. Put about forty of those in a single rack, and that's 40kW. Divide 80MW for the datacenter by 40kW per rack and that's not very many racks to comprise the entire datacenter, right?
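Spelled out in Python, with the 1kW-per-1U and 40-servers-per-rack numbers taken as illustrative assumptions rather than vendor figures:

    # Rough rack math. 1 kW per 1U server and ~40 servers per rack are
    # illustrative assumptions, not vendor figures.
    watts_per_server = 1_000
    servers_per_rack = 40
    datacenter_mw = 80

    rack_kw = watts_per_server * servers_per_rack / 1_000   # 40 kW per rack
    racks = datacenter_mw * 1_000 / rack_kw                 # 80 MW / 40 kW
    print(f"{rack_kw:.0f} kW per rack -> {racks:.0f} racks for {datacenter_mw} MW")
    # 40 kW per rack -> 2000 racks for 80 MW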


That’s still 2000 racks, that’s not nothing?


No, 2000 racks is not nothing.

But I would say that's a pretty small datacenter, wouldn't you?


The footprint for 2000 racks would be over 1000 m2; when you add the necessary spacing as well as supplementary utilities (power/networking), that probably means double that footprint.
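Roughly, assuming a standard rack footprint of about 0.6m x 1.0m and a 2x multiplier for aisles and utilities (both assumptions for illustration):

    # Rough floor-space estimate. The ~0.6 m x 1.0 m rack footprint and the
    # 2x aisles/utilities multiplier are assumptions for illustration.
    racks = 2_000
    rack_footprint_m2 = 0.6 * 1.0   # assumed per-rack footprint
    overhead = 2.0                  # assumed aisles/power/networking multiplier

    raw_m2 = racks * rack_footprint_m2
    total_m2 = raw_m2 * overhead
    print(f"Raw rack area: {raw_m2:.0f} m2, with spacing/utilities: ~{total_m2:.0f} m2")
    # Raw rack area: 1200 m2, with spacing/utilities: ~2400 m2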

I guess at the scale those companies are operating at, it's not that big, but that's still quite a large building!


I actually have no idea to be honest!



