I know one can rent consumer GPUs on the internet, where people like you and me ...

qeternity · on April 13, 2025

> Would this even be possible to do, given that the latency between nodes is very high and the bandwidth limited?

Yes, it's possible. But no, it would not be remotely sensible given the performance implications. There is a reason why Nvidia is a multi trillion dollar company, and it's as much about networking as it is about GPUs.

kmeisthax · on April 13, 2025

Back in the early days of AI art, before AI became way too cringe to think about, I wondered about this exact thing[0]. The problem I learned later is that most AI training (and inference) is not dependent so much on the GPU compute, but on memory bandwidth and communication. A huge chunk of AI training is just figuring out how to minimize or hide the bottleneck the inter-GPU interconnect imposes so you can scale to multiple cards.

The BOINC model of distributed computing is to separate everything into little work units that can be sent out to multiple machines who then return a result that can be integrated back into the whole. If you were to train foundation models this way, you'd be packaging up the current model state n and a certain amount of trainset items into a work unit, and the result would be model weight offsets to be added back into model state n+1. But you wouldn't be able to benefit from any of the gradients calculated by other users until they submitted their work units and n+1 got calculated. So there'd be a lot of redundant work and training progress would slow down, versus a closely-coupled set of GPUs where they have enough bandwidth to exchange gradients every batch.

For the record, I never actually built a distributed training cluster. But when I learned what AI actually wants to go fast, I realized distributed training probably couldn't work over just renting big GPUs.

Most people do not have GPUs with enough RAM to do meaningful AI work. Generative AI models work autoregressively: that is, all of their weights are repeatedly used in a tight loop. In order for a GPU to provide a meaningful speedup it needs to have the whole model in GPU memory, because PCIe is slow (high latency) and also slow (low bandwidth). Nvidia knows this and that's why they are very stingy on GPU VRAM. Furthermore, training a model takes more memory than merely running it; I believe gradients are something like the number of weights times your batch size in terms of memory usage. There's two ways I could see around this, both of which are going to cause further problems:

- You could make 'mini' workunits where certain specific layers of the model are frozen and do not generate gradients. So you'd only train, say, 10% of the model at any one time. This is how you train very large models in centralized computing; you put a slice of the model on each GPU and exchange activations and gradients each batch. But we're on a distributed computer, so we don't have that kind of tight coupling, and we converge slower or not at all if we do this.

- You can change the model architecture to load specific chunks of weights at each layer, with another neural network to decide what chunks to load for each token. This is known as a "Mixture of Experts" model and it's the most efficient way we know of to stream weights in and out of a GPU, but training has to be aware of it and you can't change the size of the chunks to fit the current GPU. MoE lets a model have access to a lot of weights, but the scaling is worse. e.g. an 8x44B parameter MoE model is NOT equivalent to a 352B non-MoE model. It also causes problems with training that you have to solve for: very common bits of knowledge will be replicated across chunks, and certain chunks can become favored by the model because they're getting more gradients, which causes them to be favored more, so they get more gradients.

[0] My specific goal was to train a txt2img model purely on public domain Wikimedia Commons data, which failed for different reasons having to do with the fact that most of AI is just dataset sorting.