dist-epoch 3 months ago | on: Big GPUs don't need big PCs

After you load the weights into the GPU and keep the KV cache there too, you don't need any other significant traffic.

numpad0 3 months ago

Even in tensor-parallel modes? I thought it could only work if you're fine with stalling all but n GPUs for n users at any given moment.
