
Getting ~36-33 tok/s (see the "S_TG t/s" column) on a 24GB Radeon RX 7900 XTX using llama.cpp's Vulkan backend:

    $ llama-server --version
    version: 8851 (e365e658f)

    $ llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    1.529 |   654.11 |    3.470 |    36.89 |    4.999 |   225.67 |
    |  2000 |    128 |    1 |   2128 |    3.064 |   652.75 |    3.498 |    36.59 |    6.562 |   324.30 |
    |  4000 |    128 |    1 |   4128 |    6.180 |   647.29 |    3.535 |    36.21 |    9.715 |   424.92 |
    |  8000 |    128 |    1 |   8128 |   12.477 |   641.16 |    3.582 |    35.73 |   16.059 |   506.12 |
    | 16000 |    128 |    1 |  16128 |   25.849 |   618.98 |    3.667 |    34.91 |   29.516 |   546.42 |
    | 32000 |    128 |    1 |  32128 |   57.201 |   559.43 |    3.825 |    33.47 |   61.026 |   526.47 |

Getting ~44-40 tok/s on 24GB RTX 3090 (llama.cpp version 8884, same llama-batched-bench call):

    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    0.684 |  1462.61 |    2.869 |    44.61 |    3.553 |   317.47 |
    |  2000 |    128 |    1 |   2128 |    1.390 |  1438.84 |    2.868 |    44.64 |    4.258 |   499.80 |
    |  4000 |    128 |    1 |   4128 |    2.791 |  1433.18 |    2.886 |    44.35 |    5.677 |   727.11 |
    |  8000 |    128 |    1 |   8128 |    5.646 |  1416.98 |    2.922 |    43.80 |    8.568 |   948.65 |
    | 16000 |    128 |    1 |  16128 |   11.851 |  1350.10 |    3.007 |    42.57 |   14.857 |  1085.51 |
    | 32000 |    128 |    1 |  32128 |   25.855 |  1237.66 |    3.168 |    40.40 |   29.024 |  1106.96 |

Edit: The model gets stuck in infinite loops at this quantization level. I've also tried Q5_K_M quantization (fits up to a 51968-token context), which seems more robust.

~26-25 tok/s with ROCm using the same card, llama.cpp b8884:

    $ llama-batched-bench -dev ROCm1 -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    1.034 |   966.90 |    4.851 |    26.39 |    5.885 |   191.67 |
    |  2000 |    128 |    1 |   2128 |    2.104 |   950.38 |    4.853 |    26.38 |    6.957 |   305.86 |
    |  4000 |    128 |    1 |   4128 |    4.269 |   937.00 |    4.876 |    26.25 |    9.145 |   451.40 |
    |  8000 |    128 |    1 |   8128 |    8.962 |   892.69 |    4.912 |    26.06 |   13.873 |   585.88 |
    | 16000 |    128 |    1 |  16128 |   19.673 |   813.31 |    4.996 |    25.62 |   24.669 |   653.78 |
    | 32000 |    128 |    1 |  32128 |   46.304 |   691.09 |    5.122 |    24.99 |   51.426 |   624.75 |

Did you try GPU/CPU mix with a bigger model?

Prompt processing is absolutely punishing:

    ./llama-batched-bench -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ4_NL -npp 1000 -ntg 128 -npl 1 --cache-type-k q8_0 --cache-type-v q8_0 -c 18000 --n-cpu-moe 32
    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |   53.961 |    18.53 |    9.223 |    13.88 |   63.184 |    17.85 |
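Those rates make the pain easy to quantify. A quick back-of-the-envelope sketch (the rates are copied from the run above; the larger prompt sizes are illustrative, not measured) shows prefill wall-clock time swamping generation:

```python
# Rough single-request wall-clock estimate using the measured rates
# from the llama-batched-bench run above.
PP_RATE = 18.53   # S_PP t/s (prompt processing) from the table
TG_RATE = 13.88   # S_TG t/s (generation) from the table
NTG = 128         # generated tokens, as in -ntg 128

for prompt_tokens in (1000, 4000, 16000):
    pp_time = prompt_tokens / PP_RATE
    tg_time = NTG / TG_RATE
    print(f"{prompt_tokens:>6} prompt tokens: "
          f"{pp_time:7.1f}s prefill + {tg_time:.1f}s generation "
          f"= {pp_time + tg_time:7.1f}s total")
```

At these rates a 16k-token prompt spends well over ten minutes in prefill before the first output token.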

Hear, hear!

llama.cpp (b8642) auto-fits ~200k context on this 24GB RX 7900 XTX & it shows a solid 100+ tok/s ("S_TG t/s") on the first 32k of it, nice!

    ./llama-batched-bench -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
    -npp 1000,2000,4000,8000,16000,32000,64000,96000,128000 -ntg 128 -npl 1 -c 0
    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    0.416 |  2404.87 |    1.064 |   120.29 |    1.480 |   762.20 |
    |  2000 |    128 |    1 |   2128 |    0.755 |  2649.86 |    1.075 |   119.04 |    1.830 |  1162.83 |
    |  4000 |    128 |    1 |   4128 |    1.501 |  2665.72 |    1.093 |   117.08 |    2.594 |  1591.49 |
    |  8000 |    128 |    1 |   8128 |    3.142 |  2545.85 |    1.114 |   114.87 |    4.257 |  1909.47 |
    | 16000 |    128 |    1 |  16128 |    6.908 |  2316.00 |    1.189 |   107.65 |    8.097 |  1991.73 |
    | 32000 |    128 |    1 |  32128 |   16.382 |  1953.31 |    1.278 |   100.12 |   17.661 |  1819.16 |
    | 64000 |    128 |    1 |  64128 |   43.427 |  1473.74 |    1.453 |    88.12 |   44.879 |  1428.89 |
    | 96000 |    128 |    1 |  96128 |   82.227 |  1167.50 |    1.623 |    78.86 |   83.850 |  1146.42 |
    |128000 |    128 |    1 | 128128 |  133.237 |   960.69 |    1.797 |    71.25 |  135.034 |   948.86 |
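For what it's worth, the S_TG falloff is easy to quantify from that table. A tiny sketch (values copied verbatim from the run above):

```python
# S_TG t/s values copied from the llama-batched-bench table above,
# keyed by prompt (context) length.
s_tg = {1000: 120.29, 32000: 100.12, 128000: 71.25}

base = s_tg[1000]
for n, rate in s_tg.items():
    pct = 100 * rate / base
    print(f"{n:>7} ctx: {rate:6.2f} t/s ({pct:5.1f}% of short-context speed)")
```

So even at the full 128k prompt, generation retains roughly 60% of its short-context speed.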


~50 tok/s on an M1 Max with 64 GB


Oh nice that's pretty good!


Doesn't seem to serve rendered samples so you have to set "browser.display.use_document_fonts" to "1" to see anything useful.


I think it also requires internet access, so you have to enable internet.


Which is the default, so 99.9% of Firefox users (and 99.99% of all users) will not have this issue.


600 GB/s of memory bandwidth isn't anything to sneeze at.

~$1000 for the Pro B70, if Microcenter is to be believed:

https://www.microcenter.com/product/709007/intel-arc-pro-b70...

https://www.microcenter.com/product/708790/asrock-intel-arc-...


Recent kernels have SR-IOV support for these chips too. B&H has them listed for $950.

https://www.bhphotovideo.com/c/product/1959142-REG/intel_33p...

When 32GB NVIDIA cards seem to start at around $4000 that's a big enough gap to be motivating for a bunch of applications.


I'm probably going to snag one of the Intel cards just for the SR-IOV, for use with VMs.


I tried to use SR-IOV to virtualize Mellanox NICs with VLANs on Red Hat Linux. Long story short, it did not work. Per Nvidia, the OS also has to run Open vSwitch. This work was on an already complex setup in finance... so adding Open vSwitch was considered too much additional complexity. This requirement is not something I had run across in the docs.

Anybody know better?


The situation in networking is a lot different than graphics. I don't know much other than that it depends on what specific protocol, card, firmware, and network topology you're using and there's not really generic advice. If the question is setting up Ethernet switching inside the card so VFs can talk to the network, then I think the Linux switchdev tools can configure that on their own without Open vSwitch but you probably need to find someone who understands your specific type of deployment for better advice.
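As a rough illustration of the non-OVS route, here is a hedged sketch for a Mellanox ConnectX-class NIC; the PCI address and interface name are placeholders (find yours with `lspci` / `ip link`), and the exact steps vary by card, firmware, and driver:

```shell
# VFs generally need to exist (and may need their VF drivers unbound)
# before the eswitch mode can be changed.
echo 4 > /sys/class/net/enp3s0f0/device/sriov_numvfs

# Switch the NIC's embedded switch from the default "legacy" mode to
# "switchdev"; this exposes VF representor netdevs that can be managed
# with plain `ip`/`bridge` tooling instead of Open vSwitch.
devlink dev eswitch set pci/0000:03:00.0 mode switchdev
devlink dev eswitch show pci/0000:03:00.0   # verify the mode took effect
```

Whether this suffices for your VLAN setup still depends on the deployment specifics mentioned above.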


Depending what you're doing AMD's support for VirtIO Native Context might be a useful alternative (I think it gives less isolation which could be good or bad depending on use).


I tend to agree that VRAM size and bandwidth are the core thing, but this B70 Pro allegedly has 387 INT8 TOPS vs. the 5090's 3400 INT8 TOPS, and 600 GB/s compares against the 5090's 1792 GB/s. I'm delighted to see an option at a quarter of the price! But man, a tenth of the performance? https://www.techpowerup.com/347721/sparkle-announces-intel-a... https://www.tomshardware.com/pc-components/gpus/nvidia-annou...


838 seems to be the real INT8 TOPS number for the 5090; getting from there to 3400 takes a 2x speedup for sparsity (i.e. skipping ops) and another 2x speedup for FP4 over INT8.

So it's closer to half the speed than a tenth. Intel also seems to be positioning this card against the RTX PRO 4000 Blackwell, not the 5090, and that one gets more like 300 INT8 TOPS. It also has less memory but at a slightly higher bandwidth. The 5090 is much faster and IIRC priced similarly to the PRO 4000, but is also decidedly a consumer product which, especially for Nvidia, comes with limitations (e.g. no server-friendly form factor cards available, and there are or used to be driver license restrictions that prevented using a consumer card in a data center setup).
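If it helps, the headline-number arithmetic is trivial to check (the 838 dense figure and the two 2x factors are as described above; treat this as a sanity check, not vendor-confirmed math):

```python
dense_int8 = 838      # dense INT8 TOPS attributed to the 5090 above
sparsity = 2          # structured sparsity doubles the headline figure
fp4_over_int8 = 2     # halving precision to FP4 doubles it again

headline = dense_int8 * sparsity * fp4_over_int8
print(headline)       # lands near the advertised 3400

b70 = 387             # Arc Pro B70 dense INT8 TOPS from the thread
print(f"B70 vs 5090 (dense): {b70 / dense_int8:.2f}x")
```

Which puts the B70 at roughly half the 5090's dense INT8 throughput, as stated.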


Thank you for the correction. That seemed way too lopsided to be believed. This assessment balances the memory-to-TOPS ratio much more evenly, which is to be expected! I was low-key hoping someone would help me make sense of the wildly disparate figures I was seeing.

To throw one more card into the mix: the AMD R9700 is 378/766 INT8 TOPS dense/sparse, with 644 GB/s to 32 GB of memory, at ~$1400. Intel is undercutting that nicely here.

You're right that for companies, the pro grade matters. For us mere mortals, much less so. Features like SR-IOV, however, are just fantastic to see! Good job, Intel. AMD has been trickling out such capabilities for a decade (cards fused for "MxGPU" capability), and it makes it a much easier buy to just offer it straight up across the models.


Especially for exploratory work, 1/10th the perf is fine. Intel isn't able to compete head-to-head with Nvidia (yet), but VRAM is capability while speed is capacity. There will be plenty of use cases where the value prop here makes sense.


It's more like a 70 class card with extra VRAM.


I think the B65 is priced at $650. Both are supported by llama.cpp, I believe. With that power draw you could run two of them.


Intel GPU prices have stayed fine so far, but I do wonder whether, if they prove viable for inference, they will wind up like Nvidia GPUs: severely overpriced.


I mean it kind of is considering that's comparable to a 5070 which has 672 GB/s? Benefit of NVIDIA being the only one using GDDR7 for now I guess.


7800 XT has 624 GB/s as well, and can be found for $400 used. 16 GB of course.


I've heard ROCm is still a crapshoot though. Is that true?


If you stick with your OS/package manager-distributed version, installation isn't painful anymore (provided that version approximately overlaps with your generation of GPU). It's okay for inference, and okay for training if you don't stray too far beyond plain torch. If you want to run code from a paper or other more esoteric stuff you're still going to have a bad time.

I don't have an Intel dGPU, but I suspect the situation there is even worse. I mean you go to the torch homepage: https://pytorch.org/get-started/locally/ and Intel isn't even mentioned. (It's here though: https://docs.pytorch.org/docs/stable/notes/get_start_xpu.htm...)


The product would be excellent in 2024, but now it's a landfill filler. You can run some small models at pedestrian speed, novelty wears off and that's it.

Intel is not looking to the future. If they released an Arc Pro B70 with 512 GB of base RAM, now that could be interesting.

32GB? Meh.


It's true that it's severely late and missed its market window, but 512 GB just isn't possible.


Not to be confused with GNU parallel[1], written in Perl.

[1]: https://en.wikipedia.org/wiki/GNU_parallel



> A social networking system simulates a user using a language model trained using training data generated from user interactions performed by that user

Google People[1]?

[1]: https://qntm.org/perso


> "A system's purpose is what it does"

POSIWID: https://en.wikipedia.org/wiki/The_purpose_of_a_system_is_wha...



Yes, I don't even know how I didn't know about this at the time of writing the article. But it's a must-read for sure!

