zozbot234's comments | Hacker News

Nothing obviously prevents using this approach, e.g. for 3B-active or 10B-active models, which do run on consumer hardware. I'd love to see how the 3B performs with this on the MacBook Neo, for example. More relevantly, data-center scale tokens are only cheaper for the specific type of tokens data centers sell. If you're willing to wait long enough for your inferences (and your overall volume is low enough that you can afford this) you can use approaches like OP's (offloading read-only data to storage) to handle inference on low-performing, slow "edge" devices.

It's less of a "performance falls off a cliff" problem and more of a "once you offload to RAM/storage, your bottleneck is the RAM/storage and basically everything else no longer matters". This means if you know you're going to be relying on heavy offload, you stop optimizing for e.g. lots of VRAM and GPU compute since that doesn't matter. That saves resources that you can use for scaling out.

It depends on the model and the mix. For some recent MoE models it’s been reasonably fast to offload part of the processing to the CPU. The GPU's speed still contributes a lot, as long as it handles a large enough share of the overall compute.

No appreciable wear of course, but read disturb (requiring occasional rewrites) becomes more of an issue as NAND fabrication advances.

SSD bandwidth will ultimately be limited by the amount of PCIe lanes you have available (for something other than the Apple Silicon internal storage). So the approach has inherent limitations. You can of course scale out to multiple systems to get more throughput.

You can use this approach with Intel Optane, which is wearout-resistant unlike NAND and can thus substitute for RAM. Last I checked, it was available quite cheap on the secondary market, ~$1/GB as opposed to ~$15/GB or more for DRAM. (Of course that's nowhere near as cheap as NAND, which is around ~$0.1/GB but quite wearout-prone with heavy writes.)


Yeah, PCIe is the bottleneck. The point being that since the data has to cross PCIe either way, you can't get it to the GPU any faster from RAM than from NVMe or Optane.

Meanwhile PCIe switches exist. So why not build:

1 CPU + memory + ...

N PCIe switches, each with 1 low-memory GPU + 6 NVMe drives (in theory 5 can saturate the GPU)

Each of those should only bother the CPU when they have some tokens produced and have plenty of PCIe lanes to get at their data.

Such a setup should be able to get a 6 to 8 times speedup over the solution detailed here, and an increase in model compute should make relatively little difference in performance.
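A rough back-of-envelope for the drive counts above (my numbers, assumed, not benchmarks: ~5.5 GB/s real-world sequential read per PCIe 4.0 x4 NVMe drive, ~28 GB/s usable on the x16 uplink behind the switch):

```python
# Back-of-envelope only; both figures below are assumptions, not measurements.
nvme_gbps = 5.5          # real-world sequential read per PCIe 4.0 x4 drive
gpu_uplink_gbps = 28.0   # usable bandwidth on the x16 link behind the switch

drives_to_saturate = gpu_uplink_gbps / nvme_gbps
print(round(drives_to_saturate, 1))  # ~5 drives saturate the uplink; a 6th adds headroom
```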


The GitHub page mentions that you can't overlap SSD traffic and GPU compute on Apple Silicon; you get heavy contention for the shared hardware resources.

Loading experts to system memory is supported by most local-AI frameworks. But you do not gain much by running that part of the decode on GPU, since decode is not compute-limited and the CPU-GPU transfer involves overhead. It's best to use the GPU for speeding up the shared part of the model.

The GitHub page mentions that a naïve mmap approach is bottlenecked by per-page overhead. Could this be mitigated by setting up explicit "huge" pages? (2M using the CONT PTE feature if the "native" page size is 16k; 32M using a PMD-level block mapping; or 1G using the CONT PMD feature.) Does macOS support this out of the box? Alternatively, one might use a plain mmap and then something like posix_fadvise to set up prefetching of the data.
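A minimal sketch of the plain-mmap-plus-fadvise variant (Linux semantics; the file and sizes here are made-up stand-ins, not the project's actual layout, and macOS would need its own fcntl-based advisory calls instead):

```python
import mmap
import os
import tempfile

# Sketch only: map a (stand-in) weights file read-only, then use
# posix_fadvise to ask the kernel to prefetch the slice we'll need next.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * (1 << 20))  # 1 MiB stand-in for model weights
    path = f.name

fd = os.open(path, os.O_RDONLY)
size = os.fstat(fd).st_size

# Overall access-pattern hint, then an explicit prefetch of one "expert":
os.posix_fadvise(fd, 0, size, os.POSIX_FADV_SEQUENTIAL)
os.posix_fadvise(fd, 0, 256 * 1024, os.POSIX_FADV_WILLNEED)

mm = mmap.mmap(fd, size, prot=mmap.PROT_READ)
chunk = bytes(mm[:4096])  # faulting these pages in should now hit page cache
mm.close()
os.close(fd)
os.unlink(path)
```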

There are always priors; they're just "flat", uniform priors (for maximum-likelihood methods). But what "flat" means is determined by the parameterization you pick for your model, which is more or less arbitrary. Bayesians would call this an uninformative prior. And you can most likely account for stronger, more informative priors within frequentist statistics by resorting to so-called "robust" methods.

It’s not true that “there are always priors”. There are no priors when you calculate the area of a triangle, because priors are not a thing in geometry. Priors are not a thing in frequentist inference either.

You may do a Bayesian calculation that looks similar to a frequentist calculation but it will be conceptually different. The result is not really comparable: a frequentist confidence interval and a Bayesian credible interval are completely different things even if the numerical values of the limits coincide.


Frequentist confidence intervals as generally interpreted are not even compatible with the likelihood principle. There's really not much of a proper foundation for that interpretation of the "numerical values".

What does “as generally interpreted” mean? There is one valid way to interpret confidence intervals. The point is that it’s not based on a posterior probability and there is no prior probability there either.

First, there is no such thing as an ‘uninformative’ prior; it’s a misnomer. They can change drastically based on your parameterization (cf. change of variables in integration).

Second, I think the nod to robust methods is what’s often called regularization in frequentist statistics. There are cases where regularization and priors lead to the same methodology (cf. L1-regularized fits and Laplace priors), but the interpretation of the results is different. Bayesians claim they get stronger results, but that’s because they make what are ultimately unjustified assumptions. My point is that if those assumptions were fully justified, they would have to use frequentist methods.
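For concreteness, the correspondence in question (standard identity, my notation): an independent Laplace prior with scale b on each coefficient turns MAP estimation into exactly the L1-penalized (lasso) objective with penalty 1/b, so the numbers coincide even though the interpretations differ:

```latex
\hat\beta_{\mathrm{MAP}}
  = \arg\max_\beta \Big[ \log p(y \mid \beta)
      + \sum_j \log \tfrac{1}{2b}\, e^{-|\beta_j|/b} \Big]
  = \arg\min_\beta \Big[ -\log p(y \mid \beta)
      + \tfrac{1}{b} \sum_j |\beta_j| \Big]
```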


One standard way to get uninformative priors is to make them invariant under the transformation groups which are relevant given the symmetries in the problem.
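The textbook instance of this: for a pure scale parameter sigma, requiring the prior to be invariant under rescaling sigma -> c*sigma forces

```latex
p(\sigma)\,d\sigma = p(c\sigma)\,d(c\sigma)
  \quad\Longrightarrow\quad
  p(\sigma) \propto \frac{1}{\sigma}
```

i.e. a flat prior on log sigma, which is the Jeffreys prior for a scale parameter (and improper, which is part of why "uninformative" is contested upthread).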

A standard DB à la Postgres will be a perfectly functional graph database unless you're doing very specialized network-analysis queries, which is not what most of these "knowledge graph" databases are being used for. It's only the querying and data modeling that's a bit fiddly (expressing the "graph" structure using SQL), and that's being improved by the new Property Graph Queries (SQL/PGQ) in the latest SQL standard.

This is the same topic I had an intense argument about with my coworkers at the company formerly called FB, a decade ago. There is a belief that most joins are 1-2 hops deep, and that many-hop queries with reasoning are rare or non-existent.

I wonder how you reconcile the demand for multi-hop reasoning in LLMs with the statement above.

I think a lot of what is stated here is how things work today and where established companies operate.

The contradictions in their positions are plain and simple.


There are worst-case optimal algorithms for multi-way and multi-hop joins. This does not require giving up the relational model.
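To make that concrete, a toy sketch of the generic-join idea for the triangle query Q(a,b,c) = R(a,b) JOIN S(b,c) JOIN T(a,c): bind one variable at a time by intersecting candidate sets, rather than materializing any pairwise join first. This is my illustration, not any particular engine's code; real WCOJ implementations (e.g. Leapfrog Triejoin) work over sorted trie indexes rather than Python sets.

```python
# Toy worst-case-optimal ("generic") join for the triangle query,
# variable order a, b, c. Illustration only.
def triangles(R, S, T):
    R, S, T = set(R), set(S), set(T)
    A = {a for a, _ in R} & {a for a, _ in T}   # candidates for a
    out = []
    for a in sorted(A):
        # b must extend (a, b) in R and start (b, c) in S
        B = {b for x, b in R if x == a} & {b for b, _ in S}
        for b in sorted(B):
            # c must close the triangle in both S and T
            C = ({c for x, c in S if x == b}
                 & {c for x, c in T if x == a})
            out.extend((a, b, c) for c in sorted(C))
    return out

edges = [(1, 2), (2, 3), (1, 3)]
print(triangles(edges, edges, edges))  # [(1, 2, 3)]
```

The point being made above: this runs per-variable set intersections against ordinary relations, so nothing here requires abandoning relational storage.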

I maintain LadybugDB, which implements WCOJ (inherited from the KuzuDB days). So I don't disagree with the idea. It's just that it's a graph database with relational internals, and some internal warts make it hard to compose queries. Working on fixing them.

https://github.com/LadybugDB/ladybug/discussions/204#discuss...


An important test is also whether it's WCOJ on top of relational storage, or whether the compressed sparse row (CSR) representation is actually persisted to disk. The PGQ implementations don't persist it.

There are second-order optimizations that LLMs logically implement that CSR-implementing DBs don't. With sufficient funding, we'll be able to pursue those as well.


It'd be great if PG came with a serverless/embeddable mode; that's the main thing missing in comparison to this tool.

I know pglite, and while it's great that someone made it, it's definitely not the same.


I maintain a fork of pgserver (pglite with native code). It's called pgembed, and it comes with many vector and BM25 extensions.

Just in case folks here were wondering if I'm some type of a graphdb bigot.


That's coming to Postgres 19 this year. I had a brief exchange with a committer earlier this week, and it's actually available in the Postgres repo to try (you need to run your own build, of course). Very exciting development!

Offloading MoE layers to CPU inference is the easiest way, though it's a bit of a drag on performance.

Yeah, I'd just be pretty surprised if they were getting 100 tokens/sec that way.

EDIT: Either they edited that to say "quad 3090s", or I just missed it the first time.


You are correct, I did forget to add "quad". You should join us in r/localllama.

Check out what other people are getting. You're welcome.

https://www.reddit.com/r/LocalLLaMA/comments/1nunq7s/gptoss1... https://www.reddit.com/r/LocalLLaMA/comments/1p4evyr/most_ec...


Thanks for the confirmation, wasn't sure if I was just going a bit senile heh. Yeah, I love /r/localllama, some of the best actual practitioners of this stuff on the internet. Also, crazy awesome frankenrigs to try and get that many huge cards working together.

I was considering picking up a couple of the 48 GB 4090/3090s on an upcoming trip to China, but I just ended up getting one of the Max-Qs. But maybe the token throughput would still be higher with the 4090 route? Impressive numbers with those 3090s!

What's the rig look like that's hosting all that?

