
Super easy to get started, but lacking for larger datasets where you want to understand a bit more about predictions. You generally lose things like prediction probability (though this can be recovered if you chop the head off and just assign output logits to classes instead of tokens), repeatability across experiments, and the ability to tune the model by changing the data. You can still do fine-tuning, though it'll be more expensive and painful than with a BERT model.

Still, you can go from 0 to ~mostly~ clean data in a few prompts and iterations, vs. potentially a few hours with a fine-tuning pipeline for BERT. The two can actually work well in tandem: bootstrap some training data with the LLM, then use both together to refine your classification.
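For example, a rough sketch of the "chop the head off" idea with a HuggingFace causal LM (the model and labels here are just placeholders, and it assumes each label encodes to a single token):

  # Sketch: recover class probabilities from a causal LM by reading the logits
  # of candidate label tokens instead of sampling free-form text.
  # Assumes each label is one token for this tokenizer; otherwise you'd sum
  # log-probs over the label's tokens or fine-tune a real classification head.
  import torch
  from transformers import AutoTokenizer, AutoModelForCausalLM

  model_name = "gpt2"  # stand-in; any causal LM works the same way
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name)
  model.eval()

  labels = [" positive", " negative"]  # leading space so each is a single GPT-2 token
  label_ids = [tokenizer.encode(l)[0] for l in labels]

  prompt = "Review: the battery died after two days.\nSentiment:"
  inputs = tokenizer(prompt, return_tensors="pt")

  with torch.no_grad():
      logits = model(**inputs).logits[0, -1]        # next-token logits
  probs = torch.softmax(logits[label_ids], dim=-1)  # renormalize over the label set

  for label, p in zip(labels, probs):
      print(f"{label.strip()}: {p:.3f}")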


> The beauty of the MOE model approach is that you can decompose the big model into a collection of smaller models that each know different, non-overlapping (at least fully) pieces of knowledge.

I was under the impression that this was not how MoE models work. They are not a collection of independent models, but rather a way of routing to a subset of active parameters at each layer. There is no "expert" that is loaded or unloaded per question. All of the weights are loaded in VRAM; it's just a matter of which ones are actually loaded into the registers for calculation. As far as I could tell from the DeepSeek v3/v2 papers, their MoE approach follows this instead of being an explicit collection of experts. If that's the case, there's no VRAM saving to be had using an MoE, nor an ability to extract the weights of an expert to run locally (aside from distillation or similar).

If there is someone more versed on the construction of MoE architectures I would love some help understanding what I missed here.
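For reference, a generic top-k router looks something like this (a toy sketch, not DeepSeek's exact design): every expert's weights stay resident, and the router only decides which ones participate in the matmuls for each token.

  # Toy top-k MoE feed-forward layer (generic, not DeepSeek's exact design).
  # All experts' weights sit in memory; the router only *computes* with k of
  # them per token, which is why MoE saves FLOPs/bandwidth rather than VRAM.
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class TopKMoE(nn.Module):
      def __init__(self, d_model=512, d_ff=1024, n_experts=8, k=2):
          super().__init__()
          self.k = k
          self.router = nn.Linear(d_model, n_experts)   # token-to-expert affinity
          self.experts = nn.ModuleList(
              nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
              for _ in range(n_experts)
          )

      def forward(self, x):                             # x: [tokens, d_model]
          weights, idx = self.router(x).topk(self.k, dim=-1)
          weights = F.softmax(weights, dim=-1)          # renormalize over the chosen k
          out = torch.zeros_like(x)
          for e, expert in enumerate(self.experts):     # only selected experts do work...
              token_ids, slot = (idx == e).nonzero(as_tuple=True)
              if token_ids.numel():
                  out[token_ids] += weights[token_ids, slot].unsqueeze(1) * expert(x[token_ids])
          return out                                    # ...but all experts stay in memory

  x = torch.randn(16, 512)      # 16 tokens
  print(TopKMoE()(x).shape)     # torch.Size([16, 512])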


Not sure about DeepSeek R1, but you are right with regard to previous MoE architectures.

It doesn’t reduce memory usage, as each subsequent token might require a different expert, but it reduces per-token compute/bandwidth usage. If you place experts on different GPUs and run batched inference, you would see these benefits.


Is there a concept of an expert that persists across layers? I thought each layer was essentially independent in terms of its "experts". I suppose you could look at which parts of each layer are most likely to trigger together and segregate those by GPU, though.

I could be very wrong on how experts work across layers though, I have only done a naive reading on it so far.


  I suppose you could look at what part of each layer was most likely to trigger together and segregate those by GPU though
Yes, I think that's what they describe in section 3.4 of the V3 paper. Section 2.1.2 talks about "token-to-expert affinity". I think there's a layer which calculates these affinities (between a token and an expert) and then sends the computation to the GPUs with the right experts.

This doesn't sound like it would work if you're running just one chat, as you need all the experts loaded at once if you want to avoid spending lots of time loading and unloading models. But at scale with batches of requests it should work. There's some discussion of this in 2.1.2 but it's beyond my current ability to comprehend!


Ahh got it, thanks for the pointer. I am surprised there is enough correlation there to allow an entire GPU to be specialized. I'll have to dig in to the paper again.


It does. They have 256 experts per MLP layer, plus some shared ones. The minimal deployment for decoding (a.k.a. token generation) they recommend is 320 GPUs (H800). It is all in the DeepSeek v3 paper, which everyone should read rather than speculating.


Got it. I’ll review the paper again for that portion. However, it still sounds like the end result is not VRAM savings but efficiency and speed improvements.


Yeah, if you look at the DeepSeek v3 paper more closely, each saving on each axis is understandable. Combined, they reach some magic number people can talk about (10x!): FP8: ~1.6 to 2x faster than BF16/FP16; MLA: cuts KV cache size by 4x (I think); MTP: converges 2x to 3x faster; DualPipe: maybe ~1.2 to 1.5x faster.

If you look deeper, many of these are only applicable to training (we already do FP8 for inference, MTP is there to improve training convergence, and DualPipe is for overlapping communication/compute, mostly for training purposes too). The efficiency improvement on inference is, IMHO, overblown.
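Back-of-the-envelope: taking midpoints of the training-side numbers (~1.8x from FP8, ~2.5x from MTP convergence, ~1.35x from DualPipe) multiplies out to roughly 6x on training cost alone, with MLA's ~4x KV-cache cut being a memory saving rather than a throughput one. That's presumably the kind of arithmetic behind the headline multiple.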


  we already do FP8 for inference
Yes, but for a given model size, DeepSeek claims that a model trained in FP8 will work better than a model quantized to FP8. If that's true, then for a given quality, a native FP8 model will be smaller and have cheaper inference.


I don't think an entire GPU is specialised, nor that a single token will use the same expert throughout. I think of it as a gather-scatter operation at each layer.

Let's say you have an inference batch of 128 chats. At layer `i` you take the hidden states, compute their routing, and scatter them (along with the KV for those layers) among the GPUs, each handling different experts; the attention and FF happen on those GPUs (since that's where the model params are) and the results get gathered again.

You might be able to avoid the gather by performing the routing on each of the GPUs, but I'm generally guessing here.
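Something like this, just to show the bookkeeping (a toy single-process sketch; real expert parallelism would use all-to-all collectives and actual expert FFNs rather than the placeholder below):

  import torch

  n_gpus, n_experts, d_model, n_tokens, k = 4, 8, 64, 128, 2
  expert_to_gpu = {e: e % n_gpus for e in range(n_experts)}  # which "GPU" owns each expert

  hidden = torch.randn(n_tokens, d_model)       # hidden states entering layer i
  affinity = torch.randn(n_tokens, n_experts)   # stand-in for the router's token-to-expert scores
  weights, topk = affinity.topk(k, dim=-1)
  weights = weights.softmax(dim=-1)

  # Scatter: bucket each (token, expert) pair by the GPU that holds that expert.
  buckets = {g: [] for g in range(n_gpus)}
  for t in range(n_tokens):
      for s in range(k):
          e = topk[t, s].item()
          buckets[expert_to_gpu[e]].append((t, e, weights[t, s].item()))

  # Each GPU would now run its resident experts' FFNs over its bucket; the expert
  # computation here is a placeholder (identity) since only the routing is shown.
  output = torch.zeros_like(hidden)
  for g, items in buckets.items():
      for t, e, w in items:
          expert_out = hidden[t]                # placeholder for expert e's output on GPU g
          output[t] += w * expert_out           # gather: weighted combine back into token order

  print(output.shape)   # torch.Size([128, 64])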


  If you place experts in different GPUs
Right, this is described in the Deepseek V3 paper (section 3.4 on pages 18-20).


7950X3D w/ 128GB at stock timings (~3200 MT/s?). Showed a high baseline but no increase with threads; need to investigate what is happening.

  1 54.7
  2 50.6
  3 49.6
  4 48.2
  5 47.9
  6 47.4
  7 47.1
  8 46.6
  9 46.5
  10 46.2
  11 46.1
  12 45.9
  13 45.8
  14 45.7
  15 45.7
  16 45.7
  17 45.7
  18 45.8
  19 45.9
  20 45.8
  21 45.8
  22 45.6
  23 45.6
  24 45.5
  25 45.5
  26 45.5
  27 45.5
  28 45.4
  29 45.4
  30 45.4
  31 45.4
  32 45.4


Ryzen 9 7900 with 2x48GB. AMD recommends DDR5-6000, so I underclocked my DDR5-6400 to DDR5-6000. The 96GB of RAM was under $300.

$ ./a.out

  1 59.2 
  2 76.9 
  3 66.8 
  4 68.7 
  5 64.0 
  6 67.1 
  7 63.9 
  8 66.0 
  9 64.0 
  10 65.6 
  11 64.0 
  12 65.5 
  13 65.6 
  14 66.0 
  15 66.0 
  16 65.8 
  17 65.1 
  18 65.2 
  19 65.1 
  20 65.4 
  21 64.6 
  22 65.2 
  23 65.5 
  24 65.2


7950X with 4x32GB @ 3800MHz here. Getting similar results, except my single-thread performance is the worst (around 40); otherwise it looks almost identical. I assume my poor single-thread perf is due to having capped the max CPU core clock at 4.5GHz. Getting 4 sticks to boot was a pain, so I haven't really bothered to change anything since I finally got it running stable.


7950x with 32GB @ 6000MT/s:

    1 52.6
    2 78.1
    3 71.0
    4 74.3
    5 71.3
    6 72.5
    7 70.0
    8 69.6
    9 68.4
    10 68.7
    11 68.5
    12 68.3
    13 68.3
    14 68.0
    15 67.8


For 2 x 3200 MT/s, 51.2 GB/s is the theoretical limit. I guess your single core is good enough to use all the bandwidth. How many RAM sticks do you have?
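(That figure presumably comes from 2 channels x 8 bytes per transfer x 3,200 MT/s = 51.2 GB/s.)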


4 x 32GB for now. I need to investigate manual OC as EXPO doesn't work with all 4 slots populated. Another option is to try the 192GB 4x48GB Corsair kits at 5200.


It is a modified version of what we use at Yipit. I mostly stripped out things related to working with Django/Flask apps. It comes in very handy for making sure we maintain style and push fewer broken commits. Feel free to re-use it.


Good point, I will have to swap out some testing libs but should be possible.


Thanks for sharing. Nice to see an alternative approach for defining these sorts of things.


Happy to - note that the syntax in the search test is mostly just a shortcut (though I fully expect that to be the main interface). You can also use the objects for individual query types: https://github.com/elasticsearch/elasticsearch-dsl-py/blob/m...
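For instance, roughly like this (a sketch from memory; exact names may differ between versions):

  # Composing Elasticsearch queries from objects rather than hand-built JSON;
  # field names and index here are illustrative.
  from elasticsearch_dsl import Search, Q

  # Build query objects explicitly instead of the keyword-argument shortcut.
  q = Q("match", title="python") & ~Q("match", description="beta")
  s = Search(index="blog").query(q).filter("term", published=True)

  print(s.to_dict())   # the generated Elasticsearch query body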


Fixed. Completely missed that.


Thanks, checking this out.


Thanks for sharing this, didn't know about it. Always nice to see how other people tackle the same issue.


Putting together and composing queries in ES is error-prone, and I have grown quite allergic to libraries that just write off the query/filter API ES provides as "json blob lol lol".

You can lose massive amounts of time debugging a broken query in ES, due in no small part to the lack of an explicit spec.

To respond to the mis-aimed comment:

It's not just about the initial learning curve - it's a problem the moment you want to properly compose queries and not just have dumb templates.

I am not new to ES, I've been using it for years.

I understand wanting directness and the full breadth of the Elasticsearch API (thus moving away from Haystack), but not actually supporting anything in the API (which is the actual hard work, not wrapping an HTTP client) is problematic.


Are you referring to the drive to use more of a language's built-in operators to build queries, instead of my approach? I think the Mozilla library uses an approach like that.

http://elasticutils.readthedocs.org/en/latest/


Yeah that looks to be closer to what I'd like, although it's still not type-safe.


Agreed. I had a rough time when I was first learning how to manually build queries as there are few examples of complex queries.


We used to use Haystack, but found it a bit too opinionated for us once we wanted to do some custom stuff. It is a bit more faithful to the Django queryset API, something we had to abandon to let us use more of the complex Elasticsearch query features.

