
Using grammar-constrained output in llama.cpp - which has been available for ages, and I think is a different implementation from the one described here - does slow down generation quite a bit. I expect it has a naive implementation.
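For concreteness, here's roughly what that path looks like through the llama-cpp-python bindings (a sketch, not the implementation discussed in the article; the model path is a placeholder and the exact API may differ between versions):

    # Sketch of grammar-constrained generation via llama-cpp-python
    # (model path is a placeholder; API details may vary by version).
    from llama_cpp import Llama, LlamaGrammar

    # Tiny one-rule GBNF grammar: the model may only answer "yes" or "no".
    GBNF = 'root ::= "yes" | "no"'

    llm = Llama(model_path="model.gguf")        # placeholder path
    grammar = LlamaGrammar.from_string(GBNF)    # parsed once, reused per call

    out = llm(
        "Is the sky blue? Answer yes or no: ",
        grammar=grammar,   # tokens that would break the grammar get masked out
        max_tokens=4,
    )
    print(out["choices"][0]["text"])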

As to why providers don't give you a nice API, maybe it's hard to implement efficiently.

It's not too bad if inference happens token by token, returning to the CPU every time, but I understand high-performance LLM inference uses speculative decoding, with a smaller model guessing multiple tokens in advance and the main model doing verification. Doing grammar constraints across multiple tokens is tougher: there's an exponential number of states that need precomputing.
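To illustrate (toy three-state automaton made up for the example, nothing to do with llama.cpp's actual parser): verifying a k-token draft means you need the grammar's allowed-token mask at every draft position, not just the next one, so something has to walk the parser along each prefix - or precompute masks for every possible prefix, which blows up as |vocab|^k.

    # Toy sketch (made-up 3-state automaton, not llama.cpp's implementation).
    # With speculative decoding the big model scores a whole k-token draft in
    # one batch, so a grammar mask is needed at every draft position. Walking
    # the parser along the draft like this is inherently sequential CPU work.

    # Regular grammar ("yes" | "no") "." over a toy vocabulary.
    ALLOWED = {0: {"yes", "no"}, 1: {"."}, 2: set()}     # state -> legal tokens
    STEP = {(0, "yes"): 1, (0, "no"): 1, (1, "."): 2}    # state transitions

    def masks_for_draft(draft, state=0):
        """Allowed-token set for each position of a speculative draft."""
        masks = []
        for tok in draft:
            masks.append(ALLOWED[state])     # mask the big model needs here
            if tok not in ALLOWED[state]:    # draft already broke the grammar;
                break                        # later positions will be rejected
            state = STEP[(state, tok)]
        return masks

    print(masks_for_draft(["yes", ".", "maybe"]))   # third mask is empty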

So you'd need to think about putting the parser automaton onto the GPU/TPU and using it during inference without stalling the pipeline by going back to the CPU.

And then you start thinking about how big that automaton is going to be: how many states, whether it needs a pushdown stack. You're basically taking code from the API call and running it on your hardware. There are dragons here, around fair use, denial of service, etc.
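On the pushdown part (a made-up toy recognizer, not llama.cpp's parser): for anything context-free - JSON, code - the parser state includes a stack, so there's no fixed, finite state set you can enumerate up front and turn into a precomputed mask table.

    # Toy pushdown recognizer for nested braces (made up for illustration).
    # The effective state is (rule position, stack contents), and the stack
    # grows with nesting depth, so the reachable state set isn't bounded and
    # a per-state mask table can't simply be precomputed and shipped to the GPU.

    def allowed_next(stack):
        # Inside at least one "{" you may open, close, or emit an item;
        # at the top level you may only open.
        return {"{", "x", "}"} if stack else {"{"}

    def step(stack, tok):
        if tok == "{":
            return stack + ["{"]
        if tok == "}":
            return stack[:-1]
        return stack

    stack = []
    for tok in ["{", "{", "x", "}", "}"]:
        assert tok in allowed_next(stack)
        stack = step(stack, tok)
    print("balanced:", stack == [])   # True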



A guide on llama.cpp's grammars (nine hours and not a single mention of "GBNF"? HN is slipping) is here:

https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...

There's also a grammar validation tool in the default llama.cpp build, which is much easier to reason about for debugging grammars than having them bounce off the server.


If your masking is fast enough, you can easily make it work with spec dec too :). We manage to keep this on CPU. Some details here: https://github.com/guidance-ai/llguidance/blob/main/docs/opt...
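For anyone wondering what "fast enough" has to cover, a sketch of the general shape (not llguidance's actual code): once the grammar has produced the allowed-token set for the current parser state, applying it to the logits is a single O(vocab) vector op per step, and that whole step can stay on the CPU.

    # Sketch of CPU-side mask application (not llguidance's actual code).
    # Once the grammar has produced an allowed-token bitmask for the current
    # parser state, applying it before sampling is one O(vocab) vector op.
    import numpy as np

    VOCAB_SIZE = 32000
    logits = np.random.randn(VOCAB_SIZE).astype(np.float32)  # from the model
    allowed = np.zeros(VOCAB_SIZE, dtype=bool)                # from the grammar
    allowed[[42, 1337, 2023]] = True                          # toy: 3 legal tokens

    masked = np.where(allowed, logits, -np.inf)               # ban illegal tokens
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    next_token = int(np.random.choice(VOCAB_SIZE, p=probs))   # sample a legal token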



