Using grammar-constrained output in llama.cpp - which has been available for ages and is, I think, a different implementation from the one described here - does slow down generation quite a bit. I expect the implementation is naive.
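For a sense of what "naive" means here, this is roughly the shape of per-token constraining: at every decoding step you ask the grammar parser about each vocabulary entry and mask out the ones it would reject. The `accepts` callback and the shapes are my own illustration, not llama.cpp's actual code.

```python
# Illustrative sketch only: `accepts(state, token_text)` stands in for whatever
# parser check the runtime really does; it is not llama.cpp's API.
import numpy as np

def masked_logits(logits, parser_state, vocab, accepts):
    """Suppress every token the grammar would reject from the current state."""
    mask = np.full_like(logits, -np.inf)
    for token_id, token_text in enumerate(vocab):  # one parser query per vocab entry, per step
        if accepts(parser_state, token_text):
            mask[token_id] = 0.0
    return logits + mask
```

With a 32k+ entry vocabulary and one parser query per entry per step, it's easy to see where the slowdown comes from.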
As to why providers don't give you a nice API, maybe it's hard to implement efficiently.
It's not too bad if inference is happening token by token and returning to the CPU every time, but I understand high-performance LLM inference uses speculative decoding, with a smaller model guessing multiple tokens in advance and the main model doing verification. Doing grammar constraints across multiple tokens is tougher: precomputing the legal token masks for every possible multi-token continuation means an exponential number of states.
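The awkward part is that checking a speculated draft against the grammar is inherently sequential: each token's legality depends on the automaton state left by the previous one. A minimal sketch, with `step` standing in for an assumed state-transition function (not a real API):

```python
# Sketch of checking a speculative draft against the grammar; `step(state, token)`
# is an assumed transition function returning the next state, or None on reject.
def verify_draft(state, draft_tokens, step):
    accepted = []
    for tok in draft_tokens:      # serial: each check depends on the previous state
        nxt = step(state, tok)
        if nxt is None:           # grammar rejects here, so truncate the draft
            break
        accepted.append(tok)
        state = nxt
    return accepted, state
```

Unless those transitions are precomputed or run on the device, every draft verification is another trip back to the CPU.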
So you'd need to think about putting the parser automaton onto the GPU/TPU and using it during inference without stalling the pipeline by going back to the CPU.
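One way to picture that, assuming the grammar can be approximated by a finite-state automaton: keep the transition function and the per-state legal-token mask as dense tensors in device memory, so constraining a step is just a gather and a masked argmax. Names and sizes below are illustrative assumptions, not an existing implementation.

```python
# Hypothetical layout for a finite-state (regular) approximation of the grammar,
# kept in device memory; numpy here, but the point is the tensors never leave the GPU.
import numpy as np

NUM_STATES, VOCAB = 4096, 32000                               # made-up sizes
transition = np.zeros((NUM_STATES, VOCAB), dtype=np.int32)    # next state for (state, token)
token_mask = np.zeros((NUM_STATES, VOCAB), dtype=bool)        # legal tokens per state

def constrained_step(logits, state):
    masked = np.where(token_mask[state], logits, float("-inf"))  # mask illegal tokens
    token = int(masked.argmax())                                  # greedy pick, for illustration
    return token, int(transition[state, token])                   # follow the automaton
```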
And then you start thinking about how big that automaton is going to be: how many states, whether it needs a pushdown stack. You're basically taking code from the API call and running it on your hardware. There are dragons here, around fair use, denial of service, etc.
There's also a grammar validation tool in the default llama.cpp build, which makes debugging grammars much easier than bouncing them off the server.