I’ll be interested to see benchmarks. My expectation is that accuracy will take a hit on mid- to long-context prompts: I’d bet that the heavy use of JSON in fine-tuning will end up hurting the quality of a terser (less reasoning space) novel encoding.
I've only looked at one model (gpt-4.1-nano) so far. I'm hoping to run similar tests on other models, but it gets harder to discern statistically significant differences with stronger models, since their accuracy tends to be high across the board.
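To give a sense of the significance check I mean: with pass/fail tallies per format, a Fisher exact test is enough. Rough sketch below (the counts are made up, not real results):

```python
# Sketch: is a JSON-vs-TOON accuracy gap statistically significant?
# The tallies below are placeholders, not real benchmark numbers.
from scipy.stats import fisher_exact

json_correct, json_total = 183, 200   # hypothetical pass/fail counts
toon_correct, toon_total = 171, 200

table = [
    [json_correct, json_total - json_correct],
    [toon_correct, toon_total - toon_correct],
]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio={odds_ratio:.2f}, p={p_value:.3f}")
# With strong models both formats saturate near 100%, the gap shrinks,
# and p stays large -- which is exactly why significance is hard to
# discern on better models.
```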
Do you mean the [0] Token Benchmarks section? I only see token count numbers.
Which doesn't address the question: do LLMs understand TOON as well as they understand JSON? It's quite likely that most LLMs don't interpret this notation the way they do JSON. So benchmarks on, say, data-processing tasks would be warranted; something like the sketch below.
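A sketch only: the TOON rendering is hand-written from my reading of the spec, and gpt-4.1-nano is just the model mentioned upthread.

```python
# Sketch of a data-processing benchmark: same table as JSON vs TOON,
# same question, compare exact-match accuracy over repeated trials.
import json
from openai import OpenAI

client = OpenAI()

records = [
    {"id": 1, "name": "Alice", "score": 91},
    {"id": 2, "name": "Bob", "score": 78},
    {"id": 3, "name": "Carol", "score": 85},
]

json_doc = json.dumps({"users": records})
# TOON tabular form, written by hand from my reading of the spec.
toon_doc = "users[3]{id,name,score}:\n" + "\n".join(
    f"  {r['id']},{r['name']},{r['score']}" for r in records
)

QUESTION = "Which user has the highest score? Reply with the name only."

def ask(doc: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4.1-nano",  # the model tested upthread
        messages=[{"role": "user", "content": f"{doc}\n\n{QUESTION}"}],
    )
    return resp.choices[0].message.content.strip()

for label, doc in [("json", json_doc), ("toon", toon_doc)]:
    hits = sum(ask(doc) == "Alice" for _ in range(20))
    print(f"{label}: {hits}/20 correct")
```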
I would assume the next iterations/fine-tuned variants of current models will reach accuracy on TOON similar to what they already have on JSON.
The current models unfortunately do not have TOON in their training set, so they would probably require additional input tokens to grok the notation, and even then they likely won't match the accuracy they get with JSON.
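A rough illustration of that overhead with tiktoken; the format legend and the TOON rendering here are my own guesses, not anything official:

```python
# Count the fixed input-token cost of explaining TOON up front,
# against the tokens the terser encoding saves on the payload.
import json
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o-family tokenizer

records = [{"id": i, "name": f"user{i}", "score": 50 + i} for i in range(100)]
json_doc = json.dumps({"users": records})
toon_doc = "users[100]{id,name,score}:\n" + "\n".join(
    f"  {r['id']},{r['name']},{r['score']}" for r in records
)
# My own wording for a minimal TOON explainer prepended to the prompt.
legend = (
    "The data below is in TOON: `key[N]{fields}:` declares a table of "
    "N rows, and each following indented line is one comma-separated row.\n"
)

print("json:        ", len(enc.encode(json_doc)))
print("toon:        ", len(enc.encode(toon_doc)))
print("toon+legend: ", len(enc.encode(legend + toon_doc)))
# The legend is a fixed cost, so it amortizes on large payloads,
# but it eats into the savings on small ones.
```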
That said: I like the idea!