More

mattcollins · 2025-10-28T09:31:41 1761643901

FWIW, I ran a test comparing LLM accuracy with TOON versus JSON, CSV and a variety of other formats when using them to represent tabular data: https://www.improvingagents.com/blog/is-toon-good-for-table-...

I've only looked at one model (gpt-4.1-nano) so far. I'm hoping to run similar tests on some other models but it gets challenging to discern statistically significant differences with better models as their accuracy tends to be a lot better across the board.

mattcollins · 2025-10-29T11:50:53 1761738653

Results from some further tests here: https://www.improvingagents.com/blog/toon-benchmarks

mattcollins · 2025-10-16T14:41:40 1760625700

This is a follow-up to previous work looking at which format of TABULAR data LLMs understand best: https://www.improvingagents.com/blog/best-input-data-format-...

(There was some good discussion on Hacker News around that here: https://news.ycombinator.com/item?id=45458455)

We often want to feed NON-TABULAR data to LLMs, though, such as typical API responses or config files.

This new work looks out how the format of such nested / hierarchical data affects how well LLMs can answer questions about it; specifically how several models get on with JSON, YAML, XML and Markdown.

mattcollins · 2025-10-14T15:32:55 1760455975

Here you go: https://www.improvingagents.com/blog/best-input-data-format-...

mattcollins · 2025-10-06T08:56:33 1759740993

Author here.

This has made me chuckle several times - thanks!

mattcollins · 2025-10-05T19:21:20 1759692080

I'm the person who ran the test.

To hopefully clarify a bit...

I intentionally chose input data large enough that the LLM would be scoring in the region of 50% accuracy in order to maximise the discriminative power of the test.

__mharrison__ · 2025-10-06T03:18:45 1759720725

Can you expand on how you did this?

mattcollins · 2025-10-06T08:47:49 1759740469

I did a small test with just a couple of formats and something like 100 records, saw that the accuracy was higher than I wanted, then increased the number of records until the accuracy was down to 50%-ish (e.g. 100 -> 200 -> 500 -> 1000, though I forget the precise numbers.)

mattcollins · 2025-10-05T18:59:42 1759690782

I'm the person who ran the test.

To explain the 60% a bit more...

With small amounts of input data, the accuracy is near 100%. As you increase the size of the input data, the accuracy gradually decreases.

For this test, I intentionally chose an input data set large enough that the LLM would score in the region of 50% accuracy (with variation between formats) in order to maximise the discriminative power of the test.

padolsey · 2025-10-06T05:13:04 1759727584

Thanks for your work on this! It's a very legit domain of problem for LLMs to optimize for. I've produced a comprehensive eval based on your post and run it against 30 models, each tasked with recalling specific data from 500 rows in different tabular formats. Have a look at the results here: https://weval.org/analysis/table-format-sensitivity__combine...

As you can see it's near 100% recall across all formats for a good chunk of frontier models, with a few (curiously, mostly Claude) failing a basic prompt adherance ("Return just the number") but still returning the right answers. The major failures are from Mistral Medium, Llama Maverick, Llama 3 70b Instruct, Mistral Nemo, Gemma 3 12b It, GPT 4o/4.1 Mini etc.

Based on these limited tests, here's the leaderboards on formats FWIW:

    CSV: 84.25%
    Markdown Table: 82.65%
    YAML: 81.85%
    JSON Lines (jsonl): 79.85%
    Markdown key-value: 79.83%
    Pipe-delimited: 79.45%
    Natural language summary: 78.65%
    JSON: 77.73%
    HTML table: 75.80%
    XML: 73.80%

So, the biggest takeaway really is: Use the best model you can reasonably afford, then format will matter less. The cheapest 100% coverage models are Gemini 2.5 Flash and Deepseek Chat V3.1

And if you have no control over model, then use CSV or Markdown Table.

ysleepy · 2025-10-05T20:41:39 1759696899

Wouldn't it be more useful to measure the number of rows the model can process while still hitting 100% accuracy?

rovr138 · 2025-10-06T00:21:00 1759710060

> As you increase the size of the input data, the accuracy gradually decreases.

Interesting.

On your section "Limitations and Areas for Further Study",

What I'd be curious on future work would be,

    - changing the order of the data on each table type
    - changing the order of the questions

I'm curious to know if what it fails is the same, if it changes depending on the location, if it's a bias.

Is it always a specific question? Is it always a specific value? Is it always question #x (or around question #x?). Does it tend towards x or y on types of questions?

Good idea

CuriouslyC · 2025-10-06T03:34:28 1759721668

LLMs have documented position biases, with skew towards first and last. This is strongest in messages due to system prompt + current question training data, but it's present in list data in general.

rovr138 · 2025-10-06T07:15:11 1759734911

Exactly. But the papers I’ve seen, the tests are done based on answers being multiple choice usually.

    Where do you eat?
    A) floor
    B) table
    C) dirt

In this case, the questions asked have an answer. The bias would then be on the order of the input data. It’s different enough that it triggered my curiosity.

CuriouslyC · 2025-10-06T16:06:11 1759766771

https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00638...

Redster · 2025-10-06T17:32:26 1759771946

Thank you for including the tokens needed for each test.

It looks to me that the concisest way of representing each of these tables was a CSV and then a standard markdown table. The amount of tokens appears to be 1/2 or 1/3 of the other options. For experiments not in mice (GPT-4.1-nano), but in larger models or larger context aside from the data table itself, my guess is that preserving context is might be higher value than having the higher-LLM-legibility of the Markdown-KV.

mattcollins · 2025-10-05T18:38:55 1759689535

I'm the person who ran the test.

The context I used in the test was pretty large. You'll see much better (near 100%) accuracy if you're using smaller amounts of context.

[I chose the context size so that the LLM would be scoring in the ballpark of 50% accuracy (with variation between formats) to maximise the discriminative power of the test.]

mattcollins · 2025-07-02T15:33:43 1751470423

"This feature is available to all customers, meaning anyone can enable this today from the Cloudflare dashboard."

https://blog.cloudflare.com/control-content-use-for-ai-train...

mattcollins · 2025-07-02T15:27:59 1751470079

I wondered about this, too.

Cloudflare have some recent data about traffic from bots (https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-cr...) which indicates that, for the time being, the overwhelming majority of the bot requests are for AI training and not for RAG.

mattcollins · on Sept 26, 2024

I noticed that, too. It does seem 'odd'.