This blog post makes some good points about using vision models for retrieval, but I do want to call out a few problems:
1. The blog conflates indexing/retrieval with document parsing. Document parsing is the task of converting a document into a structured text representation, whether that's markdown/JSON or (in the case of extraction) an output that conforms to a schema. It has many uses; RAG is one of them, but many are not RAG-related at all.
ColPali is great for retrieval, but you can't use ColPali (at least natively) for pure document parsing tasks. There are a lot of separate benchmarks just for evaluating doc parsing, while the author mostly talks about visual retrieval benchmarks.
2. This whole idea of "You can DIY document parsing by screenshotting a page" is not new at all; lots of people have been talking about it! It's certainly fine as a baseline and does work better than standard OCR in many cases (rough sketch after this list).
a. But from our experience there's still a long tail of accuracy issues.
b. It's missing metadata like confidence scores, bounding boxes, etc. out of the box.
c. Honestly this is underrated, but building a good screenshotting pipeline itself is non-trivial.
3. In general for retrieval, it's helpful to have both text and image representations. Image tokens are obviously much more powerful, but text tokens are way cheaper to store and let you do things like retrieve entire documents (instead of just chunks) and feed them into the LLM.
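On point 2, here's a rough sketch of that DIY baseline: render each page to an image and ask a vision model for markdown. This assumes PyMuPDF for rendering and an OpenAI-style vision endpoint; the model name, prompt, and file path are placeholders, not any particular product's pipeline:

```python
# Rough sketch of the "screenshot a page and ask a VLM for markdown" baseline.
# Assumes PyMuPDF for rendering and an OpenAI-style vision-capable chat model;
# the model name, prompt, and file path are illustrative only.
import base64
import fitz  # PyMuPDF
from openai import OpenAI

client = OpenAI()

doc = fitz.open("report.pdf")
pages_md = []
for page in doc:
    pix = page.get_pixmap(dpi=200)  # render the page to an image
    b64 = base64.b64encode(pix.tobytes("png")).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this page to markdown. Preserve tables."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    pages_md.append(resp.choices[0].message.content)

print("\n\n".join(pages_md))
```

This is exactly where points 2a-2c bite: it can look great on clean pages while still missing confidence scores/bounding boxes and hiding a long tail of transcription errors.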
(disclaimer: I am CEO of LlamaIndex, and we have worked on both document parsing and retrieval with LlamaCloud, but I hope my point stands in a general sense)
(Disclaimer: I am CEO of LlamaIndex, which includes LlamaParse.)
Nice article! We're actively benchmarking Gemini 2.0 right now and if the results are as good as implied by this article, heck we'll adapt and improve upon it. Our goal (and in fact the reason our parser works so well) is to always use and stay on top of the latest SOTA models and tech :) - we blend LLM/VLM tech with best-in-class heuristic techniques.
Some quick notes:
1. I'm glad that LlamaParse is mentioned in the article, but it's not included in the performance benchmarks. I'm pretty confident that our most accurate modes would be at the top of that benchmark table - our stuff is pretty good.
2. There's a long tail of issues beyond just tables - this includes fonts, headers/footers, the ability to recognize charts/images/form fields, and, as other posters said, the ability to produce fine-grained bounding boxes on the source elements. We've optimized our parser to tackle all of these cases, and we need proper benchmarks for that.
3. DIY'ing your own pipeline to run a VLM at scale to parse docs is surprisingly challenging. You need to orchestrate a robust system that can screenshot a bunch of pages at the right resolution (which can be quite slow), tune the prompts, and make sure you're obeying rate limits and can retry on failure (rough sketch below).
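For point 3, this is roughly the operational wrapper you end up writing around the per-page VLM call: bounded concurrency plus retry with backoff. `parse_page` is a hypothetical stand-in for whatever screenshot-to-markdown call you use:

```python
# Illustrative sketch of the orchestration you need at scale: a global
# concurrency cap plus retries with exponential backoff for rate limits
# and transient failures. parse_page() is a hypothetical per-page VLM call.
import asyncio
import random

MAX_CONCURRENCY = 8
semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

async def parse_page(page_image: bytes) -> str:
    ...  # your per-page VLM call, e.g. the screenshot-to-markdown sketch earlier

async def parse_page_with_retry(page_image: bytes, attempts: int = 5) -> str:
    async with semaphore:  # obey a global concurrency cap
        for attempt in range(attempts):
            try:
                return await parse_page(page_image)
            except Exception:
                if attempt == attempts - 1:
                    raise
                # exponential backoff with jitter for rate limits / flaky calls
                await asyncio.sleep(2 ** attempt + random.random())

async def parse_document(page_images: list[bytes]) -> list[str]:
    return await asyncio.gather(*(parse_page_with_retry(img) for img in page_images))
```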
The very first (and probably hand-picked & checked) example on your website [0] suffers from the very problem people are talking about here - in the "Fiscal 2024" row there's an error in the CEO CAP column. The image says "$234.1" but the parsed result says "$234.4". A small error, but an error nonetheless. I wonder if we can ever fix this kind of error with LLM parsing.
I'm a happy customer. I wrote a Ruby client for your API and have been parsing thousands of different types of PDFs through it with great results. I tested almost everything out there at the time and I couldn't find anything that came close to being as good as LlamaParse.
Indeed, this is also my experience. I have tried a lot of things, and where quality is more important than quantity, I doubt there are many tools that come close to LlamaParse.
All your examples are exquisitely clean digital renders of digital documents. How does it fare with real scans (noise, folds) or photos? Receipts?
Or is there a use case for digital non-text pdfs? Are people really generating image and not text-based PDFs? Or is the primary use case extracting structure, rather than text?
How well does llamaparse work on foreign-language documents?
I have a pipeline for Arabic-language docs that uses Azure for OCR and GPT-4o-mini to extract structured information. Would it be worth trying LlamaParse to replace part of the pipeline, or the whole thing?
Wait, do you have specific examples of "overengineering and overabstracting" from LlamaIndex? Very open to feedback and suggestions on improvement - we've put a lot of work into making sure everything is customizable.
Thanks for running through the benchmark! Just to clarify some things:
(1) The idea is that LlamaParse's markdown representation lends itself to the rest of LlamaIndex's advanced indexing/retrieval abstractions. Recursive retrieval is a fancy retrieval method designed to model documents with embedded objects, but it depends on good PDF parsing - naive PyPDF parsing can't be used with recursive retrieval (rough sketch after these notes). Our goal is to demonstrate the e2e RAG capabilities of LlamaParse + advanced retrieval vs. what you can build with a naive PDF parser.
(2) Since we use LLM-based evals, your correctness and relevancy metrics look to be consistent and within the margin of error (and lower than our LlamaParse metrics). The faithfulness score seems way off though, and quite high on your side, so I'm not sure what's going on there. Maybe hop into our Discord and share the results in our channel?
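To make (1) concrete, here's a hand-wavy sketch of the recursive retrieval pattern over a parsed document: table chunks are represented by `IndexNode` summaries that point at per-table query engines, so a hit on a summary recurses into the underlying table. Imports follow a recent LlamaIndex release (they have moved between versions), and `text_nodes`, `table_summaries`, and `table_query_engines` are assumed to come from your parsing step:

```python
# Sketch of recursive retrieval over a parsed document. Assumes you already
# have: text_nodes (regular parsed-markdown chunks), table_summaries (one
# summary string per extracted table), and table_query_engines (one query
# engine per table) - all produced upstream from a good parse.
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import IndexNode
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

# IndexNodes act as pointers: retrieving the summary recurses into the table.
index_nodes = [
    IndexNode(text=summary, index_id=f"table-{i}")
    for i, summary in enumerate(table_summaries)
]
vector_index = VectorStoreIndex(text_nodes + index_nodes)

retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_index.as_retriever(similarity_top_k=5)},
    query_engine_dict={f"table-{i}": qe for i, qe in enumerate(table_query_engines)},
)
query_engine = RetrieverQueryEngine.from_args(retriever)
```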
We love the feedback, and one main point especially seems to be around making the docs better:
- Improve the organization to better expose both our basic and our advanced capabilities
- Improve the documentation around customization (from LLMs to retrievers, etc.)
- Improve the clarity of our examples/notebooks.
100%, if the API itself can choose to call a function or an LLM, then it's way easier to build any agent loop without extensive prompt engineering + worrying about errors.
You still have to worry about errors. You will probably have to add an error-handler function that it can call out to; otherwise the LLM will hallucinate a valid-looking output regardless of the input. You want it to be able to throw an error and say it couldn't produce the output for the given input (sketch below).
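One hedged sketch of what that looks like with an OpenAI-style tool-calling API: expose an explicit `report_error` tool so the model has a sanctioned way to refuse. The tool names, model string, and example task are all placeholders:

```python
# Sketch: give the model an explicit "report_error" tool so it can decline
# instead of hallucinating a plausible-looking answer. Tool names and the
# model string are placeholders, not any specific product's schema.
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_invoice_total",
            "description": "Look up the total amount for an invoice ID.",
            "parameters": {
                "type": "object",
                "properties": {"invoice_id": {"type": "string"}},
                "required": ["invoice_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "report_error",
            "description": "Call this if the request cannot be satisfied with the available tools or input.",
            "parameters": {
                "type": "object",
                "properties": {"reason": {"type": "string"}},
                "required": ["reason"],
            },
        },
    },
]

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the total for invoice ABC-123?"}],
    tools=tools,
    tool_choice="auto",
)

msg = resp.choices[0].message
if msg.tool_calls and msg.tool_calls[0].function.name == "report_error":
    print("Model declined:", msg.tool_calls[0].function.arguments)
```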
Hi all - Jerry (co-founder/CEO) here, happy to answer any questions you might have!
We're building a data framework to unlock the full capabilities of LLMs on top of your private data. We can't wait for the future - this space is moving so rapidly and there are so many things we want to do on both the open-source and enterprise side.
Feel free to shoot me a personal note on Twitter/Discord as well.
Depending on what questions you're asking, you could check out LlamaIndex query capabilities - you can define different index structures for different queries and plug into your LangChain workflow: https://gpt-index.readthedocs.io/en/latest/use_cases/queries...
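A minimal sketch of that query path with a recent LlamaIndex release (imports have moved around between versions, so defer to the docs link); the data directory and question are placeholders:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# load your private docs and build one of several possible index structures
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
print(query_engine.query("What does the Q3 report say about revenue?"))
```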